I'm looking for a limited form of version control in the database here:
- Size is of greatest importance: many revisions of the same file should occupy the smallest space possible (I'm not looking for compression since the data is already compressed)
- Computational requirements are secondary
- I should be able to fetch the current revision of the document as fast as possible (fetching older versions is not time-critical)
Basically answers should contain at least two things:
- What binary diff algorithm would you use?
- How would you structure this system in a way specific to PostgreSQL?
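For the PostgreSQL side, one possible layout (table and column names here are purely illustrative, not from any existing schema) is to keep the newest revision of each document whole and store older revisions only as binary patches against their successors:

```sql
-- Hypothetical schema sketch: fetching the current revision is a
-- single indexed lookup on "document"; older revisions are rebuilt
-- by walking patches backwards from the head.
CREATE TABLE document (
    id      serial PRIMARY KEY,
    name    text   NOT NULL,
    current bytea  NOT NULL          -- full bytes of the latest revision
);

CREATE TABLE document_patch (
    document_id integer NOT NULL REFERENCES document(id),
    revision    integer NOT NULL,    -- 0 = oldest
    patch       bytea   NOT NULL,    -- rebuilds this revision from revision + 1
    PRIMARY KEY (document_id, revision)
);
```

The patch bytes themselves would come from whatever binary diff tool you settle on; the schema only fixes where they live.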
"Size is of greatest importance": how about an external diff tool (like bsdiff) called via PL/sh, for example?
"I should be able to fetch the current revision of the document as fast as possible": In which case you will want to do your diff the 'wrong' way round, so each revision would involve:
- replace 'previous revision' with diff between 'new revision' and 'previous revision'
- add 'new revision'
To get back to an old revision would then require iteratively applying previous diffs as patches until you get to the revision you need.
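The reverse-delta scheme above can be sketched in a few lines. Note this is a minimal illustration, not production code: the diff format here (common prefix/suffix plus replaced middle) is a deliberately naive placeholder standing in for a real binary diff tool such as bsdiff; the storage logic is the point.

```python
def make_patch(new: bytes, old: bytes) -> tuple:
    """Return a patch that reconstructs `old` from `new` (naive stand-in
    for a real binary diff: common prefix, common suffix, replaced middle)."""
    p = 0
    while p < min(len(new), len(old)) and new[p] == old[p]:
        p += 1
    s = 0
    while (s < min(len(new), len(old)) - p
           and new[len(new) - 1 - s] == old[len(old) - 1 - s]):
        s += 1
    return (p, s, old[p:len(old) - s])

def apply_patch(new: bytes, patch: tuple) -> bytes:
    p, s, middle = patch
    tail = new[len(new) - s:] if s else b""
    return new[:p] + middle + tail

class ReverseDeltaStore:
    """Newest revision is stored whole; each older revision keeps only a
    patch against its successor, so fetching the head is a single read."""
    def __init__(self, first: bytes):
        self.head = first      # full bytes of the newest revision
        self.patches = []      # patches[i] rebuilds revision i from i + 1

    def commit(self, new: bytes) -> None:
        # Replace the old full copy with a diff, then store the new head.
        self.patches.append(make_patch(new, self.head))
        self.head = new

    def fetch(self, rev: int) -> bytes:
        # rev == len(self.patches) is the head; older revisions are
        # recovered by applying patches backwards, one per step.
        data = self.head
        for i in range(len(self.patches) - 1, rev - 1, -1):
            data = apply_patch(data, self.patches[i])
        return data
```

Fetching the current revision costs one read; fetching revision k costs one patch application per step back from the head, which matches the stated priorities.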
Whatever you do, I think you will need to uncompress the data first before using the diff tool. Here's why:
dd if=/dev/urandom of=myfile.1 bs=1024 count=10
cp myfile.1 tmp; cat tmp >> myfile.1
cp myfile.1 tmp; cat tmp >> myfile.1
cp myfile.1 tmp; cat tmp >> myfile.1
cp myfile.1 tmp; cat tmp >> myfile.1
dd if=/dev/urandom of=myfile.2 bs=1024 count=10
cp myfile.2 tmp; cat tmp >> myfile.2
cp myfile.2 tmp; cat tmp >> myfile.2
cp myfile.2 tmp; cat tmp >> myfile.2
cp myfile.2 tmp; cat tmp >> myfile.2
cat myfile.1 >> myfile.2
bsdiff myfile.1 myfile.2 diff
gzip -c myfile.1 > myfile.1.gz
gzip -c myfile.2 > myfile.2.gz
bsdiff myfile.1.gz myfile.2.gz gz.diff
rm tmp
ls -l
-rw-r--r-- 1 root root 17115 2011-04-05 10:54 diff
-rw-r--r-- 1 root root 21580 2011-04-05 10:54 gz.diff
-rw-r--r-- 1 root root 163840 2011-04-05 10:54 myfile.1
-rw-r--r-- 1 root root 11709 2011-04-05 10:54 myfile.1.gz
-rw-r--r-- 1 root root 327680 2011-04-05 10:54 myfile.2
-rw-r--r-- 1 root root 23399 2011-04-05 10:54 myfile.2.gz
Note that gz.diff is larger than diff; if you try this with real files I expect the difference to be even larger.
I really dislike reinventing the wheel. For storage-space optimization, people far smarter than me have already worked out solutions, and I'd prefer to leverage their hard work when possible. With that said, I might consider storing my files in a revision control system such as Mercurial or Git, once I understand how they store binary data. Once you figure out which one you want to use, you can write stored functions (most likely in PL/Perl or a similar language) that interact with the version control system and bridge the gap between your relational data in PostgreSQL and the binary files.
My only issue with this approach is that it takes a transactional system and introduces an outside system (Mercurial/Git) into it. On top of that, a backup of the database won't back up the Mercurial or Git repository. But there will always be trade-offs, so just figure out which ones you can live with.