How do compression utilities add files sequentially to a compressed archive?

For example, when you tar -zcvf a directory, you can see a list of files being added sequentially to the final gzip file.

But how does that happen?

Any compression algorithm, at the most basic level, exploits redundancy in the data to represent it more compactly and hence save space.

But when file n is being added, a representation has already been chosen for the first n - 1 files, and that representation might not be the optimal one, because until file n came along we couldn't know what the best way was.

Am I missing something? If not, does this mean that all these compression algorithms choose a sub-optimal representation of the data?


In gzip, the redundancy it can exploit is restricted to a specific window size (32 KB by default, if I remember right). That means that as soon as uncompressed data has passed out of that window, the compressed output for it can already be written.

You could call that "suboptimal", but the benefits provided, such as the ability to stream, and possibly error recovery (if there are synchronisation marks between windows; not sure how gzip works here), are worth it.
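Here's a minimal Python sketch of that streaming behaviour, using the standard zlib module (which implements the same DEFLATE format gzip uses); the chunk contents below are made up purely for illustration:

import zlib

def compress_stream(chunks):
    # wbits=16 + 15 selects gzip framing with the maximum 32 KB window,
    # the same format the gzip utility writes.
    compressor = zlib.compressobj(level=9, wbits=16 + 15)
    for chunk in chunks:
        out = compressor.compress(chunk)
        if out:                      # hand compressed bytes on as soon as they are produced
            yield out
    yield compressor.flush()         # emit whatever is still buffered

# Feed the "stream" in pieces, the way tar hands data to gzip down a pipe.
pieces = [b"first file contents\n" * 2000, b"second file contents\n" * 2000]
compressed = b"".join(compress_stream(pieces))
print(sum(map(len, pieces)), "input bytes ->", len(compressed), "compressed bytes")

The whole input never has to be held in memory: compressed output can be emitted while later input is still arriving.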


The short answer is that it doesn't -- gzip works incrementally, so the first part of a file generally is not compressed quite as much as later parts of the file.

The nice thing about this is that the compressed data itself contains what's necessary to build the "dictionary" used to decompress it, so you never have to transmit the dictionary explicitly along with the data.
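You can see both effects with a small Python experiment using the standard zlib module (the test string is arbitrary): the first copy of the line costs the most, later copies are nearly free, and no dictionary is ever sent separately.

import zlib

line = b"the quick brown fox jumps over the lazy dog\n"

# Compress 1, 2, 4, ... copies of the same line.  The first copy is the
# expensive one; later copies are cheap because the sliding window already
# contains the text.
prev = 0
for n in (1, 2, 4, 8, 16):
    size = len(zlib.compress(line * n, 9))
    print(f"{n:2d} copies -> {size:3d} bytes ({size - prev:+d} vs previous)")
    prev = size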

There are methods of compression (e.g., two-pass Huffman compression) where you scan through the data to find an ideal "dictionary" for that particular data, and then use it to compress the data. When you do this, however, you generally have to transmit the dictionary along with the data to be able to decompress it on the receiving end.
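As a rough sketch of the two-pass idea (not gzip's actual format), here is a hypothetical Python example that counts symbol frequencies in a first pass to build a Huffman code table, then encodes in a second pass; a real format would also have to ship that table to the decompressor:

import heapq
from collections import Counter

def build_huffman_table(data):
    # Pass 1: count symbol frequencies, then repeatedly merge the two
    # lightest subtrees; the path to each leaf becomes its code.
    freq = Counter(data)
    heap = [(weight, i, {sym: ""}) for i, (sym, weight) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                              # degenerate single-symbol input
        return {sym: "0" for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(data, table):
    # Pass 2: replace each symbol with its variable-length code.
    return "".join(table[sym] for sym in data)

data = b"this is an example for two-pass huffman coding"
table = build_huffman_table(data)
bits = encode(data, table)
print(len(bits), "bits vs", len(data) * 8, "bits uncompressed (table not counted)")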

That can be a reasonable tradeoff -- if you have a reasonably high level of certainty that you'll be compressing enough data with the same dictionary, you might gain more from the improved compression than you lose by transmitting the dictionary. There is one problem though: the "character" of the data often changes within the same file, so the dictionary that works best for one part of the file may not be very good at all for a different part. This is particularly relevant when compressing a tar file that contains a number of constituent files, each of which may (and probably will) have different redundancy.
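zlib happens to expose a preset-dictionary feature that makes the tradeoff concrete; here is a small Python sketch with an invented shared dictionary that both sides would have to agree on out of band:

import zlib

# Invented "dictionary" of byte strings both sides expect to recur.
shared_dict = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\nUser-Agent:"
message = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\nUser-Agent: demo\r\n\r\n"

plain = zlib.compress(message, 9)

c = zlib.compressobj(level=9, zdict=shared_dict)
with_dict = c.compress(message) + c.flush()

# The receiver must construct its decompressor with the same zdict,
# which is exactly the "transmit the dictionary" cost discussed above.
d = zlib.decompressobj(zdict=shared_dict)
assert d.decompress(with_dict) == message

print("no dictionary:  ", len(plain), "bytes")
print("with dictionary:", len(with_dict), "bytes")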

The incremental/dynamic compression that gzip uses deals with that fairly well, because the dictionary it uses is automatically/constantly "adjusting" itself based on a window of the most recently-seen data. The primary disadvantage is that there's a bit of a "lag" built in, so right where the "character" of the data changes, the compression will temporarily drop until the dictionary has had a chance to "adjust" to the change.

A two-pass algorithm can improve compression for data that remains similar throughout the entire stream you're compressing. An incremental algorithm tends to do a better job of adjusting to more variable data.
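Here is a rough way to watch that adjustment happen in Python; the text/noise split is artificial, and Z_SYNC_FLUSH is only used to force per-chunk output so the sizes can be measured:

import os
import zlib

text = b"the quick brown fox jumps over the lazy dog\n" * 1000
noise = os.urandom(len(text))        # incompressible bytes of the same length

c = zlib.compressobj(9)
for label, chunk in (("text #1", text), ("noise  ", noise), ("text #2", text)):
    out = c.compress(chunk) + c.flush(zlib.Z_SYNC_FLUSH)   # force per-chunk output
    print(label, ":", len(chunk), "bytes in ->", len(out), "bytes out")
print("final flush:", len(c.flush()), "bytes")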


When you say tar -zcvf X, that is roughly equivalent to saying:

tar -cvf - | gzip > X

So all gzip sees is a bunch of bytes that it compresses; tar and gzip don't have a conversation about how tar should order the files so that gzip can optimally compress the entire stream. And gzip doesn't know the tar data format, so it cannot rearrange things for better compression.
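The same separation can be sketched in Python with the standard tarfile and gzip modules (the file names and contents here are invented): the tar layer produces an opaque byte stream, and the gzip layer compresses it without any knowledge of file boundaries.

import gzip
import io
import tarfile

# Build an uncompressed tar stream in memory -- the byte stream that
# `tar -cvf -` would write down the pipe.
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w") as tar:
    for name, payload in (("a.txt", b"first file\n"), ("b.txt", b"second file\n")):
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# gzip never sees "files", only this opaque stream of bytes.
compressed = gzip.compress(raw.getvalue())
print(len(raw.getvalue()), "tar bytes ->", len(compressed), "gzip bytes")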
