
Why can Zip compress a single file smaller than multiple files with the same content?


Suppose I have 10,000 XML files and I want to send them to a friend. Before sending them, I would like to compress them.

Method 1: Don't compress them

Results:

Method 2: Zip each file individually and send 10,000 zip files

Command:
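
Something along these lines would do it (illustrative only, assuming Info-ZIP's zip in a Unix-like shell):

    # Zip every file individually, producing one .zip per .xml
    for f in *.xml; do zip -q "$f.zip" "$f"; done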

Results:

Method 3: Create a single zip file containing 10,000 XML files

Command:
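
Again purely illustrative, assuming Info-ZIP's zip (the archive name is made up):

    # Put all 10,000 XML files into a single archive
    # (with this many files, `zip -q xmlfiles.zip -@ < filelist` avoids shell argument limits)
    zip -q xmlfiles.zip *.xml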

Results:

Method 4: Concatenate the files into a single file and compress that

Command:
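
Illustrative again; the files are concatenated into one big file, which is then zipped:

    # Concatenate everything into one file, then compress that single file
    cat *.xml > onefile.txt
    zip -q onefile.zip onefile.txt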

Results:


Questions:

  • Why do I get so dramatically better results when I compress a single file?
  • I was expecting method 3 to give drastically better results than method 2, but it doesn't. Why?
  • Is this behavior specific to zip? Would I get different results with another compression tool?

Additional information:


Edit: metadata

One answer suggests that the difference is in the system metadata stored in the zip. I don't think that can be the case. For testing I did the following:

The resulting zip is 1.4MB. This means there is ~ 10MB of unexplained space left.






Answers:


Zip treats the contents of each file separately when compressing. Each file gets its own compressed stream. The compression algorithm (usually DEFLATE) identifies repeated sections within that stream, but Zip has no support for finding redundancy between files.

This is why there is so much extra space when the content is split across multiple files: effectively the same compressed stream ends up in the archive many times.
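
A quick way to see this effect, sketched with Info-ZIP's zip and a made-up, repetitive sample.xml:

    # Make 100 identical copies of one small XML file
    for i in $(seq 1 100); do cp sample.xml "copy_$i.xml"; done

    # Each copy gets its own DEFLATE stream, so the archive is roughly 100x one compressed copy
    zip -q many.zip copy_*.xml

    # One concatenated file gets a single stream, so the repeated copies compress to almost nothing
    cat copy_*.xml > combined.txt
    zip -q one.zip combined.txt

    ls -l many.zip one.zip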







ZIP compression is based on repetitive patterns in the data being compressed, and compression improves with file length, since more and longer patterns can be found and used.

Put simply, when you compress a file, the dictionary that maps (short) codes to (longer) patterns is necessarily contained in each resulting zip file. If you zip one long file, the dictionary is reused and becomes effective across all of its content.

If your files are even a bit similar (as text always is) then reusing the 'dictionary' becomes very efficient and the result is a much smaller overall zip.
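
A rough illustration of that reuse within a single stream (gzip here purely as an example; sample.xml stands in for one of the XML files):

    gzip -c sample.xml > single.gz                               # one copy, one stream
    cat sample.xml sample.xml sample.xml | gzip -c > triple.gz   # three copies in one stream
    ls -l single.gz triple.gz                                    # triple.gz is only slightly larger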







In Zip, each file is compressed separately. The opposite is "solid compression", where files are compressed together. 7-Zip and RAR use solid compression by default. Gzip and Bzip2 cannot compress multiple files, so the files are packed with tar first, which has the same effect as solid compression.

Because the XML files have a similar structure, and probably similar content, compressing the files together yields much higher compression.

For example, if a file contains a string that the compressor has already found in another file, it can be replaced with a small pointer back to the previous match. Without solid compression, the first occurrence of that string in each file has to be recorded as a literal, which is larger.
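
A sketch of the difference, assuming the usual command-line tools are installed (7z is 7-Zip's command-line version; file names are made up):

    zip -q separate.zip *.xml        # per-file streams: cross-file redundancy is not used
    tar -czf together.tar.gz *.xml   # tar first, then one gzip stream over everything
    7z a together.7z *.xml           # the 7z format compresses solidly by default
    ls -l separate.zip together.tar.gz together.7z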


In addition to storing the contents of the file, Zip also stores file metadata such as user ID, permissions, creation and modification times, and so on. When you have a file, you have a set of metadata. If you have 10,000 files, you have 10,000 metadata sets.
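
The per-entry metadata shows up in the archive listing, assuming Info-ZIP's tools are available (archive name is illustrative):

    zipinfo xmlfiles.zip | head    # one line of permissions, sizes, and dates per stored entry
    unzip -v xmlfiles.zip | head   # similar listing, including each entry's compressed size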







One option the OP missed is to zip all the files together with compression turned off, and then zip the resulting archive again with maximum compression. This roughly emulates the behavior of compressed *nix .tar.Z, .tar.gz, .tar.bz2, etc. archives, in that the second pass can exploit redundancies across file boundaries (which ZIP cannot do when it compresses each file on its own). The individual XML files can still be extracted later, but the compression is maximized. The downside is that extraction requires an extra step and temporarily uses much more disk space than a normal ZIP archive would need.
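
A sketch of that approach with Info-ZIP's zip (archive names are made up):

    zip -q -0 stored.zip *.xml       # -0 = store only, no compression; entries keep their boundaries
    zip -q -9 outer.zip stored.zip   # recompress the whole archive as a single stream at maximum level

    # Extraction then needs the extra step in reverse:
    unzip -q outer.zip && unzip -q stored.zip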

With the ubiquity of free tools like 7-Zip that handle the tar family on Windows, there is really no reason to forego .tar.gz, .tar.bz2, and the like, since Linux, OS X, and all the BSDs have native tools to manipulate them.





The zip format stores and compresses each file separately. Repetition is exploited only within a file, not between files.

By concatenating the files, zip can take advantage of repetition across all of the files, resulting in drastically better compression.

Suppose every XML file has a certain header. That header occurs only once in each file, but is repeated almost identically in many other files. In methods 2 and 3, zip cannot compress this away, but in method 4 it can.






In addition to the metadata Mike Scott mentioned, there is also overhead in the compression algorithm.

When compressing many small files, you would have to be very lucky for each of them to just happen to fill up a compression block. When a single monolithic block is compressed, the system can simply keep streaming data to its algorithm, ignoring the "boundaries" (for lack of a better word) between the individual files.

ASCII is also known to compress well, and XML in particular repeats itself a lot, which makes the metadata a relatively large part of the data, and the metadata cannot be compressed as easily as the XML content itself.

If memory serves, zip uses a kind of dictionary encoding, which is particularly effective on ASCII files, and especially on XML, because of their repetitiveness.

Explanation of data compression: http://mattmahoney.net/dc/dce.html


Consider this XML:
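
(The fragment below is only a made-up sketch of the kind of repetitive markup meant here; the tag and attribute names are illustrative.)

    <catalog>
      <book id="1"><title>First</title><author>Someone</author></book>
      <book id="2"><title>Second</title><author>Someone Else</author></book>
      <book id="3"><title>Third</title><author>Yet Another</author></book>
    </catalog>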

XML has a very repetitive structure. Zip exploits those repetitions to build a dictionary of patterns, and when compressing it uses fewer bits for the patterns that occur more often and more bits for the ones that occur less often.

If you concatenate these files, the input file (the source for zip) is large, but it contains many more repeated patterns, because the occurrences of XML's boilerplate structure are amortized over the whole big file, and ZIP gets the chance to store those patterns with fewer bits.

If you combine the different XML files into one file, the compression algorithm can find the best pattern distribution across all the files, not file by file.

Ultimately, the compression algorithm finds the best distribution of repeated patterns.


In addition to the 7-Zip answer, there is another approach that isn't quite as good but is worth testing if for some reason you don't want to use 7-Zip:

Compress the zip file. Normally a zip file is incompressible, but if it contains many identical files, the compressor can find that redundancy and shrink it further. Note that I've also seen a small gain when dealing with a large number of files that have no redundancy between them. If you really care about size, it is well worth a try when your zip file contains a huge number of files.
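
For example, with a reasonably recent gzip (the -k option keeps the original; the archive name is made up):

    gzip -9 -k xmlfiles.zip              # writes xmlfiles.zip.gz alongside the original
    ls -l xmlfiles.zip xmlfiles.zip.gz   # only pays off when the zip's entries are redundant with one another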


