File compression algorithms necessarily contain pointers
Why can Zip compress a single file smaller than multiple files with the same content?
Suppose I have 10,000 XML files and want to send them to a friend. Before I send them off, I want to compress them.
Method 1: don't compress them
Method 2: Zip each file individually and send the 10,000 zipped XML files
Method 3: Create a single zip file containing 10,000 XML files
Method 4: Concatenate the files into a single file and zip it
- Why do I get such dramatically better results when I zip a single file?
- I was expecting method 3 to give drastically better results than method 2, but it doesn't. Why?
- Is this behavior specific to Zip? If I tried a different compression tool, would I get different results?
One answer suggests that the difference is in the system metadata stored in the zip. I don't think that can be the case. For testing I did the following:
The resulting zip is 1.4 MB. That still leaves about 10 MB of unexplained space.
Zip treats the contents of each file separately when compressing. Each file gets its own compressed stream. The compression algorithm (usually DEFLATE) works by identifying repeated sections within that stream. However, Zip has no support for finding redundancy between files.
This is why so much extra space is used when the content is spread over multiple files: essentially the same compressed stream is stored in the archive many times.
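This is easy to check with a small Python sketch using the standard `zipfile` module. The `make_xml` helper, the file names, and the file count are made up for illustration; the actual sizes depend on your data.

```python
import io
import zipfile


def make_xml(i: int) -> bytes:
    # Hypothetical tiny XML document; only the id differs between files.
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<record xmlns="http://example.com/schema">\n'
        f"  <id>{i}</id>\n"
        "  <status>active</status>\n"
        "</record>\n"
    ).encode("utf-8")


files = {f"file{i:05d}.xml": make_xml(i) for i in range(1000)}

# Method 3: one archive, but every member gets its own DEFLATE stream.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)
separate_size = len(buf.getvalue())

# Method 4: concatenate first, then compress a single stream.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("all.xml", b"".join(files.values()))
combined_size = len(buf.getvalue())

print("separate members:", separate_size, "bytes")
print("single member:   ", combined_size, "bytes")
```

With near-identical files like these, the single-member archive should come out far smaller, because the repeated XML boilerplate is only paid for once.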
ZIP compression is based on repetitive patterns in the data being compressed. The longer the file, the better the compression, because more and longer patterns can be found and used.
Put simply, when you compress a file, the dictionary that maps (short) codes to (longer) patterns is necessarily included with each compressed file. If you compress one long file, that dictionary is reused to greater effect across all of its contents.
If your files are even a bit similar (as text always is), reusing the "dictionary" becomes very efficient and the result is a much smaller zip overall.
In Zip, each file is compressed separately. The opposite is "solid compression," where files are compressed together. 7-Zip uses solid compression by default for its 7z format, and RAR supports it as an option. Gzip and Bzip2 cannot compress multiple files, so tar is used first to combine them. This has the same effect as solid compression.
Because the XML files have a similar structure and probably similar content, compressing them together gives a higher compression ratio.
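As a rough illustration of the tar-then-gzip effect, here is a sketch that builds a `.zip` and a `.tar.gz` in memory from the same set of files using Python's standard `zipfile` and `tarfile` modules. The sample XML content and file names are invented for the example.

```python
import io
import tarfile
import zipfile

# Hypothetical, nearly identical XML files (only the id differs).
files = {
    f"file{i:05d}.xml": (
        f'<?xml version="1.0"?>\n<record><id>{i}</id>'
        "<status>active</status></record>\n"
    ).encode("utf-8")
    for i in range(1000)
}

# Plain zip: every member is compressed on its own.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)
zip_size = len(zip_buf.getvalue())

# tar + gzip: tar only bundles the files, gzip then compresses the
# whole bundle as one stream, so matches can cross file boundaries.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w:gz") as tf:
    for name, data in files.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
tar_gz_size = len(tar_buf.getvalue())

print("zip:   ", zip_size, "bytes")
print("tar.gz:", tar_gz_size, "bytes")
```

The trade-off is that a solid archive has to be decompressed from the start to reach a file near the end, whereas a zip member can be extracted on its own.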
For example, if a file contains a given string and the compressor has already found that string in another file, it replaces the string with a small pointer to the previous match. If the compressor does not use solid compression, the first occurrence of the string in each file is recorded as a literal, which is larger.
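One way to imitate "the compressor has already seen this string in another file" is zlib's preset dictionary. This is only a sketch of the idea, not what Zip or 7-Zip do internally, and the two sample documents below are hypothetical.

```python
import zlib

# Two hypothetical files that share a long opening string.
shared = b'<catalog><item category="books" currency="USD" lang="en">'
file_a = shared + b"<title>Alpha</title></item></catalog>"
file_b = shared + b"<title>Beta</title></item></catalog>"

# Compressed on its own, file_b has to spell out the shared string
# as literals the first time it appears.
alone = zlib.compress(file_b, 9)

# With file_a as a preset dictionary, the shared string in file_b can
# be encoded as a short back-reference to the earlier occurrence.
compressor = zlib.compressobj(level=9, zdict=file_a)
primed = compressor.compress(file_b) + compressor.flush()

# The decompressor needs the same dictionary to resolve the pointers.
decompressor = zlib.decompressobj(zdict=file_a)
restored = decompressor.decompress(primed) + decompressor.flush()

print("file_b alone:        ", len(alone), "bytes")
print("file_b after file_a: ", len(primed), "bytes")
```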
In addition to storing the contents of the file, Zip also stores file metadata such as user ID, permissions, creation and modification times, and so on. If you have one file, you have one set of metadata. If you have 10,000 files, you have 10,000 sets of metadata.
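To get a feel for the per-file overhead, the sketch below writes empty members into an in-memory archive with Python's `zipfile`, so everything in the result is bookkeeping. The member names are made up; the fixed parts of the ZIP format are a 30-byte local file header and a 46-byte central directory entry per member, each followed by the file name.

```python
import io
import zipfile

N = 1000  # arbitrary number of members for the experiment

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    for i in range(N):
        # Empty contents: whatever ends up in the archive is metadata.
        zf.writestr(f"file{i:05d}.xml", b"")

archive_size = len(buf.getvalue())
per_entry = archive_size / N
print(f"{archive_size} bytes total, about {per_entry:.0f} bytes per empty member")
```

This overhead is real, but on its own it is typically far too small to account for a gap of several megabytes.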
One option missed by the OP is to zip all the files together with compression turned off, and then zip the resulting archive with maximum compression. This roughly emulates the behavior of the *nix compressed archives .tar.Z, .tar.gz, .tar.bz, etc., in that the compression can exploit redundancies across file boundaries (which ZIP cannot do when it compresses the members of a single archive). The individual XML files can still be extracted later, but compression is maximized. The downside is that extraction requires an extra step and temporarily takes up much more space than a normal ZIP archive would.
With free tools like 7-Zip bringing the tar family to Windows, there is really no reason not to use .tar.gz, .tar.bz, etc., since Linux, OS X, and the BSDs all have native tools for handling them.
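A minimal sketch of this store-then-compress approach with Python's `zipfile` follows. The helper `zip_bytes`, the sample XML, and the name `inner.zip` are all invented for illustration.

```python
import io
import zipfile

# Hypothetical, nearly identical XML files.
files = {
    f"file{i:05d}.xml": (
        f'<?xml version="1.0"?>\n<record><id>{i}</id>'
        "<status>active</status></record>\n"
    ).encode("utf-8")
    for i in range(1000)
}


def zip_bytes(members, compression, level=None):
    """Build a zip archive in memory and return its raw bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression,
                         compresslevel=level) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()


# Ordinary zip: each member compressed separately.
plain = zip_bytes(files, zipfile.ZIP_DEFLATED, 9)

# Step 1: bundle everything with compression turned off.
inner = zip_bytes(files, zipfile.ZIP_STORED)
# Step 2: compress the bundle as a single member.
outer = zip_bytes({"inner.zip": inner}, zipfile.ZIP_DEFLATED, 9)

print("plain zip: ", len(plain), "bytes")
print("zip-in-zip:", len(outer), "bytes")

# Extraction takes two steps: unpack the bundle, then the file.
with zipfile.ZipFile(io.BytesIO(outer)) as zf_outer:
    inner_again = zf_outer.read("inner.zip")
with zipfile.ZipFile(io.BytesIO(inner_again)) as zf_inner:
    first = zf_inner.read("file00000.xml")
```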
The zip compression format saves and compresses each file separately. The repetition between files is not exploited, but only within a file.
By concatenating the files, zip can take advantage of the repetitions across all of them, resulting in drastically better compression.
Suppose every XML file has a specific header. This header occurs only once in each file, but is repeated almost identically in many other files. In methods 2 and 3 zip cannot compress it, but in method 4 it can.
In addition to the metadata Mike Scott mentioned, there is also overhead in the compression algorithm.
If you are compressing a number of small individual files, you have to be very lucky for them to compress well, since each one would need to just happen to fill a compression block. When a single monolithic block is compressed, the system can simply keep streaming data to its algorithm, ignoring the "boundaries" (for lack of a better word) between the individual files.
ASCII text is also known to compress well. In addition, XML is often very repetitive, which makes the metadata a large share of the data, and metadata cannot be compressed as easily as the XML content.
Lastly, if memory serves, zip uses a kind of dictionary coding, which is particularly effective on ASCII files, and even more so on XML because of its repetitiveness.
Explanation of data compression: http://mattmahoney.net/dc/dce.html
Consider this XML:
XML has a very repetitive structure. Zip uses that repetitiveness to build a dictionary of which patterns occur more often, and then, when compressing, uses fewer bits to store the more frequent patterns and more bits to store the less frequent ones.
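The "fewer bits for frequent patterns" part is Huffman coding, which DEFLATE combines with its back-references. The sketch below is a simplified picture that assigns code lengths to single bytes only (real DEFLATE also codes match lengths and distances); the function name and sample string are made up for the example.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict[int, int]:
    """Return {byte value: code length in bits} for a Huffman code."""
    freq = Counter(data)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tiebreak, {symbol: depth in the tree}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; their symbols sink one level.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


sample = b"<row><cell>1</cell><cell>2</cell><cell>3</cell></row>"
lengths = huffman_code_lengths(sample)
counts = Counter(sample)

# Print symbols from most to least frequent with their code lengths.
for sym, count in counts.most_common():
    print(repr(chr(sym)), "count", count, "bits", lengths[sym])
```

Characters that belong to the markup, such as `l`, `<` and `>`, dominate the sample and therefore get the shortest codes, while the rare data characters get the longest.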
If you concatenate these files, the source file (the input to zip) is large, but it contains many more repeating patterns, because the boilerplate structure of XML is amortized over the whole large file, which lets ZIP store those patterns using fewer bits.
If you combine different XML files into a single file, the compression algorithm finds the best pattern distribution across all the files rather than file by file.
Ultimately, the compression algorithm has found the best distribution of repeated patterns.
In addition to the 7-Zip answer, there is another approach that isn't that good but would be worth a test if for some reason you don't want to use 7-Zip:
Compress the zip file. Usually a zip file is incompressible. However, if it contains many identical files, the compressor can find this redundancy and compress it. Note that I've also seen a small gain with a large number of files that have no redundancy. If you really care about size, it is well worth a try if you have a huge number of files in your zip file.