Why Some Files Refuse to Compress (and What You Can Do About It)
Ever zip a folder and find the result is the same size—or even bigger? This article explains, in plain language, how compression works and why certain files won’t shrink. You’ll learn practical strategies to get better results and avoid common pitfalls when archiving data.
Compression in Plain English
Compression works by spotting patterns and describing them more efficiently. Imagine a long string like AAAAAABBBBBB: instead of storing every letter, a compressor stores “6 As, 6 Bs.” In real tools, this idea appears as dictionary methods (like LZ77/LZMA) that reference earlier bytes instead of repeating them, and entropy coding (like Huffman or arithmetic coding) that uses fewer bits for more common symbols. When data has structure—repeated words in a document, runs of identical pixels in simple graphics, repeated timestamps and fields in logs—compressors can exploit it. When data looks random, there’s nothing to exploit. Deflate (used in ZIP) combines dictionary matching with Huffman coding; 7z’s LZMA/LZMA2 expands the dictionary and modeling for better ratios at the cost of CPU and memory; Zstandard (zstd) balances high speed with strong compression using modern modeling.
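To make this concrete, here is a minimal sketch using Python’s built-in zlib module (an implementation of Deflate, the same algorithm inside ZIP). It compresses a highly repetitive buffer and a buffer of random bytes at the same level; the size difference tells the whole story.

```python
import os
import zlib

# Structured data: long runs and repeats, exactly what Deflate exploits.
repetitive = b"AAAAAABBBBBB" * 1000

# Unstructured data: random bytes offer no patterns to reference.
random_bytes = os.urandom(12000)

for label, data in [("repetitive", repetitive), ("random", random_bytes)]:
    out = zlib.compress(data, 9)  # level 9 = Deflate's maximum effort
    print(f"{label}: {len(data)} -> {len(out)} bytes")
```

On a typical run, the repetitive buffer collapses to a tiny fraction of its original size, while the random buffer barely changes and may even gain a few bytes of header overhead.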
Why Some Files Don’t Shrink (and Sometimes Grow)
Many modern formats are already compressed. JPEG, PNG, MP3, AAC, H.264/H.265 video, WebP, and many PDFs bundle their own internal compression. Wrapping a JPEG in a ZIP often won’t reduce its size; the best a compressor can do is store it as-is. In fact, trying hard to recompress already compressed data adds overhead—headers, block markers, and failed attempts to find patterns—making the archive slightly larger. Another culprit is encrypted or randomized data. Encryption intentionally removes patterns, making the file statistically indistinguishable from noise. Compressed backups, some databases, and archives-within-archives also appear as noise to a compressor. Even text can resist compression if it’s short or highly diverse (many unique symbols), because overhead dominates. The rule of thumb: the more structure and repetition your data has, the better it compresses; the more it looks like TV static, the less it will shrink.
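You can watch the “already compressed” effect directly. The sketch below, again using Python’s zlib as a stand-in for any Deflate-based tool, compresses repetitive text once and then feeds the compressed output back through the compressor; the second pass typically comes out slightly larger, not smaller.

```python
import zlib

# Repetitive log-like text: an easy target for the first pass.
text = b"2024-01-15 INFO request served in 12ms\n" * 5000

once = zlib.compress(text, 9)

# The first pass's output is dense and nearly patternless, so a second
# pass finds nothing to exploit and only adds container overhead.
twice = zlib.compress(once, 9)

print(f"original:   {len(text)} bytes")
print(f"one pass:   {len(once)} bytes")
print(f"two passes: {len(twice)} bytes")  # typically slightly larger than one pass
```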
Picking Settings and Formats That Match the Job
One size doesn’t fit all. Your choice of algorithm and level should match your files and constraints. For mixed documents and source code, 7z with LZMA/LZMA2 usually gives the best ratios, with ZIP’s Deflate at a high level as a faster, more portable fallback; either way, expect slower processing at higher levels. For very large datasets where time matters, Zstandard at a medium level often beats Deflate on both speed and size. For collections that share many similar files (for example, log archives or source trees), solid compression (supported by 7z and others) can dramatically improve ratios by letting the compressor find cross-file patterns; the tradeoff is slower random access to individual files. For media-heavy archives with many JPEGs, MP4s, or PNGs, use “store” (no compression) for those items to avoid wasting CPU and potentially growing the archive. When archiving for long-term compatibility, ZIP remains the safest bet due to universal support, but consider 7z or zstd-based formats when you control both ends and care about performance or ratio.
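If you want to compare these tradeoffs on your own data, a rough benchmark like the following sketch is enough. It uses Python’s standard-library zlib (Deflate) and lzma (the algorithm family behind 7z); the input filename is a placeholder, and Zstandard appears only as a comment because it needs the third-party zstandard package.

```python
import lzma
import time
import zlib

def bench(name, compress, data):
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name:10s} {len(out):>10} bytes  {elapsed:.3f}s")

with open("sample.log", "rb") as f:  # placeholder: any large file of yours
    data = f.read()

bench("deflate-1", lambda d: zlib.compress(d, 1), data)      # fastest Deflate
bench("deflate-9", lambda d: zlib.compress(d, 9), data)      # ZIP's best effort
bench("lzma-6", lambda d: lzma.compress(d, preset=6), data)  # 7z-style ratio

# Zstandard is not in the standard library; with the third-party
# 'zstandard' package the call would be roughly:
#   zstandard.ZstdCompressor(level=3).compress(data)
```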
When Compression Disappoints: Practical Remedies
Start by separating files into compressible and non-compressible groups. Text, CSV, JSON, XML, logs, and executable code usually compress well; images, audio, and video usually don’t. Apply higher compression only where it pays off. Avoid nesting archives—zipping a folder of zips wastes time and seldom shrinks anything further. If you routinely archive images or videos to save space, consider pre-processing instead: convert BMP to PNG (lossless), or re-encode images and videos with modern codecs (lossy, with quality tradeoffs) before archiving; then store them without further compression. For massive collections of similar small files, use a solid archive or tar them first, then compress the tarball so the compressor can see cross-file patterns. If you hit errors like “unexpected end of file” or CRC mismatches, run your tool’s integrity check (for example, 7z t or unzip -t), try extracting with a different engine, and make sure you aren’t transferring archives through systems that alter bytes (for example, line-ending conversions). For speed- or memory-limited environments, choose a faster algorithm and a moderate level, and avoid large dictionary sizes on low-RAM machines, since decompression needs the same dictionary in memory.
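As a concrete version of the “compress only where it pays off” advice, here is a sketch using Python’s zipfile module: files whose extensions indicate already-compressed formats are stored as-is, and everything else gets Deflate. The extension set and the folder and archive names are illustrative, not exhaustive.

```python
import zipfile
from pathlib import Path

# Formats that carry their own internal compression; recompressing them
# wastes CPU and can even enlarge the archive, so store them untouched.
ALREADY_COMPRESSED = {".jpg", ".jpeg", ".png", ".gif", ".mp3", ".mp4",
                      ".zip", ".7z", ".gz"}

def smart_zip(folder: str, archive: str) -> None:
    with zipfile.ZipFile(archive, "w") as zf:
        for path in Path(folder).rglob("*"):
            if not path.is_file():
                continue
            method = (zipfile.ZIP_STORED
                      if path.suffix.lower() in ALREADY_COMPRESSED
                      else zipfile.ZIP_DEFLATED)
            zf.write(path, path.relative_to(folder), compress_type=method)

smart_zip("my_project", "my_project.zip")  # illustrative names
```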