STB.H COMPRESSOR

I've never bothered documenting the lame compressor that's included in stb.h,
but it turns out it compresses better than NTFS's file compression on the
couple of tests I've run, while using similar (fairly simple) technology:
LZ77 without an entropy coder.

COMPRESSION RESULTS

               calgary files    calgary.tar        enwik8           enwik9
gzip               1,022,810                   36,445,248      322,591,995
lzop               1,162,904                   41,217,688      366,349,786
stb.h:4MB          1,570,429      1,514,354    49,834,290      437,129,232
stb.h:256K         1,583,171      1,577,732    55,123,277      469,520,316
NTFS            1,912,832[1]      1,884,160    66,134,016   636,870,656[2]
uncompressed       3,141,622      3,152,896   100,000,000    1,000,000,000

The gzip and lzop numbers are as reported in Matt Mahoney's "Data Compression
Explained" and "Large Text Compression Benchmark". The NTFS numbers are the
results I got using NTFS compression on my machine (running Win7 64-bit).
Matt Mahoney's "Data Compression Explained" reports item [1] as the nearly
identical "1,916,928" and [2] as "636 MB".

gzip is an LZ77-based compressor with an entropy coder. lzop is an LZ77
compressor optimized for speed, and I don't remember whether it has an
entropy coder (it's open source, but the technology hasn't been documented
and the source isn't very readable) or whether it does any bit-packing. (It
seems like it must, to get so close to gzip.)

stb.h:256K is the compressor in stb.h (described in more detail below) using
a 256KB sliding window on decompression and 1MB of storage for the hash table
while compressing. stb.h:4MB uses a 4MB sliding window at decompression and
16MB of storage for the hash table while compressing. The size of the sliding
window affects both the size of the compressed data and the decompression
memory requirements; the size of the hash table affects the size of the
compressed data but does not affect decompression memory (though it may
affect compression speed by causing more cache misses -- or it may not; I've
never tested). The hash table sizes used were chosen basically arbitrarily.

The code for the stb.h library is in the public domain and is available on
the web, and the particular compressor/decompressor program used to get the
above results with the library is listed at the end of this file.

What about a couple of binary files?

               lzo-2.02.tar.gz    7za.exe (cmdline 7zip v4.65)
stb.h:4MB              596,536        372,827
stb.h:256K             596,547        374,090
NTFS                   599,387        397,312
uncompressed           599,387        536,064

STB COMPRESSOR PERFORMANCE

Theoretically, the stb.h compressor can run fast, although no profiling-based
optimization was performed, so it could probably be faster. The compressor
does fully greedy matching (not lazy matching), so better compression might
be obtainable by trading off some speed.

It's an open question whether the difference between NTFS and stb.h is due to
the file format (LZSS vs. byte-packed LZ77) or just due to different
compressor performance and/or memory usage.

STB COMPRESSION FORMAT

gzip is optimized for compressed size traded off against performance. lzop is
primarily optimized for performance. The stb.h compressor was optimized
largely for source code size/simplicity.

It is an LZ77-esque compressor which outputs all tokens on byte boundaries.
(It is theoretically possible to run an entropy coder on the output bytes,
although you lose a lot of context by treating them all as one dictionary.
For example, the ~49MB enwik8 compressed by stb.h can be further compressed
to 40MB by winrar's default mode, whereas the original file compresses
to 30MB.)

The compressed stream is a sequence of the following tokens:

      TYPE                        BYTE SEQUENCE
   [dict]      00000100 yyyyyyyy yyyyyyyy yyyyyyyy xxxxxxxx xxxxxxxx
   [END]       00000101 11111010
   [dict]      00000110 yyyyyyyy yyyyyyyy yyyyyyyy xxxxxxxx
   [literals]  00000111 zzzzzzzz zzzzzzzz
   [literals]  00001zzz zzzzzzzz
   [dict]      00010yyy yyyyyyyy yyyyyyyy xxxxxxxx xxxxxxxx
   [dict]      00011yyy yyyyyyyy yyyyyyyy xxxxxxxx
   [literals]  001zzzzz
   [dict]      01yyyyyy yyyyyyyy xxxxxxxx
   [dict]      1xxxxxxx yyyyyyyy

      xxxxxxxx: match length - 1
      yyyyyyyy: backwards distance - 1
      zzzzzzzz: number of literals - 1, followed by the literal bytes

The token values were designed ad hoc, without any analysis or number
crunching or tuning against any particular files. No measurements have been
made of the performance of the token types, or of whether certain of the
token types are ever even used.

As can be seen, the token encoding resembles an extremely skewed huffman
tree, with the remaining bits always fully consumed so no bit-buffering is
needed. The data is easy to decode: while individual fields cross byte
boundaries, they always end on byte boundaries, and no byte shares data from
more than one field, so each field can be determined by decoding 1-3 bytes
and then masking out the top bits.

Runs of up to 32 literals can be coded with 1 byte of overhead, up to 2048
literals with 2 bytes of overhead, and up to 65536 literals with 3 bytes of
overhead. Thus both short bursts of unique data and large incompressible
blobs are encoded efficiently.

Matches are flexible. Nearby short matches can be coded in two bytes, while
matches of length 64KB at distances of up to 16MB can be coded in 6 bytes.
Matches of moderate length and distance can be coded using intermediate
byte counts.

The complete stream format is slightly more thoroughly documented in stb.h,
including the specification of the header and an adler32 checksum.
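To make the "mask out the top bits" point concrete, here is a small sketch of
how the three shortest tokens decode. This is illustrative only -- it is not
the decoder in stb.h, and the function and variable names are invented for
the example:

   #include <string.h>

   /* Decode one of the three shortest token types starting at 'in',
      appending to the output window at *outp. Returns the advanced input
      pointer. (Sketch only: no bounds checking, no other token types.) */
   static unsigned char *decode_token(unsigned char *in, unsigned char **outp)
   {
      unsigned char *out = *outp;
      if (in[0] >= 0x80) {                   /* 1xxxxxxx yyyyyyyy */
         int len  = (in[0] & 0x7f) + 1;      /* xxxxxxx  = match length - 1 */
         int dist =  in[1]         + 1;      /* yyyyyyyy = distance - 1     */
         while (len--) { *out = out[-dist]; ++out; }   /* byte-by-byte; copy may overlap */
         in += 2;
      } else if (in[0] >= 0x40) {            /* 01yyyyyy yyyyyyyy xxxxxxxx */
         int dist = (((in[0] << 8) | in[1]) & 0x3fff) + 1;   /* 14-bit distance */
         int len  = in[2] + 1;
         while (len--) { *out = out[-dist]; ++out; }
         in += 3;
      } else if (in[0] >= 0x20) {            /* 001zzzzz: run of 1..32 literals */
         int count = (in[0] & 0x1f) + 1;
         memcpy(out, in + 1, count);
         out += count;
         in  += 1 + count;
      }
      *outp = out;
      return in;
   }

The longer tokens decode the same way, just with more bytes per field; the
token type is determined entirely by the leading bits of the first byte.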
STB COMPRESSOR DESCRIPTION

The above description thoroughly characterizes the interesting part of the
stb compressor -- the byte-aligned but flexibly-sized-token file format. For
completeness, I provide a description of the speed-oriented compressor.

The compressor walks a pointer through the data to be compressed. It attempts
to find a match for the data starting at the current location by hashing.
Specifically, it computes four hash values for the current location: a hash
of the first 3 characters, of the first 5 characters, of the first 9
characters, and of the first 13 characters.

Conceptually, it then looks in four hash tables for the most recent location
in the sliding window that corresponds to each hash value. If the most recent
match matched 13 characters, all four entries will point to the same spot in
the window (because that spot matched 13 characters, 9 characters, 5
characters, and 3 characters alike). But if longer matches are rarer, then
the entries indexed by the longer hash values will typically be further back
in the window. Thus in the best case there will be 4 different matching
locations that can be tested to find the best match. (Note that the number of
hash entries is always 4 and old entries are always overwritten on collision,
so if the hash tables are too small, a long window won't make much
difference, since items very far back in the window won't appear in the hash
table anyway.)

The hash value itself is extremely trivial and designed to be fast. Each hash
is computed by circularly shifting the previous hash (the one over fewer
characters) and adding in the new data. (Addition is used rather than xor
because carries propagate and give better mixing, and because of the
possibility of LEA being used on x86.)

For a given hash value, there is only one matching entry in the hash table;
if there is a hash collision, the "true" match is lost. Short matches are
more likely to disappear on collisions, because long matches hash to multiple
places and are unlikely to collide on all of them. To simplify coding and
reduce register pressure in the inner loop, instead of four conceptual hash
tables, one physical hash table is used to store all four kinds of entries.
Thus entries from each conceptual hash table can collide with and hide
entries from the others. It is unclear what the practical cost of using one
hash table instead of four is (the entries can collide against each other,
but they also have more room in the table, so it's hopefully a wash).
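Here is a rough sketch of that match-finding scheme. It is not the code in
stb.h -- the rotate amount, the scramble multiplier, the table size, and all
of the names are placeholders invented for the example -- but it shows the
shape of the idea: one rolling hash extended over 3, 5, 9, and 13 bytes, with
all four entries stored in a single shared table that keeps only the most
recent position per slot:

   #include <stdint.h>

   #define HASH_SIZE (1 << 18)   /* entry count chosen arbitrarily for the sketch */

   static const unsigned char *hash_table[HASH_SIZE];

   /* extend a hash over one more byte: rotate, then add (carries help mix,
      and the add can compile to LEA on x86) */
   static uint32_t hash_step(uint32_t h, unsigned char c)
   {
      return ((h << 7) | (h >> 25)) + c;
   }

   /* collect up to 4 candidate match locations for the data at p by hashing
      its first 3, 5, 9, and 13 bytes into the one shared table */
   static int find_candidates(const unsigned char *p, const unsigned char *cand[4])
   {
      static const int lengths[4] = { 3, 5, 9, 13 };
      uint32_t h = 0;
      int i, consumed = 0, n = 0;
      for (i = 0; i < 4; ++i) {
         uint32_t slot;
         while (consumed < lengths[i])
            h = hash_step(h, p[consumed++]);
         slot = ((h * 2654435761u) >> 12) & (HASH_SIZE - 1);
         if (hash_table[slot]) cand[n++] = hash_table[slot];  /* could be a collision; must verify bytes */
         hash_table[slot] = p;                                /* newest position overwrites on collision */
      }
      return n;   /* caller compares each candidate against p and keeps the longest real match */
   }

In stb.h itself the window and hash-table sizes are whatever the test program
below configures with stb_compress_window() and stb_compress_hashsize().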
STB.H COMPRESSION PROGRAM

// the following program is included to allow the possibility
// of reproducing results. the actual compression algorithm
// is found in stb.h

#define STB_DEFINE
#include "stb.h"   /* http://nothings.org/stb.h -- above results used version 2.23 */

int main(int argc, char **argv)
{
   if (argc >= 2) {
      int i;
      stb_compress_window(1 << 18);    // bytes
      stb_compress_hashsize(1 << 20);  // bytes
      switch (argv[1][0]) {
         case 's':
            stb_compress_window(1 << 22);
            stb_compress_hashsize(1 << 24);
            /* fallthrough */
         case 'c':
            for (i=2; i < argc; ++i) {
               int len;
               char *f = stb_file(argv[i], &len);
               if (!f) stb_fatal("Couldn't open %s\n", argv[i]);
               stb_compress_tofile(stb_sprintf("%s.stbcmp", argv[i]), f, len);
               free(f);
            }
            return 0;
         case 'd':
            for (i=2; i < argc; ++i) {
               int len;
               char *f = stb_decompress_fromfile(argv[i], &len);
               if (!f) stb_fatal("Couldn't open %s\n", argv[i]);
               stb_filewrite(stb_sprintf("%s.decomp", argv[i]), f, len);
               free(f);
            }
            return 0;
      }
   }
   fprintf(stderr, "Usage:\n"
                   "   scompress c {files to compress}\n"
                   "   scompress d {files to decompress}\n");
   return 1;
}
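(The usage message only lists 'c' and 'd', but there is also the 's' mode
handled above them: 's' switches to the 4MB window / 16MB hash table
configuration, so presumably the stb.h:4MB results were produced with 's'
and the stb.h:256K results with 'c', which uses the 256KB window / 1MB hash
table defaults set at the top of main.)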