Two nights ago I prototyped algorithm 5 (which is algorithm 2 + algorithm 3, and algorithm 4 was already taken), yesterday I debugged and tested, today I post. The ideas, generally speaking, panned out.
In my last entry, I stated that the version of zlib I was using compressed the calgary corpus to 1,025,183 bytes, with algorithm 2 compressing it to 1,024,802 bytes. Algorithm 5 manages to do significantly better at 998,766 bytes, a savings of 26,417 bytes over zlib.
There are some inputs for which zlib does better, sometimes significantly, but on larger files, my algorithm tends to win. Here is some samples of the particular calgary corpus files, their original size, the size of the compressed representation that my algorithm, zlib, and bzip2 produce (I usually run my algorithm on the smallest stuff first, just so I can see relevant output faster)
name size alg5 zlib bzip2
----------------------------------------
obj1 21504 -> 11881 10317 10787
progc 39611 -> 14506 13343 12544
progp 49379 -> 11654 11228 10710
paper1 53161 -> 19501 18558 16558
progl 71646 -> 16809 16255 15579
paper2 82199 -> 30290 29760 25041
trans 93695 -> 19227 19045 17899
geo 102400 -> 77655 68433 56921
bib 111261 -> 34500 35228 27467
obj2 246814 -> 87453 81505 76441
news 377109 -> 136305 144800 118600
pic 513216 -> 58509 56465 49759
book2 610856 -> 190542 206664 157443
book1 768771 -> 289904 313582 232598
For larger natural language texts, my algorithm does pretty well, the overall ~26k improvement coming primarily from the 23.5k improvement on book1 alone, with book2's improvements overcompensating for the mild (except for geo) losses in the other examples.
I had started trying out some other inputs, like the Canterbury corpus, some artifical inputs, and some "large" inputs, all made available with the Canterbury corpus. It only got a little ways in when I stopped it to muck about with some other tunings (which didn't work out), but here are some sample results from the other corpuses, corpi?
name size alg5 zlib bzip2
--------------------------------------------------------
artificl\a.txt 1 -> 3 9 37
cantrbry\grammar.lsp 3721 -> 1349 1222 1283
cantrbry\xargs.1 4227 -> 1927 1736 1762
cantrbry\fields.c 11150 -> 3356 3122 3039
cantrbry\cp.html 24603 -> 8832 7961 7624
cantrbry\sum 38240 -> 14891 12990 12909
artificl\aaa.txt 100000 -> 5 121 47
artificl\alphabet.txt 100000 -> 33 290 131
artificl\random.txt 100000 -> 91321 75735 75684
cantrbry\asyoulik.txt 125179 -> 48731 48897 39569
cantrbry\alice29.txt 152089 -> 52814 54404 43202
cantrbry\lcet10.txt 426754 -> 130877 144904 107706
(updated)
cantrbry\plrabn12.txt 481861 -> 181116 195261 145577
cantrbry\ptt5 513216 -> 58509 56465 49759
cantrbry\kennedy.xls 1029744 -> 228718 203992 130280
large\world192.txt 2473400 -> 570187 724970 489583
large\bible.txt 4047392 -> 997609 1191885 845623
large\E.coli 4638690 -> 1451107 1341990 1251004
Again, for larger natural language texts, my algorithm does pretty well against zlib, and even manages to do pretty well on the most compressable of inputs (aaa.txt and alphabet.txt), due to some implementation quirks.
I'm curious to see how well it does on the bible (included in the "large" inputs), but it probably won't get to that for a little while longer.
Updated: I added a few more results, the most interesting being a portion of the 1992 edition of the CIA world Factbook (large\world192.txt) where my algorithm is able to improve upon zlib by almost 150,000 bytes, and a portion of the bible also beating zlib by just under 195,000 bytes. It still generally loses to bzip2, sometimes by a large margin, but that isn't terribly surprising.
I still have some ideas, which is good, as so far at least a few of my ideas seem to offer improvements.
Hopefully I'll eventually be able to win on average bits per character... bpci = 8*compressedsizei/inputsizei avg_bpc = SUM(bpci/n) |