Algorithm: dictionary

Targhan · 08:26, 08 August 17

For a "certain project", I need to build some kind of dictionary in order to encode a large amount of text efficiently. For example:
The text "hello low hell" would result in the dictionary "hellow", with every word is a reference to an index of the dictionary:
hello -> 0, 5 (offset, size)
low -> 3, 3
hell -> 0, 4

I have no problem creating my own algorithm, in any language (all this will be generated on PC), but I am pretty sure someone, on CPC or on the net, has already did something like this and I don't want to re-invent the wheel. Does anyone know about an existing implementation? Also, what is the name of such compression ("dictionary" is pretty vague)?

Thanks.

roudoudou · 08:36, 08 August 17

the flower cruncher was LZW, it was slow to crunch/decrunch and less efficient than clever LZ alternatives

text compression is very dependant of the number of differents char to crunch and the language, as you may include short encoding, statistics about char following char, ...

arnoldemu · 08:53, 08 August 17

http://web.stonehill.edu/compsci/lc/textcompression.htm

"Huffman tree" and "Huffman codes"

Targhan · 09:39, 08 August 17

Thanks! However, I believe these algorithms are overkill for what I am doing. I need a fast decrunch time, on individual text files, that would refer only one dictionary for all the texts. Maybe that's doable with Hoffman, tough... Maybe I'm wrong, but for my needs, I believe that working on a word-basis is better than on char-basis, but maybe I'm wrong.

Grim · 10:55, 08 August 17

If you're _not_ aiming for maximum ratio:
- Determine your alphabet size (the number of different chars in your text, ie. letters, numbers, punctuation, etc) however you want (hard-coded or guessed by parsing your text files).
- All your chars must be coded from 0 up to alphabet-size
- Build a frequency-table/histogram of all the words appearing in your text (you will have to define some rules/filters there to decide to ignore words that are too small or if eg. "it's" is a single word, etc)
- Sort that table on the frequency value in descending order
- Delete all words that only appeared once
- Keep only the first (256 - alphabet size) entries in the table

Now you can encode your text as follow:
- Output a first byte with the value of your alphabet size (unless you hard-coded it, in that case, nevermind)
- To output a single char, write out a byte between [0, alphabet size[ (its simply the char-code)
- To output a word, write out a byte between [alphabet size, up to 255] (that's your word-index offsetted by your alphabet size)

To decode:
- Read the first byte to get the alphabet size used (unless yada-yada as above)
- if that byte is less than your alphabet size, it's a char
- Otherwise subtract the alphabet-size to it and you have the index of a word

Dunno about the texts you intend to pack, but from my own "certain project", I've got an average ratio of 60/70% with this (and an RLE code, see below).

Notes:
If your alphabet size is less than 128 and you do not care much about the time it will take the decoder to lookup a word in your table, you can use a flag on bit7 of the first (or last, as you prefer) character of each word and then simply go through all the entries, counting flags until you reach the correct word. Slow, but it will spare you from wasting space with another table of fixed-size pointers to your words.

To get a little more ratio:
Reserving a special RLE code (eg. 255) to indicate to the decoder the next byte gives the number of time the last char must be repeated. You might also want to reserve a code as an EOF marker too.

It will mostly depend on the text to encode, but you can also re-arrange your char-codes so that the ones only appearing in words, but never as literals, are coded last (at the end of the alphabet) so that you can use their codes in the stream for additional word-entries.

Using an entropy-coder to encode literals will improve things much ratio-wise, but will require additional buffers/tables/trees and hit significantly the decoder execution time. Huffman is certainly the most simple and over-documented entropy-coder ever, but it is to entropy-coders what RLE is to compression. Still can be "good enough" I guess.

Quote from: Targhan on 09:39, 08 August 17
I need a fast decrunch time, on individual text files, that would refer only one dictionary for all the texts. Maybe that's doable with Hoffman, tough... Maybe I'm wrong, but for my needs, I believe that working on a word-basis is better than on char-basis, but maybe I'm wrong.

Damn people posting with absolutely vital informations that should have been given from the start just before I reply! Grmlbh!
Forget "Hoffman", he makes better music than fast entropy-decoder last time I checked =)

Targhan · 11:50, 08 August 17

Hey, glad you're still alive Grim

.

Thanks for all this. I had something a bit different in mind, but this gives me some new ideas. As for speed, I only meant "not too slow"

.

Grim · 17:13, 08 August 17

Quote from: Targhan on 11:50, 08 August 17Hey, glad you're still alive Grim .

Yeah, alive, somewhat. What doesn't kill you makes you ... stranger =)

Quote from: Targhan on 11:50, 08 August 17Thanks for all this. I had something a bit different in mind, but this gives me some new ideas. As for speed, I only meant "not too slow" .

For "Not too slow" with better compression ratio, you might try encoding everything bitwise, rather than bytewise. Still do some sort of dictionnary references, but ice the cake with Huffman-codes or even some variable-length codes (Elias, Golomb-Rice or some handcrafted one that won't spill too much bits too fast) to code all your chars (these will have to be sorted by their number of occurrences too, ie. 0 for the most frequent char).

Beyond that, I think things will quickly get more intricate and probably overkill (for ideas, maybe look at the MTF transform that can help keep your variable-length codes small, also the BWT algorithm, real killer for text compression, alas CPU-cycles too).

||C|-|E|| · 18:20, 08 August 17

For our text adventure, we did something similar to Grim´s suggestion (it is PAWS implementation by default). During the adventure only the first 128 ASCII characters are used, so the remaining ones are assigned to the most frequent 128 words in the game, that are stored in a database. This way, you can assign each word to a single character and save quite a lot of space. Very simple, it is not really a proper compression, I would say, but super fast, and it saved us around 25KB

Of course, it also helps a lot to choose your text in a careful way to obtain the maximum benefit from this approach $:-\$

PulkoMandy · 08:01, 09 August 17

What you are looking for is "subsequence packing". I don't know if there is a ready-made Amstrad version, however, but a lot of research papers around DNA sequencing.

Targhan · 11:50, 09 August 17

Thanks ! Damn, this is complex stuff

. Vastly overkill for my needs (and for my brain, I fully admit it).

SRS · 20:00, 09 August 17

You may have a look at shoco or smaz.

http://ed-von-schleck.github.io/shoco/#comparisons-with-other-compressors

Targhan · 23:39, 09 August 17

This is more like it. Thanks!

News:

Algorithm: dictionary