Compression of text (and other)

Started by teopl, 19:53, 01 July 19


teopl

Hi, I am using ZX7 decompression for the sprites, levels and text in my (first) CPC 464 game.

A few days ago I realized that since I only use 48 characters in my font (digits, uppercase letters and a few special chars), I could encode each character in 6 bits instead of 8 - before the compression!

After some time the result was that in the end I didn't save anything, and the data even took up 5-10% more space...  :picard:

Now my question is:

- how do you handle text compression - maybe create a custom word dictionary based on the most frequent words?
- how does the ZX7 compression work in general - what is it good/bad for?
- I know there are some topics on this, but in conclusion - what are my options for compression (besides cpctelera's ZX7)?
- some links to the latest/usable compressor/decompressor code would be great - I have read the forum but it is all spread across posts and githubs...

so... compression anyone? :)

Docent

Quote from: teopl on 19:53, 01 July 19
- how do you handle text compression - maybe create a custom word dictionary based on the most frequent words?
It's the simplest to implement and a pretty effective compression method: replace the longest and most frequent words with a single byte, which is an index into a table of pointers to the replaced words.
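
A minimal sketch of that scheme in C (the 48/208 code split, the font string and the table contents are assumptions chosen to match teopl's 48-character font, not actual thread code): the decoder is just a lookup per byte.

#include <stdio.h>

/* Sketch of the word-dictionary scheme: codes below DICT_BASE are font
   characters, codes from DICT_BASE up index a table of frequent words.
   All values here are illustrative. */
#define DICT_BASE 48

/* a 48-character font: A-Z, 0-9 and a few specials */
static const char font[DICT_BASE + 1] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,!?'-:;()/";

/* hypothetical word table, ordered by bytes saved */
static const char *frequent_words[] = { "THE ", "YOU ", "AND " };

static void decode(const unsigned char *src, unsigned int len)
{
    while (len--) {
        unsigned char c = *src++;
        if (c < DICT_BASE)
            putchar(font[c]);                             /* one character */
        else
            fputs(frequent_words[c - DICT_BASE], stdout); /* whole word */
    }
}

int main(void)
{
    const unsigned char msg[] = { 48, 0, 1 };  /* "THE " + 'A' + 'B' */
    decode(msg, sizeof msg);
    putchar('\n');
    return 0;
}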
Quote from: teopl on 19:53, 01 July 19
- how does the ZX7 compression work in general - what is it good/bad for?
As far as I know, ZX7 uses a variant of the LZ77 compression algorithm with fixed bit-size offsets. Basically it tries to find repeating sequences of data and replaces them with an offset to their previous appearance in the data stream.
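
That general LZ77 mechanism as a C sketch (the idea only: real ZX7 packs its commands into variable-length bit codes, not whole-byte structs like these):

#include <stdio.h>

/* Each command is either a literal byte or a (distance, length) pair
   that copies bytes the decoder has already produced. */
typedef struct {
    unsigned char is_match;   /* 0 = literal, 1 = copy from history */
    unsigned char literal;    /* used when is_match == 0 */
    unsigned int  distance;   /* how far back to copy from */
    unsigned int  length;     /* how many bytes to copy */
} Command;

static unsigned int decompress(const Command *cmd, unsigned int n,
                               unsigned char *out)
{
    unsigned int pos = 0, i;
    for (i = 0; i < n; i++) {
        if (!cmd[i].is_match) {
            out[pos++] = cmd[i].literal;
        } else {
            /* byte-by-byte copy so overlapping matches (distance less
               than length) work - that is how runs are encoded */
            unsigned int j;
            for (j = 0; j < cmd[i].length; j++, pos++)
                out[pos] = out[pos - cmd[i].distance];
        }
    }
    return pos;
}

int main(void)
{
    /* "ABABAB": two literals, then copy 4 bytes from 2 back */
    const Command cmds[] = { {0, 'A', 0, 0}, {0, 'B', 0, 0}, {1, 0, 2, 4} };
    unsigned char buf[16];
    unsigned int n = decompress(cmds, 3, buf);
    fwrite(buf, 1, n, stdout);   /* prints ABABAB */
    putchar('\n');
    return 0;
}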




teopl

@Docent Yeah, I guess I will do that, thanks. I was just hoping that there was something out there already... But I guess this is the best solution.

So I will count bytes saved like this:

BYTES_SAVED = WORD_FREQUENCY * (WORD_LENGTH-1)

then sort it and use the words with the largest savings for the 255-48 ≈ 200 free codes (since I will still use my 48 letters for the other words).
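
A small C sketch of that scoring step (the word list and counts are invented; the bash script later in this thread does the same job with shell tools):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Score each candidate word by BYTES_SAVED = WORD_FREQUENCY *
   (WORD_LENGTH - 1) and keep the top scorers for the spare byte codes. */
typedef struct {
    const char *word;
    int freq;
    int saved;
} Candidate;

static int by_saved_desc(const void *a, const void *b)
{
    return ((const Candidate *)b)->saved - ((const Candidate *)a)->saved;
}

int main(void)
{
    Candidate c[] = {
        { "THE",     120, 0 },   /* 120 * (3-1) = 240 bytes saved */
        { "BECAUSE",  15, 0 },   /*  15 * (7-1) =  90 bytes saved */
        { "OF",       90, 0 },   /*  90 * (2-1) =  90 bytes saved */
    };
    size_t i, n = sizeof c / sizeof c[0];

    for (i = 0; i < n; i++)
        c[i].saved = c[i].freq * ((int)strlen(c[i].word) - 1);

    qsort(c, n, sizeof c[0], by_saved_desc);

    /* in the real tool, keep the first 255-48 = 207 entries */
    for (i = 0; i < n; i++)
        printf("%s saves %d bytes\n", c[i].word, c[i].saved);
    return 0;
}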

GUNHED

In this case use Exomizer (there is a thread about it in this forum). But it's a bit slow.


Use Madram's Turbo Cruncher for speed and efficiency.

http://futureos.de --> Get the revolutionary FutureOS (Update: 2023.11.30)
http://futureos.cpc-live.com/files/LambdaSpeak_RSX_by_TFM.zip --> Get the RSX-ROM for LambdaSpeak :-) (Updated: 2021.12.26)

Docent

Quote from: teopl on 22:29, 02 July 19
@Docent Yeah, I guess I will do that, thanks. I was just hoping that there was something out there already... But I guess this is the best solution.

So I will count bytes saved like this:

BYTES_SAVED = WORD_FREQUENCY * (WORD_LENGTH-1)


then sort it and use the words with the largest savings for the 255-48 ≈ 200 free codes (since I will still use my 48 letters for the other words).
Exactly, this approach should give you the best compression with this method. It also has an advantage over compressing the data with a separate packer: it allows real-time decompression/display of any packed sentence within the data, without having to decompress the whole thing.
If word compression is not good enough, you can try digraphs/trigraphs. The approach is the same, but instead of words you use two- or three-letter groups. You can generate the frequency table yourself from your data (which may give a better compression ratio), or use tables that are already available on the internet for different languages. With trigraphs you can theoretically reduce the original text by 66%  ;)
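
A sketch of the trigraph variant (the 128 split and the table contents are invented for the example): codes at or above the split expand to a fixed three-letter group, and since one byte can then stand for three letters, the best case shrinks the text to a third of its size - hence the theoretical 66%.

#include <stdio.h>

#define TRI_BASE 128   /* codes below this are plain characters */

/* hypothetical trigraph table built from frequency counts */
static const char *trigraphs[] = { "THE", "ING", "AND", "ENT" };

static void decode(const unsigned char *src, unsigned int len)
{
    while (len--) {
        unsigned char c = *src++;
        if (c < TRI_BASE)
            putchar(c);                              /* plain char */
        else
            fputs(trigraphs[c - TRI_BASE], stdout);  /* 3-letter group */
    }
}

int main(void)
{
    /* "THE THING" in 5 bytes instead of 9 */
    const unsigned char msg[] = { 128, ' ', 'T', 'H', 129 };
    decode(msg, sizeof msg);
    putchar('\n');
    return 0;
}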


SRS

Quote from: teopl on 22:29, 02 July 19
@Docent Yeah, I guess I will do that, thanks. I was just hoping that there was something out there already... But I guess this is the best solution.

So I will count bytes saved like this:

BYTES_SAVED = WORD_FREQUENCY * (WORD_LENGTH-1)

then sort it and use the words with the largest savings for the 255-48 ≈ 200 free codes (since I will still use my 48 letters for the other words).
Hope you'll share your solution/code here?

introspec

Quote from: teopl on 19:53, 01 July 19
A few days ago I realized that since I only use 48 characters in my font (digits, uppercase letters and a few special chars), I could encode each character in 6 bits instead of 8 - before the compression! After some time the result was that in the end I didn't save anything, and the data even took up 5-10% more space...  :picard:
First I'd like to mention why this did not work. Most Z80-friendly compressors use the byte as the basic unit of data, so when you have 8-bit characters, the compressor can exploit repeating words (or phrases). When you convert the text into a 6-bit stream, the compressor still tries to exploit similarities at the level of bytes. Because of the varying bit shifts produced when packing 6-bit symbols into bytes, even identical words end up generating different byte sequences, so the compressor has a much harder job finding repetitions.
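
A small demonstration of that effect (hypothetical 6-bit codes, A=0..Z=25): packing the same word starting at two different bit offsets produces entirely different byte sequences, which is exactly what defeats a byte-based matcher.

#include <stdio.h>

/* Pack n 6-bit codes into bytes, as if bit_offset bits had already
   been written before the word starts, and print the output bytes. */
static void pack6(const unsigned char *codes, int n, int bit_offset)
{
    unsigned int acc = 0;
    int bits = bit_offset, i;

    printf("offset %d:", bit_offset);
    for (i = 0; i < n; i++) {
        acc = (acc << 6) | (codes[i] & 0x3F);   /* append the next 6 bits */
        bits += 6;
        while (bits >= 8) {                     /* emit every full byte */
            bits -= 8;
            printf(" %02X", (acc >> bits) & 0xFF);
        }
        acc &= (1u << bits) - 1;                /* drop the emitted bits */
    }
    if (bits)                                   /* flush the remainder */
        printf(" %02X", (acc << (8 - bits)) & 0xFF);
    printf("\n");
}

int main(void)
{
    const unsigned char hello[] = { 7, 4, 11, 11, 14 };  /* "HELLO" */
    pack6(hello, 5, 0);   /* word starts on a byte boundary: 1C 42 CB 38 */
    pack6(hello, 5, 2);   /* same word two bits later:       07 10 B2 CE */
    return 0;
}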



Quote from: teopl on 19:53, 01 July 19
How do you handle text compression - maybe create a custom word dictionary based on the most frequent words? How does the ZX7 compression work in general - what is it good/bad for? I know that there are some topics on this, but in conclusion - what are my options for compression (besides cpctelera's ZX7)?
Generally speaking, most compressors on the Z80 work in a similar way to ZX7. The answer to your question depends on your primary use scenario.


It could be that you need to compress/decompress a few-thousand-word article at a time. If this is the case, you'd do best with one of the stronger compressors. You'd get the best ratio with Shrinkler, but it is extremely slow to decompress and it also requires a substantial memory buffer for decompression. A more practical option would be Exomizer: it only needs a 156-byte buffer for decompression and decompresses data at a rate of about 250 cycles per byte.
If you need faster decompression, a further notch down in compression ratio are ApLib and Hrust1. They will happily decompress your data at about 100-120 cycles per byte while still achieving a better compression ratio than ZX7.


It could also be that you need to compress a lot of smaller texts, maybe a few hundred bytes per piece, in 50 or more pieces. In this case Exomizer becomes a lot less efficient, because it is penalized for storing a compression tree with every piece. This is where ApLib or Hrust1 can overtake Exomizer, simply because they do not use trees at all.

In addition, your thinking about having a dictionary is correct: if you can identify a set of the most common words, you can give it to the compressor so that it can re-use those words during compression. However, compressors based purely on such dictionaries tend to compress less well than generic compressors like Exomizer, ApLib or Hrust1. What you can do to get the best of both worlds is a two-step process: 1) generate a small dictionary based specifically on all the files to be compressed, and then 2) compress your files using this dictionary. You can generate such dictionaries using the training tool that comes with Zstd. Unfortunately, compression with external dictionaries is not available in every data compressor; I am not aware of dictionary support for Exomizer, ApLib or Hrust1. Packers that compress slightly worse but can work with external dictionaries are MegaLZ and LZSA2.
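
The usual mechanism behind such external dictionaries, sketched under the assumption of a plain LZ77 decoder (this is the general idea, not the specific container format of MegaLZ or LZSA2): the dictionary is preloaded in front of the output buffer, so matches can reach back into it from the very first byte of the stream.

#include <stdio.h>
#include <string.h>

#define DICT     "COMPRESSION "   /* invented shared dictionary */
#define DICT_LEN (sizeof DICT - 1)

int main(void)
{
    unsigned char buf[64];
    unsigned int pos = DICT_LEN;

    memcpy(buf, DICT, DICT_LEN);   /* preload the dictionary as history */

    /* first command of a hypothetical stream: copy 11 bytes from
       12 back, i.e. from inside the preloaded dictionary itself */
    {
        unsigned int distance = 12, length = 11, j;
        for (j = 0; j < length; j++, pos++)
            buf[pos] = buf[pos - distance];
    }

    fwrite(buf + DICT_LEN, 1, pos - DICT_LEN, stdout);  /* "COMPRESSION" */
    putchar('\n');
    return 0;
}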






teopl

@Docent: Thanks for the digraphs/trigraphs idea, that 66% sounds very good :) It's good to know that something makes sense before you implement it.
(I mean, I could have guessed that the Z80 works on bytes and that encoding to 6 bits would just scramble them. Instead I went ahead and implemented the wrong idea :| )

@SRS: I will share the code when I get to it, but since I am in a rush for the (hopefully) upcoming #CPCRetroDEV, I have left the text part for later and am now creating levels and story.
Anyway, the complete source will be shared after the competition (the game is called "LUDIC - break the loop"), but I will try to share the text encoding/decoding as soon as I complete and test it.

@introspec: Thanks for the very detailed post, I will check those compressors in detail when I have time. I had heard about Shrinkler but never tried it. Speed is not important for me in this case,
since I will decompress texts on level load, but if the decompressor code is very large I may use Exomizer after the dictionary compression step.

Docent

Quote from: teopl on 22:20, 11 July 19
@Docent: Thanks for the digraphs/trigraphs idea, that 66% sounds very good :) It's good to know that something makes sense before you implement it.
(I mean, I could have guessed that the Z80 works on bytes and that encoding to 6 bits would just scramble them. Instead I went ahead and implemented the wrong idea :| )
Well, the compressor you used operates on bytes, but the Z80 can also operate on bits, so you can still encode each character in 6 bits, store the text as a bit stream and achieve a 25% gain. Then you use bit shifting to decode the bit stream back into characters.
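
A sketch of that bit-shifting decode in C (the font table is the same invented 48-character set as in the earlier example, and the packed bytes are the offset-0 output of the packing demo above; a Z80 version would do the same with register shifts):

#include <stdio.h>

static const char font[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,!?'-:;()/";

/* Read nchars 6-bit codes back out of a packed byte stream. The 25%
   gain comes from storing 4 characters in 3 bytes instead of 4. */
static void decode6(const unsigned char *src, unsigned int nchars)
{
    unsigned int acc = 0;
    int bits = 0;

    while (nchars--) {
        if (bits < 6) {                       /* refill the accumulator */
            acc = (acc << 8) | *src++;
            bits += 8;
        }
        bits -= 6;
        putchar(font[(acc >> bits) & 0x3F]);  /* next 6-bit code */
        acc &= (1u << bits) - 1;              /* keep unconsumed bits */
    }
    putchar('\n');
}

int main(void)
{
    /* "HELLO" (codes 7,4,11,11,14) packed into four bytes */
    const unsigned char packed[] = { 0x1C, 0x42, 0xCB, 0x38 };
    decode6(packed, 5);
    return 0;
}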

teopl

I meant to say "the Z80 compressor works on bytes", sorry. Anyway, bit shifting is exactly what I did before I realized the result was worse than before :)
I mean, it crossed my mind that this could be worse due to the scrambling, but I guess I hoped that ZX7 was smarter.

@SRS: Here is a script I started for generating a C array of the 200 most frequent words in a text file. It's adjusted for my needs: capital English letters and the single quote only, losing all other characters.

#!/bin/bash

if (( $# != 1 )); then
    echo "*** usage: ./frequent_words.sh TEXT_FILE"
    exit 1
fi

TEXT_INPUT=$1
WORD_USAGE="word_usage.txt"
FREQUENT_WORDS_C="frequent_words.c"

# keep only A-Z, space and apostrophe, split into one word per line,
# then count how often each word appears
rm -rf foo >/dev/null 2>&1
cat ${TEXT_INPUT} | perl -p -e "s/[^A-Z \n\r']//g" | tr ' ' '\n' | sort | uniq -c | sort >foo

rm -rf ${WORD_USAGE} >/dev/null 2>&1
while IFS='' read -r LINE || [[ -n "$LINE" ]]; do

    # unquoted ${LINE} collapses the leading spaces that uniq -c adds,
    # so field 1 is the count and field 2 is the word
    WORD=$(echo ${LINE} | cut -d' ' -f2)

    REPEATED=$(echo ${LINE} | cut -d' ' -f1)

    # BYTES_SAVED = WORD_FREQUENCY * (WORD_LENGTH - 1)
    LENGTH_MINUS_ONE=$(( ${#WORD} - 1 ))
    SAVED_BYTES=$(( REPEATED * LENGTH_MINUS_ONE ))

    # zero-pad the savings so the later lexicographic sort orders numerically
    printf "%05d" "${SAVED_BYTES}" >>${WORD_USAGE}
    printf ' "%s" %s * %s\n' "${WORD}" "${REPEATED}" "${LENGTH_MINUS_ONE}" >>${WORD_USAGE}

done < foo
rm -rf foo >/dev/null 2>&1

# keep the 200 words with the largest savings
cat ${WORD_USAGE} | sort -r | head -n200 >foo

# emit the words as a C array; note the closing "};" so the file compiles
printf "char * const frequent_words[] = {\n$(cat foo | perl -p -e 's/.*(\"[^\"]*\").*/    \1,/g' | perl -p0e 's/,\n$/\n\};\n/smg')\n" >${FREQUENT_WORDS_C}
rm -rf foo >/dev/null 2>&1

printf "*** TXT: Generated '${FREQUENT_WORDS_C}'\n"

Docent

Quote from: teopl on 10:45, 12 July 19
I meant to say "the Z80 compressor works on bytes", sorry. Anyway, bit shifting is exactly what I did before I realized the result was worse than before :)
I mean, it crossed my mind that this could be worse due to the scrambling, but I guess I hoped that ZX7 was smarter.

If you just pack the chars into 6 bits and store them as a bit stream in memory, without applying the ZX7 compressor, then you should get that 25% compression :)
Whether that is enough is a different question though...


Docent

Hi introspec,
Was it you who did the Z80 compressors comparison here? http://hypr.ru/blog/dev/740.html
If so, congratulations on beating my LZ4 Z80 implementation :) It looks like there is still some room for improvement to look into...
btw: great article

SRS

Quote from: teopl on 10:45, 12 July 19
I meant to say "the Z80 compressor works on bytes", sorry. Anyway, bit shifting is exactly what I did before I realized the result was worse than before :)
I mean, it crossed my mind that this could be worse due to the scrambling, but I guess I hoped that ZX7 was smarter.

@SRS: Here is a script I started for generating a C array of the 200 most frequent words in a text file. It's adjusted for my needs: capital English letters and the single quote only, losing all other characters.

Nice, I'll give it a try!

introspec

Quote from: Docent on 00:20, 13 July 19
Was it you who did the Z80 compressors comparison here? http://hypr.ru/blog/dev/740.html
If so, congratulations on beating my LZ4 Z80 implementation :) It looks like there is still some room for improvement to look into...
btw: great article
Yes, it was me. Thanks for your kind words, and nice to meet you - I used several cool ideas from your decompressor too. In fact, I've now got an even faster decompressor; I hope to publish this work reasonably soon.
