Decompression: Exomizer vs BitBuster vs LZ48

ervin · 23:26, 24 April 18

Hi folks.

I've been thinking about the possibility of using compiled sprite code for displaying very large sprites at a playable framerate.
From my tests, I'm able to get a sprite of 16x16 character blocks moving around the screen very quickly and smoothly.
But of course, compiled sprites take up a lot more memory than sprite data does.

So I thought about using real-time decompression(!).
I compile the compiled sprite as a standalone binary, and then export that to a C array.
That C array (which is of course A LOT smaller than the full compiled sprite code) is then stored somewhere in my main test program.

The uncompressed binary file is 872 bytes (it's quite a simple compiled sprite!).
Exomizer compresses it to 81 bytes, BitBuster makes it 70 bytes, and LZ48 makes it 63 bytes.
(I imagine these ratios would be quite different for larger, more complex uncompressed binaries).

So I then wrote a version of my main program for each of the 3 compression methods.
In all cases, the sprite data is decompressed every frame, to a particular memory location.
Then that memory location is called, effectively running the compiled sprite code.

I was very surprised (and delighted) to discover that all 3 programs still ran at a very playable framerate!
Yes, with realtime decompression every frame!

Of course, the program is only decompressing a small amount of data, so it was very difficult to determine the speed differences between the 3 decompression methods.

So I thought I'd have each program decompress the sprite code 50 times per frame!
The results were very interesting.

To move the sprite from the left of the screen to the right ("scientifically" measured with a stopwatch), with 50 decompresses per frame:
Exomizer took 15 seconds
Bitbuster took 7.5 seconds
LZ48 took 5.5 seconds

A great result for LZ48!
I don't know how the compressed file sizes would compare for larger or more complex sprite code, but I need speed.
And LZ48 seems to be the best option.

[EDIT] It's important to note that the full run of screen redraws in this test takes almost exactly one second.
(The test involves 16 full screen clears and sprite redraws, while the sprite moves across the screen).

So the pure decompression times are 14s (exomizer), 6.5s (bitbuster), and 4.5s (lz48).
That gives a much more accurate impression of the decompression speed differences.

roudoudou · 08:23, 25 April 18

I don't know for bitbuster but when data are too complex, it must be very fast with any compressor as the data are copied with a LDIR
The speed loss is due to block management. Exomizer read this bit per bit so it's very slow. LZ48 is aligned to byte (one byte has data length & key length)
Glad LZ48 fits your needs as that's precisely the purpose it was designed for.

ervin · 11:32, 25 April 18

Quote from: roudoudou on 08:23, 25 April 18
I don't know for bitbuster but when data are too complex, it must be very fast with any compressor as the data are copied with a LDIR
The speed loss is due to block management. Exomizer read this bit per bit so it's very slow. LZ48 is aligned to byte (one byte has data length & key length)
Glad LZ48 fits your needs as that's precisely the purpose it was designed for.

It's perfect for what I am planning.

Thanks for adapting it for Z80.

roudoudou · 12:12, 25 April 18

Quote from: ervin on 11:32, 25 April 18
It's perfect for what I am planning.
Thanks for ~~adapting~~ creating it for Z80.

creating it, it came first on Z80

(both cruncher & decruncher)

ervin · 13:18, 25 April 18

Quote from: roudoudou on 12:12, 25 April 18
creating it, it came first on Z80 (both cruncher & decruncher)

Ah, I stand corrected.

Incidentally, your RASM project looks very interesting!
It may come in very handy for compiling my sprite code into individual binary files.

ervin · 14:00, 25 April 18

This is fantastic!

RASM, followed by LZ48, followed by BIN2C (https://sourceforge.net/projects/bin2c/?source=typ_redirect) has sped up my compiled sprite assembly, compression and conversion immensely!

Putting the following code into a .BAT file (on Windows) means that all those functions can be run together in less than a second!

Code Select


rasm.exe roadSprt.asm -ob roadSprt.bin
lz48.exe -i roadSprt.bin -o roadSprt.lz
bin2c.exe -o roadSprt.c roadSprt.lz

roudoudou · 14:50, 25 April 18

Rasm can generate LZ48 compressed code without external compressor, it will be even faster

this won't change label values inside the LZ section

Code Select


start
lz48
nop
nop
nop
nop
nop
ret
lzclose
end
save"sprite.lz",start,end-start

ervin · 15:22, 25 April 18

Quote from: roudoudou on 14:50, 25 April 18
Rasm can generate LZ48 compressed code without external compressor, it will be even faster
this won't change label values inside the LZ section

Code Select Expand
start lz48 nop nop nop nop nop ret lzclose end save"sprite.lz",start,end-start

That's beautiful!
It works really well.

Thanks so much for your help.

Arnaud · 17:25, 25 April 18

Really interesting !

Quote from: ervin on 23:26, 24 April 18
The uncompressed binary file is 872 bytes (it's quite a simple compiled sprite!).
Exomizer compresses it to 81 bytes, BitBuster makes it 70 bytes, and LZ48 makes it 63 bytes.

And it's always smaller than the original Sprite (256 bytes)

I have some questions :

- How your method is fast comparing to the classic cpct_drawSprite ?

- My understand of the compiled sprite, is the data is mixed with the code. How do you generate your compiled sprites ?

GUNHED · 19:55, 25 April 18

Very interesting. I would like to test LZ48 on CPC. Can somebody please point me to a link where to get a Cruncher (on PC or CPC) and a Decruncher (on CPC of course).

roudoudou · 20:21, 25 April 18

Quote from: GUNHED on 19:55, 25 April 18
Very interesting. I would like to test LZ48 on CPC. Can somebody please point me to a link where to get a Cruncher (on PC or CPC) and a Decruncher (on CPC of course).

Z80 versions in the first post of the official topic http://www.cpcwiki.eu/forum/programming/lz48-cruncherdecruncher/

let me know if you have trouble to use it

ervin · 01:08, 26 April 18

Quote from: Arnaud on 17:25, 25 April 18
Really interesting !
And it's always smaller than the original Sprite (256 bytes)

I have some questions :

- How your method is fast comparing to the classic cpct_drawSprite ?

- My understand of the compiled sprite, is the data is mixed with the code. How do you generate your compiled sprites ?

The sprite I'm drawing is 4096 bytes (when stored as pure bytes).
But it's a very simple sprite with a lot of repetition, so the compiled sprite code is much smaller than that.
Regardless, 4096 bytes are printed by my code.

Also, my compiled sprite has no masking/transparency built in.
However, my plan is to simulate byte-level transparency (not pixel-level), by simply skipping bytes which don't need to be printed. This will of course speed up sprites that aren't perfectly square or rectangular.
Having said that, I *can* simulate pixel-level transparency easily enough, by only applying it to pixels around the edges of the sprite, thereby not causing much slowdown.

Anyway, I created a quick version of my test program to use cpct_drawSprite.
It uses an array of 4096 bytes.

It's very difficult to accurately measure the performance, so I repeated the call to cpct_drawSprite 50 times for each frame.
(The test consists of moving the sprite across the screen, which takes 16 frames).
This test took 18 seconds to complete.
(cpct_drawSpriteMasked is MUCH slower of course).

I then tried the version of my code that uses the compiled sprite code.
Every time the sprite is drawn, the LZ48 compiled sprite code is first decompressed, then the sprite is displayed.
This was also done 50 times per frame, for 16 frames.
This test took 22 seconds to complete, which is quite a lot slower!

But of course, the entire test also contains 16*50=800 calls to the depacker.
That's quite a large overhead, but one that I am happy to live with.

Also, the compiled sprite code is not very efficient at the moment.
However, the fastest sprite code would also be the largest, and that would take the longest to decompress.
I'll need to find a good balance between compile sprite code size, and compressed data size, so that the compressed data doesn't take too long to decompress, but the sprite code still runs quite quickly.
This will be an interesting challenge!

Regarding generation of my compiled sprite code... I don't have a generator yet.
The compiled sprite code is (poorly) hand-written at the moment.
My next major task in this project is to write the code generator.

[EDIT] In the original post of this thread, I mentioned that the test consists of 32 frames. The correct number is 16 frames. I have edited the original post appropriately.

[EDIT #2] I have brought the compiled sprite test down to 19 seconds, with a few simple changes to the compiled sprite code.

ervin · 03:06, 07 May 18

Hi folks.

I've spent the last week+ trying different combinations of sprite sizes, compiled sprite code size etc.

It turns out that decompression (as expected) creates a bigger speed problem, than a large compiled sprite does.
For example, decompressing a large sprite 10 times per frame takes longer than displaying the sprite 10 times per frame.

I've attached an example program that demonstrates this observation.
The program presents 6 options, with different combinations of decrunches vs sprites per frame.

Stopwatch measurements of the 6 options are summarised:
(A) takes 3.3 seconds to move the sprite to the right of the screen.
(B) takes 6.6 seconds
(C) 3.8 seconds
(D) 7.1
(E) 7.9
(F) 11.3

In any case, we are dealing with a very large sprite here. There are limitations to the practicality of this technique.
I've had to create a very simple sprite (containing lots of repetition), that can have all sorts of size optimisations placed into the compiled sprite code.
This results in a smaller compressed sprite routine, and a smaller uncompressed sprite routine. This also makes the compiled sprite code slightly slower, but I can live with it (more looping, CALLs, etc).

Bringing the uncompressed sprite routine down in size made huge differences during the 10x decompression tests - the program got faster, as the uncompressed sprite code got smaller.

Axelay · 15:43, 08 May 18

If you're doing this in an emulator like Winape you could get more accurate timing than with a stopwatch, if you want a more precise time.

If you are using winape, when you hit f8 to bring up the debugger, you should see on the right between the registers and the stack a line labelled 'T' with a big number beside it, and a red x to the right. If you set up break points at the start and end of the routine you want to measure, hit the red x to reset the timer at the first breakpoint, at the second breakpoint the timer will tell you how many microseconds have passed.

There's a help page on the winape debugger here.

ervin · 02:49, 09 May 18

Quote from: Axelay on 15:43, 08 May 18
If you're doing this in an emulator like Winape you could get more accurate timing than with a stopwatch, if you want a more precise time.

If you are using winape, when you hit f8 to bring up the debugger, you should see on the right between the registers and the stack a line labelled 'T' with a big number beside it, and a red x to the right. If you set up break points at the start and end of the routine you want to measure, hit the red x to reset the timer at the first breakpoint, at the second breakpoint the timer will tell you how many microseconds have passed.

There's a help page on the winape debugger here.

That's fantastic!!!
Thanks so much.

I wish I had known this when I was writing RUNCPC!
I should have looked into it.
Ah well.

ervin · 04:56, 10 May 18

Alrighty, I now have a very nice balance between sprite code speed, uncompressed sprite code size, and compressed sprite code size! Very happy with it.

For the sort of game I'm thinking about, the performance is perfectly acceptable.

Also, I have clipping working along the top of the screen.
Clipping against the bottom should be pretty easy, with a bit of self-modifying code.
Left/right will be a bit more challenging - I'll probably allocate some buffer space to the left/right of the screen buffer to handle it.

Sykobee (Briggsy) · 15:24, 10 May 18

This sounds really useful for games with big bosses.

You can either have animated big bosses (obviously with the decompression per animation-frame update (which can be every few frames), or store multiple large boss static images for use during a single level of gameplay (one boss displayed at a time, you only need to decompress when you reach that boss).

Your 4KB original image, I take it that is 80x100 MODE 0 pixels in size (or 100x80, etc)? That's a decent size boss that the CPC never really had due to memory limitations. What does that compile to as ASM, and what did that ASM compress down to?

ervin · 00:53, 11 May 18

My current sprite is 32 x 96 mode 0 pixels in size, so 16 x 96 bytes.
In pure data terms that is 1536 bytes.

However, the compiled sprite code was well over 4KB when it was generated.
I put in various optimisations (skipping blank bytes, ADD HL,BC instead of multiple INC L operations, putting frequently used byte values into registers, etc etc etc) and got it down to 1186 bytes.

Although this also contains a jump table of 192 bytes, for use with vertical clipping. If no clipping is required (for example, a boss that doesn't go out of screen bounds), the compiled sprite code can of course be smaller.

Some figures:

WITHOUT jump table:
uncompressed = 994 bytes
compressed (LZ48) = 270 bytes

WITH jump table:
uncompressed = 1186 bytes
compressed = 463 bytes

The interesting thing is that uncompressed code size is the most important factor in getting the program to run at a good speed, as the decrunch routine has to work a lot harder than the sprite routine!

But that speed loss is compensated for by using compiled sprite code, rather than sprite data.
So in the end we have very large sprites being drawn at a relatively high speed.

ervin · 15:07, 11 May 18

I achieved a small breakthrough tonight, regarding compressed data size.

My compressed data was 463 bytes, and that contained the compiled sprite code *and* the jump table for top border clipping.
If I'm going to have lots of large sprites, 463 bytes is too much.

Until now, I've used the following technique:
The sprite's row1 code might start at address #9000, row2 at #9010, row3 at #9021, row4 at #902f etc.
My jump table stores all of these addresses (for 96 rows), and since all 96 values are different 2-byte numbers, that doesn't compress very well.

Then I had an idea... what if my jump table was simply a list of differences in memory addresses between each row of the sprite code?
I changed the jump table to a "gap table", which became a list of 96 1-byte numbers.

Code Select


gapTable:
defb row2-row1
defb row3-row2
defb row4-row3
...

So the gap table would be a list containing #10,#10,#11,#09 etc.

Then I rewrote the code that determines which row to jump to, in the event of top border clipping.
That worked nicely, but the best thing was the compressed data size... 291 bytes!!!

That was a total shock, and a very nice surprise.
This technique has the potential for excellent compression due to data repetition.

ronaldo · 15:42, 14 May 18

Quote from: ervin on 01:08, 26 April 18
It's very difficult to accurately measure the performance, so I repeated the call to cpct_drawSprite 50 times for each frame.
(The test consists of moving the sprite across the screen, which takes 16 frames).
This test took 18 seconds to complete.
(cpct_drawSpriteMasked is MUCH slower of course).

Just a side note. You don't need to measure performance for CPCtelera routines, as all of them are already measured.

You only need to go to the reference manual. Every function has a section with measures at the end of its documentation. For example, this is the one for cpct_drawSprite:
[attach=1]

With some simple calculations you can know the exact amount of microseconds that the function will take. You can then use WinAPE timer to measure other functions, as has already been suggested.

You also have cpct_bin2c included in CPCtelera. In fact, it works automatically: any binary file found under src/ folder gets automatically converted into C each time it changes (when you launch make).

Moreover, Antonio Villena already made comparisons between many compressors (not including LZ48, I think). You can check them in the documentation of zx7b.

ervin · 23:57, 14 May 18

Thanks ronaldo, that's very useful information.

My only reason for measuring repeated calls to cpctelera's (excellent) sprite routines, was to compare them with real-time decompression and execution of a compiled sprite. Cpctelera's routines were the benchmarks I wanted to strive for.

Sykobee (Briggsy) · 15:46, 17 May 18

Having a 291 byte compiled sprite versus a 1186 byte compiled sprite or a 1536 byte raw data sprite is a really nice achievement.
You can fit 4 to 5 big sprites in the memory of a single one without the compression.
Of course you need to set aside memory area to decompress into.

The 'relative jump' idea is great to save memory, but don't you need to walk the jump table to get the first address now for clipped sprites? The memory savings are more than worth it though, and the time to walk the jump table is likely a lot less than the time to render that amount of sprite.

ervin · 00:16, 18 May 18

Quote from: Sykobee (Briggsy) on 15:46, 17 May 18
Having a 291 byte compiled sprite versus a 1186 byte compiled sprite or a 1536 byte raw data sprite is a really nice achievement.
You can fit 4 to 5 big sprites in the memory of a single one without the compression.
Of course you need to set aside memory area to decompress into.

The 'relative jump' idea is great to save memory, but don't you need to walk the jump table to get the first address now for clipped sprites? The memory savings are more than worth it though, and the time to walk the jump table is likely a lot less than the time to render that amount of sprite.

You're exactly right.

I do indeed loop over the jump table, adding up the amount of bytes to skip, in order to find the memory address to start executing the sprite code from.

Before that though, I check if clipping is required.
If it isn't needed, I jump straight to the beginning of the sprite code.
Otherwise, I do a very small loop that calculates the memory address of the first visible line of the sprite.
And yes, this little overhead is easily offset by not having to draw the clipped parts of the sprite.

For anyone interested, the beginning of the sprite routine looks like this:

Code Select


org #9000

rowGaps:
defb row1-row0
defb row2-row1
defb row3-row2
defb row4-row3
defb row5-row4
defb row6-row5
defb row7-row6
defb row8-row7
defb row9-row8
defb row10-row9
defb row11-row10
defb row12-row11
defb row13-row12
defb row14-row13
defb row15-row14
defb row16-row15
; etc...

; *** CLIP_TOP BEGIN

ld ix,row0 ; default IX to first line of sprite code

ld a,(CLIP_TOP) ; how many lines are being clipped?
cp 0
jr z,clipTop_JP ; if 0, jump to the end of clipTop

ld b,a ; number of rows to skip
ld hl,rowGaps
ld d,0

clipTop_LOOP:
ld e,(hl)
add ix,de
inc l
djnz clipTop_LOOP

clipTop_JP:
ld hl,(BUFFER) ; memory address for top/left of sprite
jp (ix) ; JP to the first visible line of the sprite

; *** CLIP_TOP END

row0: ; first line of sprite code
; etc...

News:

Decompression: Exomizer vs BitBuster vs LZ48