Here are some example routines for drawing packed mode 0 sprites.
I wanted to see how much slower they are...
;;--------------------------------------------------
;; 3BPP PACKED SPRITES
;;--------------------------------------------------
For 3bpp we assume that the sprite data is stored in three-byte packages.
One package contains the pixel data needed for drawing four pixel bytes.
This way we save 25% of the memory needed to store the sprite data.
We zero one bit pair of each pixel byte, so the number of different colors that can be used for the sprite graphics is limited to eight.
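To make the "bit pair" idea concrete, here is a small Python sketch (not part of the original post) that builds a mode 0 screen byte from two inks, assuming the standard CPC mode 0 bit layout. The names mode0_byte and is_3bpp are made up for this illustration. It shows that inks 0 to 7 always leave the rightmost bit pair clear, which is the pair the 3bpp packing drops.

```python
def mode0_byte(left_ink, right_ink):
    """Build one CPC mode 0 screen byte from two inks (0..15)."""
    def spread(ink, shift):
        # Left pixel: ink bit 0 -> bit 7, bit 2 -> bit 5, bit 1 -> bit 3, bit 3 -> bit 1.
        # The right pixel uses the same pattern moved one bit to the right.
        bits = (((ink >> 0) & 1) << 7) | (((ink >> 2) & 1) << 5) \
             | (((ink >> 1) & 1) << 3) | (((ink >> 3) & 1) << 1)
        return bits >> shift
    return spread(left_ink, 0) | spread(right_ink, 1)


def is_3bpp(byte):
    """True if the rightmost bit pair (ink bit 3 of both pixels) is clear."""
    return (byte & 0b00000011) == 0


if __name__ == "__main__":
    print(f"{mode0_byte(7, 7):08b}")   # both pixels ink 7
    print(is_3bpp(mode0_byte(8, 0)))   # ink 8 needs the dropped pair
```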
If we count how many of the three significant bit pairs of the four packed pixel bytes we store in each byte of a package, we get the following combinations:
1. |3-1|3-1|3-1|
2. |3-1|2-2|3-1|
3. |1-1-1-1|2-2|2-2|
4. |1-1-1-1|1-1-1-1|1-1-1-1|
Where, for example, 1-1-2 means that in this particular byte we store one bit pair from one pixel byte, one bit pair from another pixel
byte and two bit pairs from yet another pixel byte.
In the package format description, pij means the bit pair holding bit j of both pixels in the i-th pixel byte.
I present only the actual depacking-drawing code...
These are not complete sprite routines - you would have to add loops and next_line code.
Note that for the 3bpp and 1bpp packing, the sprite's width (in bytes) should be divisible by four and for 2bpp it should be divisible by two.
;;-------------------------------------
;; 3BPP - STORING METHOD: |1-1-1-1|2-2|2-2|
;;-------------------------------------
;; package format:
;; 1st byte 2nd byte 3rd byte
;; | p30_p20_p10_p00 || p02_p01_p12_p11 || p22_p21_p32_p31 |
;;
;; hl=^sprite
;; de=^screen
;; b - used as an auxiliary register
;; c - preloaded with 11110000b bit mask
;;
ld b,(hl) ; get first byte of the package
; this byte contains all bit 0 pairs of the four pixel bytes
inc hl ; go to next byte of data
; extract first pixel byte
ld a,(hl) ; get second byte of package
; this byte contains bit 2 and 1 pairs of first two pixel bytes
and c ; mask to get only the data for the first pixel byte
rrc b ; now get the bit 0 pair for this pixel
; from the first group byte (stored in b)
; and get it into its place in the accumulator
; we achieve this by rotating the b register right
; this loads its least significant bit into carry
rra ; then we use rra to load the carry flag into
; the most significant bit of the accumulator
rrc b ; we have to do this twice
rra ; to move the pair of bits
ld (de),a ; save to screen
inc de ; go to next screen position
; extract second pixel byte
ld a,(hl) ; reload the second byte of package (we need to extract from it the bit pairs of the second pixel)
rlca ;
rlca ; rotate four times to swap the nibbles
rlca ; and get the pairs p12_p11 into place
rlca ;
and c ; mask to isolate bits for this pixel byte
rrc b ;
rra ; extract the bit 0 pair from the first byte of package
rrc b ; and load it to accumulator to complete the second pixel byte
rra ;
ld (de),a ; save to screen
inc de ; go to next screen position
inc hl ; go to next data byte
;
; for pixels three and four we repeat all the
; operations for the first two pixel bytes
;
; extract third pixel byte
ld a,(hl)
and c
rrc b
rra
rrc b
rra
ld (de),a
inc de
; extract fourth pixel byte
ld a,(hl)
rlca
rlca
rlca
rlca
and c
rrc b
rra
rrc b
rra
ld (de),a
inc de
inc hl
This code takes 68 nops (61 if aligned).
It's 17 nops per pixel byte.
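The depacking routine only makes sense together with a tool that produces the packages. Below is a rough Python sketch of such a packer plus a byte-for-byte model of the routine above, useful for checking the layout on a PC. It is not from the original post; the function names are made up, and the layout (first byte holds all bit 0 pairs with pixel byte 0 in the lowest pair, second and third bytes hold the bit 2 and bit 1 pairs as nibbles) is read off the instructions of the routine.

```python
def pack_1111_22_22(px):
    """Four mode 0 bytes (bits 1-0 clear) -> three packed bytes."""
    assert len(px) == 4 and all((p & 3) == 0 for p in px)
    bit0 = [(p >> 6) & 3 for p in px]        # bit 0 pair of each pixel byte
    rest = [(p >> 2) & 0x0F for p in px]     # bit 2 and bit 1 pairs as a nibble
    first = (bit0[3] << 6) | (bit0[2] << 4) | (bit0[1] << 2) | bit0[0]
    return bytes([first, (rest[0] << 4) | rest[1], (rest[2] << 4) | rest[3]])


def unpack_1111_22_22(pk):
    """Model of the Z80 routine: returns the four screen bytes."""
    b, out = pk[0], []
    for i in range(4):
        data = pk[1 + i // 2]
        # even pixel bytes: "and c"; odd ones: four rlca, then "and c"
        a = (data & 0xF0) if i % 2 == 0 else ((data << 4) & 0xF0)
        for _ in range(2):                   # rrc b / rra, twice
            carry = b & 1
            b = (b >> 1) | (carry << 7)
            a = (a >> 1) | (carry << 7)
        out.append(a)
    return bytes(out)


if __name__ == "__main__":
    sprite = bytes([0xFC, 0x84, 0x48, 0x30])
    packed = pack_1111_22_22(sprite)
    print(packed.hex(), unpack_1111_22_22(packed) == sprite)
```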
;;-------------------------------------
;; 3BPP - STORING METHOD: |3-1|2-2|3-1|
;;-------------------------------------
;; package format:
;; 1st byte 2nd byte 3rd byte
;; | p00_p02_p01_p10 || p12_p11_p32_p31 || p20_p22_p21_p30 |
;;
;; hl=^sprite
;; de=^screen
;; b, c used as auxiliary registers
;;
; unpack first pixel
ld c,(hl) ; load c with first byte of package
ld a,c ; we'll need it later;
and 11111100b ; mask to isolate the bits of the first pixel byte;
ld (de),a ; save to screen;
inc de ; next screen position;
inc hl ; go to next data byte;
; unpack second pixel
ld b,(hl) ; get second byte of package and
ld a,b ; store it in b for later use;
and 11110000b ; mask to isolate the bit pairs p12_p11 of the second pixel byte;
rrc c ; next four instructions
rra ; combine the p10 from first byte (stored in c)
rrc c ; with p12_p11 from second byte (in a)
rra ; to get the second pixel byte;
ld (de),a ; save to screen;
inc de
inc hl
; unpack third pixel
ld c,(hl) ; get third byte of package and
ld a,c ; store in c for later;
and 11111100b ; mask to isolate bits of the third pixel byte;
ld (de),a ; save to screen;
inc de
; unpack fourth pixel
ld a,b ; load a with the second byte
rlca ; rotate four times to swap the nibbles
rlca ; of the byte
rlca
rlca
and 11110000b ; isolate the bits of the fourth pixel;
rrc c ; move p30 from third byte
rra ; into bits 7 and 6 in accumulator
rrc c ; which contained bit pairs p32_p31;
rra ;
ld (de),a
inc de
inc hl
It's 56 nops (49 if aligned) per package.
This means 14(12.25) nops per pixel byte.
;;-------------------------------------
;; 3BPP - STORING METHOD: |3-1|3-1|3-1|
;;-------------------------------------
;; package format:
;; 1st byte 2nd byte 3rd byte
;; | p00_p02_p01_p21 || p10_p12_p11_p22 || p20_p30_p32_p31 |
;;
;; hl=^sprite
;; de=^screen
;; b, c used as auxiliary registers
;;
; extract first pixel byte
ld c,(hl) ; load c with the first byte of the package
ld a,c ; copy to accumulator
and 11111100b ; mask to isolate only the bits of the first pixel byte;
ld (de),a ; save to screen;
inc de ; next screen position;
inc hl ; go to next data byte;
; extract second pixel byte
ld b,(hl) ; load b with the second byte of the package
ld a,b ; copy to accumulator
and 11111100b ; mask to get only the bits of the second pixel byte;
ld (de),a ; save to screen;
inc de ; next screen position;
inc hl ; go to next data byte;
; extract third pixel byte
xor a
rrc c: rra: rrc c: rra ; get bit 1 pair from the first byte
rrc b: rra: rrc b: rra ; get bit 2 pair from the second byte
ld b,(hl) ; load b with the third byte of package
sla b: rra: sla b: rra ; get bit 0 pair from the third byte
; (we use sla to zero the rightmost bit pair of b;
; sla shifts out bit 7 first, so the two bits of p20 have to be stored swapped)
ld (de),a ; save to screen;
inc de ; next screen position;
; extract fourth pixel byte
ld a,b ; copy the (now) shifted third byte to accumulator
ld (de),a ; save to screen;
inc de ; next screen position;
inc hl ; go to next data byte;
This takes 54 nops (47 if aligned) per package.
This means 13.5 (11.75) nops per byte.
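This could be done 2 nops quicker and without using the b register. For this the borrowed bit pair has to sit in the top two bits of the first two bytes, so that sla can shift it out and leave a finished pixel byte behind (rrc would leave the borrowed pair in the byte that gets written to the screen):
;;-------------------------------------
;; 3BPP - STORING METHOD: |3-1|3-1|3-1|
;;-------------------------------------
;; package format:
;; 1st byte 2nd byte 3rd byte
;; | p21_p00_p02_p01 || p22_p10_p12_p11 || p20_p30_p32_p31 |
;; (the two bits of p21, p22 and p20 are stored swapped: sla/rra reverses them)
;;
;; hl=^sprite
;; de=^screen
;; c used as an auxiliary register
;;
; unpack first pixel
ld c,(hl) ; first byte of package
xor a
sla c
rra
sla c
rra ; now a = p21-0-0-0, c = p00_p02_p01_0
ex de,hl
ld (hl),c ; save 1st pixel byte to screen
ex de,hl
inc de
inc hl
; unpack second pixel
ld c,(hl) ; second byte of package
sla c
rra
sla c
rra ; now a = p22-p21-0-0, c = p10_p12_p11_0
ex de,hl ; from here on hl=^screen and de=^sprite
ld (hl),c ; save 2nd pixel byte to screen
inc de ; next data byte
inc hl ; next screen position
; unpack third pixel
ex de,hl ; back to hl=^sprite and de=^screen
ld c,(hl) ; third byte of package
sla c
rra
sla c
rra ; now a = p20-p22-p21-0, c = p30_p32_p31_0
ld (de),a
inc de
; unpack fourth pixel
ld a,c
ld (de),a
inc de
inc hl
This is 52 nops (45 for aligned version) per package.
13 (11.25) nops per pixel byte.
A little later I found that it could be done another 2 nops quicker by storing the first two pixel bytes back to back, which saves one ex de,hl pair.
However, this time we make use of the b register...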
;;-------------------------------------
;; 3BPP - STORING METHOD: |3-1|3-1|3-1|
;;-------------------------------------
;; package format:
;; 1st byte 2nd byte 3rd byte
;; | p21_p00_p02_p01 || p22_p10_p12_p11 || p20_p30_p32_p31 |
;; (the two bits of p21, p22 and p20 are stored swapped: sla/rra reverses them)
;;
;; hl=^sprite
;; de=^screen
;; b, c used as auxiliary registers
;;
; unpack first pixel
ld c,(hl) ; load c with the first byte of package
xor a ; zero the accumulator
sla c ; ...
rra ; shift bit pair p21 into the leftmost bit pair of the accumulator
sla c ; (a = p21-0-0-0, c = p00_p02_p01_0)
rra ; ...
inc hl ; go to next data byte;
; unpack second pixel
ld b,(hl) ; load b with the second byte of package
sla b ; ...
rra ; shift bit pair p22 into the leftmost bit pair of the accumulator
sla b ; (a = p22-p21-0-0, b = p10_p12_p11_0)
rra ; ...
inc hl ; go to next data byte;
; store the first two pixel bytes, then unpack third pixel
ex de,hl
ld (hl),c ; save 1st pixel byte to screen
inc hl ; go to next screen position
ld (hl),b ; save 2nd pixel byte to screen
inc hl ; go to next screen position
ex de,hl
ld c,(hl) ; load c with the last byte of package
sla c ; ...
rra ; shift bit pair p20 into the leftmost bit pair of the accumulator
sla c ; (a = p20-p22-p21-0, c = p30_p32_p31_0)
rra ; ...
ld (de),a ; save 3rd pixel byte to screen;
inc de ; next screen position;
; unpack fourth pixel
ld a,c ;
ld (de),a ; save last pixel byte to screen;
inc de ; next screen position;
inc hl ; go to next data byte;
This version takes 50 nops (43 for aligned version) per package.
12.5 nops per pixel byte (10.75 if aligned).
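One detail the data converter has to respect in this format: sla hands bit 7 to the carry first, and rra then pushes it one place to the right when the second bit arrives, so a pair moved with sla/rra lands in the accumulator with its two bits reversed. The Python sketch below (my own illustration, with hypothetical function names) packs four pixel bytes into the format above, storing p21, p22 and p20 pre-swapped, and models the routine to confirm that the screen bytes come back unchanged.

```python
def swap2(pair):
    """Swap the two bits of a bit pair (0..3)."""
    return ((pair & 1) << 1) | (pair >> 1)


def pack_31_31_31(px):
    """Four mode 0 bytes (bits 1-0 clear) -> | p21.. | p22.. | p20.. |."""
    assert len(px) == 4 and all((p & 3) == 0 for p in px)
    p20, p22, p21 = (px[2] >> 6) & 3, (px[2] >> 4) & 3, (px[2] >> 2) & 3
    return bytes([(swap2(p21) << 6) | (px[0] >> 2),
                  (swap2(p22) << 6) | (px[1] >> 2),
                  (swap2(p20) << 6) | (px[3] >> 2)])


def sla_rra_twice(reg, a):
    """Model of 'sla reg / rra' executed twice; returns (reg, a)."""
    for _ in range(2):
        carry = reg >> 7
        reg = (reg << 1) & 0xFF
        a = (carry << 7) | (a >> 1)
    return reg, a


def unpack_31_31_31(pk):
    """Model of the 50-nop routine: returns the four screen bytes."""
    first, a = sla_rra_twice(pk[0], 0)
    second, a = sla_rra_twice(pk[1], a)
    fourth, third = sla_rra_twice(pk[2], a)
    return bytes([first, second, third, fourth])


if __name__ == "__main__":
    sprite = bytes([0xFC, 0x84, 0x48, 0x30])
    packed = pack_31_31_31(sprite)
    print(packed.hex(), unpack_31_31_31(packed) == sprite)
```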
;;--------------------------------------------------
;; 3BPP - STORING METHOD: |1-1-1-1|1-1-1-1|1-1-1-1|
;;--------------------------------------------------
;;
;; package format:
;; 1st byte 2nd byte 3rd byte
;; | p31_p21_p11_p01 || p32_p22_p12_p02 || p30_p20_p10_p00 |
;;
;; hl'=^sprite
;; hl=^screen
;; b', c' and d' used as auxiliary registers
;;
; load b,c,d with the package bytes
exx
ld b,(hl)
inc hl
ld c,(hl)
inc hl
ld d,(hl)
inc hl
; extract first pixel byte
xor a
rrc b: rra: rrc b: rra
rrc c: rra: rrc c: rra
rrc d: rra: rrc d: rra
exx
ld (hl),a
inc hl
; extract second pixel byte
exx
xor a
rrc b: rra: rrc b: rra
rrc c: rra: rrc c: rra
rrc d: rra: rrc d: rra
exx
ld (hl),a
inc hl
; extract third pixel byte
exx
xor a
rrc b: rra: rrc b: rra
rrc c: rra: rrc c: rra
rrc d: rra: rrc d: rra
exx
ld (hl),a
inc hl
; extract fourth pixel byte
exx
xor a
rrc b: rra: rrc b: rra
rrc c: rra: rrc c: rra
rrc d: rra: rrc d: rra
exx
ld (hl),a
inc hl
This is very slow - 112 nops (105 if aligned) per package.
This means 28 nops per pixel byte.
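For completeness, a Python sketch of the matching packer for this plane-separated layout, again with a model of the routine for checking. It is only an illustration under the format given in the header above; the function names are invented. Since the routine uses rrc/rra throughout, no bit swapping is needed here.

```python
def pack_planar(px):
    """Four mode 0 bytes (bits 1-0 clear) -> bit 1 plane, bit 2 plane, bit 0 plane."""
    assert len(px) == 4 and all((p & 3) == 0 for p in px)
    plane1 = plane2 = plane0 = 0
    for i, p in enumerate(px):
        plane1 |= ((p >> 2) & 3) << (2 * i)   # pixel byte 0 goes into the lowest pair
        plane2 |= ((p >> 4) & 3) << (2 * i)
        plane0 |= ((p >> 6) & 3) << (2 * i)
    return bytes([plane1, plane2, plane0])


def unpack_planar(pk):
    """Model of the routine: b, c, d are rotated, a collects three pairs."""
    regs, out = list(pk), []
    for _ in range(4):
        a = 0                                 # xor a
        for r in range(3):                    # b, then c, then d
            for _step in range(2):            # rrc reg / rra, twice
                carry = regs[r] & 1
                regs[r] = (regs[r] >> 1) | (carry << 7)
                a = (a >> 1) | (carry << 7)
        out.append(a)
    return bytes(out)


if __name__ == "__main__":
    sprite = bytes([0xFC, 0x84, 0x48, 0x30])
    packed = pack_planar(sprite)
    print(packed.hex(), unpack_planar(packed) == sprite)
```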
;;--------------------------------------------------
;; 2BPP PACKED SPRITES
;;--------------------------------------------------
I more-or-less covered this in Dual playfield mode 0 sprites + packing (http://www.cpcwiki.eu/forum/programming/dual-playfield-mode-0-sprites-packing/)
The possibility to pack two versions of the same sprite (for flipping/shifting) in one seems especially interesting...
;;--------------------------------------------------
;; 1BPP PACKED SPRITES
;;--------------------------------------------------
Here we limit the number of colors used by the sprite to two.
This way, all we need for one pixel byte is one bit pair so we save 75% of needed memory.
So the "packages" are one byte and each stores data needed for four pixel bytes.
;;--------------------------------------------------
;; 1BPP
;;--------------------------------------------------
;; package format:
;; | p00_p10_p20_p30 |
;; (the two bits of p00, p10 and p20 are stored swapped: sla/rra reverses them)
;;
;; use inks 0 and 1 for sprite
;;
;; hl=^sprite
;; de=^screen
;; c - used as an auxiliary register
;;
; extract 1st pixel byte
xor a
ld c,(hl)
sla c
rra
sla c
rra
ld (de),a
inc de
; extract 2nd pixel byte
xor a
sla c
rra
sla c
rra
ld (de),a
inc de
; extract 3rd pixel byte
xor a
sla c
rra
sla c
rra
ld (de),a
inc de
; extract 4th pixel byte
ld a,c
ld (de),a
inc de
inc hl
42(37) nops per package.
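A Python sketch of a packer for this 1bpp format, with a model of the routine above for checking the result. This is an illustration only, with made-up function names. It takes eight 0/1 pixels, left to right, and accounts for the fact that the first three pairs pass through sla/rra (which reverses the two bits) while the fourth pair is written as it is left in c.

```python
def pack_1bpp(pixels):
    """Eight 0/1 pixels (left to right) -> one packed byte."""
    assert len(pixels) == 8 and all(p in (0, 1) for p in pixels)
    packed = 0
    for i in range(4):
        left, right = pixels[2 * i], pixels[2 * i + 1]
        # pairs 0..2 are stored swapped, the last pair in screen order
        pair = ((right << 1) | left) if i < 3 else ((left << 1) | right)
        packed = (packed << 2) | pair
    return packed


def unpack_1bpp(packed):
    """Model of the routine: returns the four screen bytes."""
    c, out = packed, []
    for _ in range(3):
        a = 0                          # xor a
        for _step in range(2):         # sla c / rra, twice
            carry = c >> 7
            c = (c << 1) & 0xFF
            a = (carry << 7) | (a >> 1)
        out.append(a)
    out.append(c)                      # ld a,c: the last pair is already in bits 7-6
    return out


if __name__ == "__main__":
    value = pack_1bpp([1, 0, 0, 1, 1, 1, 0, 0])
    print(f"{value:08b}", [f"{b:02x}" for b in unpack_1bpp(value)])
```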
We could preload b with a "color offset" and then we could "choose" which of the possible color pairs we want to use for the sprite.
These possible pairs are:
0-1, 2-3, 4-5, 6-7, 8-9, a-b, c-d, e-f
But if we wanted to use transparency then we would have to set 0=2=4=6=8=a=c=e.
So only 8 different colors then.
;;--------------------------------------------------
;; 1BPP
;;--------------------------------------------------
;; package format:
;; | p00_p10_p20_p30 |
;;
;; hl=^sprite
;; de=^screen
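;; b = "palette chooser": the bit 2, bit 1 and bit 3 pairs of the chosen even ink
;; (2,4,6,8,10,12,14), shifted left by two bits, because the rra instructions
;; below shift them right by two again (e.g. 00110000b for inks 2-3)
;; c - used as an auxiliary register
; extract 1st pixel byte
ld a,b
ld c,(hl)
sla c
rra
sla c
rra
ld (de),a
inc de
; extract 2nd pixel byte
ld a,b
sla c
rra
sla c
rra
ld (de),a
inc de
; extract 3rd pixel byte
ld a,b
sla c
rra
sla c
rra
ld (de),a
inc de
; extract 4th pixel byte
and 00111111b ; a still holds the 3rd pixel byte: keep only the palette bits
or c ; c = p30_0_0_0 after the six shifts
ld (de),a
inc de
inc hl
44(39) nops per package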
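How the "palette chooser" value could be derived, as a Python sketch (my own, with hypothetical names, assuming the standard mode 0 bit layout). For an even ink, the chooser is the part of a screen byte that stays the same for both inks of the pair, i.e. bits 5..0. Because each pair of rra instructions shifts the accumulator right by two, the value that is loaded into the accumulator before the sla/rra steps has to be that pattern shifted left by two.

```python
def chooser_bits(ink):
    """Bits 5..0 of a mode 0 byte with both pixels set to an even ink."""
    assert ink % 2 == 0 and 0 <= ink <= 14
    bit2, bit1, bit3 = (ink >> 2) & 1, (ink >> 1) & 1, (ink >> 3) & 1
    return (0b00110000 * bit2) | (0b00001100 * bit1) | (0b00000011 * bit3)


def preload_value(ink):
    """Value to preload when two rra steps will shift it right by two bits."""
    return (chooser_bits(ink) << 2) & 0xFF


def screen_byte(pair, ink):
    """Screen byte for one bit 0 pair (0..3) drawn with inks ink / ink+1."""
    return (pair << 6) | (preload_value(ink) >> 2)


if __name__ == "__main__":
    for ink in range(2, 16, 2):
        print(ink, f"{preload_value(ink):08b}")
```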
Here's an idea for a 1 nop per package optimization of the first 1bpp example, but we must sacrifice some more colors...
;;--------------------------------------------------
;; 1BPP
;;--------------------------------------------------
;; package format:
;; | p00_p10_p20_p30 |
;;
;; hl=^sprite
;; de=^screen
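;; b,c - used as auxiliary registers
;;
;; use inks 0 and 1 for the sprite; the 2nd pixel byte is built in bits 1-0
;; (ink bit 3), so it comes out in inks 0 and 8 and ink 8 must be set to the
;; same color as ink 1 (so we must sacrifice a color for this optimization)
;;
; extract 1st pixel byte
xor a
ld c,(hl)
sla c
rra
sla c
rra
ld (de),a
inc de
; extract 2nd pixel byte
sla c ; no xor a needed: adc a,a shifts the old bit pair out to the left
adc a,a ; and the new one into bits 1-0
sla c
adc a,a
ld (de),a
inc de
; extract 3rd pixel byte
ld b,a ; (b is not read again below)
sla c ; no xor a needed here either: rra shifts bits 1-0 out to the right
rra
sla c
rra
ld (de),a
inc de
; extract 4th pixel byte
ld a,c
ld (de),a
inc de
inc hl
41(36) nops per package.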
All presented examples could be slightly faster by using the stack as data source...
For comparison, standard drawing of four unmasked pixel bytes takes:
- 4*8=32 nops (24 if aligned) for ld version (allows masking etc.)
- 4*5=20 nops for ldi version
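To put the figures side by side, here is a tiny Python sketch that turns the per-package counts quoted in this post into nops per pixel byte and into a slowdown factor against the plain ld and ldi loops. The dictionary only repeats numbers from the text (non-aligned versions); the names are mine.

```python
NOPS_PER_PACKAGE = {
    "3bpp |1-1-1-1|2-2|2-2|": 68,
    "3bpp |3-1|2-2|3-1|": 56,
    "3bpp |3-1|3-1|3-1| (fastest version)": 50,
    "3bpp |1-1-1-1|1-1-1-1|1-1-1-1|": 112,
    "1bpp (first version)": 42,
}
PLAIN_LD = 32    # four pixel bytes with the ld version
PLAIN_LDI = 20   # four pixel bytes with the ldi version


def per_byte(nops_per_package):
    """Each package expands to four pixel bytes."""
    return nops_per_package / 4


def slowdown(nops_per_package, baseline=PLAIN_LD):
    """How many times slower than drawing unpacked data."""
    return nops_per_package / baseline


if __name__ == "__main__":
    for name, nops in NOPS_PER_PACKAGE.items():
        print(f"{name:40s} {per_byte(nops):6.2f} nops/byte  "
              f"x{slowdown(nops):.2f} vs ld  x{slowdown(nops, PLAIN_LDI):.2f} vs ldi")
```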
-----------------------------
Maybe someone has other ideas for packing sprite data?
Quote from: sigh on 12:10, 13 December 14
So what is the average NOP speed that is acceptable?
I'm not able to answer your question... I think (maybe this is stupid) that any speed is acceptable if it allows you to do what you want within a reasonable number of frames... provided that what you want is feasible on the machine, of course...
So it depends on how many and how big sprites you want to draw, what's your target framerate etc....
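A back-of-the-envelope way to turn this into numbers, as a Python sketch. It assumes a frame budget of 19968 nops (312 scanlines of 64 nops each at 50 Hz), which is my assumption and not a figure from this thread, and it ignores loop, next_line and game logic overhead unless you pass it in. The function name and parameters are invented for the example.

```python
FRAME_NOPS = 312 * 64   # assumed budget for one 50 Hz frame


def sprites_per_frame(width_bytes, height, nops_per_byte,
                      overhead_per_line=0, frames=1):
    """Upper bound on how many sprites fit into the given number of frames."""
    cost = height * (width_bytes * nops_per_byte + overhead_per_line)
    return int((FRAME_NOPS * frames) // cost)


if __name__ == "__main__":
    # 4 bytes wide, 16 lines high
    print(sprites_per_frame(4, 16, 8))      # plain ld version
    print(sprites_per_frame(4, 16, 12.5))   # fastest 3bpp packed version
```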
The quickest you could draw one opaque pixel byte without background saving is by using compiled sprites:
ld (hl),value
inc hl
So 5(4) nops per byte. You could make it a little faster for some of the more frequent byte values if you preload them into whatever free registers you are left with:
ld (hl),c
inc hl
It's 4(3) nops. If you use stack-compiled sprites:
ld hl, pixelbyte_pair
push hl
This makes it 3.5 nops per byte. Using preloading you could get 2 nops per byte (one push) for some of the pixel byte pairs.
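Compiled sprites are normally generated by a tool rather than typed in. The Python sketch below (an illustration with made-up function names, not an existing tool) emits the two instruction patterns shown above for one sprite row and adds up the nop counts quoted in this post. For the stack version it assumes sp has been pointed just past the right end of the row, so the pairs are pushed from right to left, and the byte further to the right goes into h.

```python
def compile_row_ld(row):
    """Emit 'ld (hl),n / inc hl' for each byte: 5 nops per byte."""
    lines, nops = [], 0
    for value in row:
        lines += [f"ld (hl),{value}", "inc hl"]
        nops += 3 + 2
    return lines, nops


def compile_row_push(row):
    """Emit 'ld hl,nn / push hl' per byte pair: 7 nops per two bytes."""
    assert len(row) % 2 == 0, "the stack version writes two bytes at a time"
    lines, nops = [], 0
    for i in range(len(row) - 2, -1, -2):      # rightmost pair first
        word = (row[i + 1] << 8) | row[i]      # push stores h above l in memory
        lines += [f"ld hl,{word}", "push hl"]
        nops += 3 + 4
    return lines, nops


if __name__ == "__main__":
    for line in compile_row_push([0x11, 0x22, 0x33, 0x44])[0]:
        print(line)
```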
If you have only a few smaller sprites on the screen then you don't need to worry about speed that much. And if these sprites have a lot of animation frames then maybe you would have to consider saving some memory... In that case you could consider using packed sprites... But note that you would be able to draw at most half as many of them in a single frame time and you would have to sacrifice some quality of your graphics (fewer colors).
"Packing" can be used with standard and compressed sprites.