drawing packed mode 0 sprites

ssr86 · 17:25, 12 December 14

Here are some example routines for drawing packed mode 0 sprites.
I wanted to see how much slower they are than drawing unpacked data...

;;--------------------------------------------------
;; 3BPP PACKED SPRITES
;;--------------------------------------------------

For the 3bpp we assume that the sprite data is stored in three-byte packages.
One package contains the pixel data needed to draw four pixel bytes.
This way we save 25% of the memory needed to store the sprite data.
We zero one bit pair of each pixel byte, so the number of different colors that can be used for the sprite graphics is limited to eight.

If we count how many of the three significant bit pairs of the four packed pixel bytes we store in each byte of a package, we get the following combinations:

1. |3-1|3-1|3-1|
2. |3-1|2-2|3-1|
3. |1-1-1-1|2-2|2-2|
4. |1-1-1-1|1-1-1-1|1-1-1-1|

Where, for example, 1-1-2 means that in this particular byte we store one bit pair from one pixel byte, one bit pair from another pixel byte and two bit pairs from yet another pixel byte.

In the package format description, pij means the j-th pixel bit pair of the i-th pixel byte.
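To make the notation concrete, here is a minimal Python sketch (meant for the PC side, not CPC code) of how a mode 0 pixel byte splits into bit pairs. It assumes the standard CPC mode 0 bit layout and that j in pij is the ink bit the pair encodes; the names PAIR_BITS, pixel_byte and bit_pair are made up for this illustration.

```python
# (left pixel bit, right pixel bit) positions of each ink bit in a mode 0 byte.
# Pair j of the text is taken to be the pair that holds ink bit j.
PAIR_BITS = {0: (7, 6), 2: (5, 4), 1: (3, 2), 3: (1, 0)}


def pixel_byte(left_ink, right_ink):
    """Encode two mode 0 inks (0..15) into one screen byte."""
    value = 0
    for ink_bit, (left_pos, right_pos) in PAIR_BITS.items():
        value |= ((left_ink >> ink_bit) & 1) << left_pos
        value |= ((right_ink >> ink_bit) & 1) << right_pos
    return value


def bit_pair(byte, j):
    """Return pair j of a pixel byte as a 2-bit value (the p_ij of the text)."""
    return (byte >> PAIR_BITS[j][1]) & 0b11


# Inks 0..7 never touch the bit 3 pair, which is the pair the 3bpp packing drops.
print(f"{pixel_byte(5, 2):08b}", [bit_pair(pixel_byte(5, 2), j) for j in (0, 2, 1, 3)])
```

With this view, a 3bpp pixel byte is just pi0_pi2_pi1 followed by two zero bits.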

I present only the actual depacking-drawing code...
These are not complete sprite routines - you would have to add loops and next_line code.

Note that for the 3bpp and 1bpp packing, the sprite's width in bytes should be divisible by four, and for 2bpp it should be divisible by two.

;;-------------------------------------
;; 3BPP - STORING METHOD: |1-1-1-1|2-2|2-2|
;;-------------------------------------
;; package format:
;;       1st byte           2nd byte           3rd byte
;; | p30_p20_p10_p00 || p02_p01_p12_p11 || p22_p21_p32_p31 |
;;
;; hl=^sprite
;; de=^screen
;; b - used as an auxiliary register
;; c - preloaded with 11110000b bit mask
;;
    ld b,(hl)    ; get first byte of the package
                 ; this byte contains all bit 0 pairs of the four pixel bytes
    inc hl       ; go to next byte of data
; extract first pixel byte
    ld a,(hl)    ; get second byte of package
                 ; this byte contains the bit 2 and 1 pairs of the first two pixel bytes
    and c        ; mask to get only the data for the first pixel byte
    rrc b        ; now get the bit 0 pair for this pixel
                 ; from the first package byte (stored in b)
                 ; and move it into its place in the accumulator:
                 ; rotating b right loads its least significant bit into carry
    rra          ; then rra loads the carry flag into
                 ; the most significant bit of the accumulator
    rrc b        ; we have to do this twice
    rra          ; to move the pair of bits
    ld (de),a    ; save to screen
    inc de       ; go to next screen position
; extract second pixel byte
    ld a,(hl)    ; reload the second byte of package (it also holds the bit pairs of the second pixel byte)
    rlca         ;
    rlca         ; rotate four times to swap the nibbles
    rlca         ; and get the pairs p12_p11 into place
    rlca         ;
    and c        ; mask to isolate bits for this pixel byte
    rrc b        ;
    rra          ; extract the next bit 0 pair from the first byte of package
    rrc b        ; and load it into the accumulator to complete the second pixel byte
    rra          ;
    ld (de),a    ; save to screen
    inc de       ; go to next screen position
    inc hl       ; go to next data byte
;
; for pixel bytes three and four we repeat all the
; operations used for the first two pixel bytes,
; this time on the third byte of the package
;
; extract third pixel byte
    ld a,(hl)
    and c
    rrc b
    rra
    rrc b
    rra
    ld (de),a
    inc de
; extract fourth pixel byte
    ld a,(hl)
    rlca
    rlca
    rlca
    rlca
    and c
    rrc b
    rra
    rrc b
    rra
    ld (de),a
    inc de
    inc hl

This code takes 68 nops per package (61 if aligned, that is, when inc e / inc l can replace inc de / inc hl).
It's 17 nops per pixel byte (15.25 if aligned).
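The data for this routine has to be prepared offline. Below is a minimal Python sketch of such a packer, plus a small model of what the Z80 fragment does, so the two can be checked against each other. The byte layout is the one the code above actually reads (all bit 0 pairs in the first byte, lowest pair for the first pixel byte); the function names are only illustrative.

```python
def pack_1111_22_22(p):
    """Pack four 3bpp pixel bytes (bits 1..0 clear) into one 3-byte package."""
    assert len(p) == 4 and all(byte & 0b11 == 0 for byte in p)
    first = 0
    for i, byte in enumerate(p):
        first |= (byte >> 6) << (2 * i)        # bit 0 pair of pixel byte i
    second = ((p[0] & 0x3C) << 2) | ((p[1] & 0x3C) >> 2)
    third = ((p[2] & 0x3C) << 2) | ((p[3] & 0x3C) >> 2)
    return bytes([first, second, third])


def unpack_1111_22_22(package):
    """Python model of the Z80 fragment: returns the four screen bytes."""
    b, out = package[0], []
    for k in range(4):
        src = package[1 + k // 2]
        # even k: 'and c'; odd k: four 'rlca' (nibble swap), then 'and c'
        nibble = (src & 0xF0) if k % 2 == 0 else ((src << 4) & 0xF0)
        pair0 = (b >> (2 * k)) & 0b11          # what two 'rrc b / rra' move into a
        out.append((pair0 << 6) | (nibble >> 2))
    return bytes(out)


sprite_bytes = bytes([0xA4, 0x58, 0x9C, 0xFC])
package = pack_1111_22_22(sprite_bytes)
print(package.hex(), unpack_1111_22_22(package) == sprite_bytes)
```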

;;-------------------------------------
;; 3BPP - STORING METHOD: |3-1|2-2|3-1|
;;-------------------------------------
;; package format:
;;       1st byte           2nd byte           3rd byte
;; | p00_p02_p01_p10 || p12_p11_p32_p31 || p20_p22_p21_p30 |
;;
;; hl=^sprite
;; de=^screen
;; b, c used as auxiliary registers
;;
; unpack first pixel
    ld c,(hl)        ; load c with first byte of package
    ld a,c           ; we'll need c later
    and 11111100b    ; mask to isolate the bits of the first pixel byte
    ld (de),a        ; save to screen
    inc de           ; next screen position
    inc hl           ; go to next data byte
; unpack second pixel
    ld b,(hl)        ; get second byte of package and
    ld a,b           ; keep it in b for later use
    and 11110000b    ; mask to isolate the bit pairs of the second pixel byte
    rrc c            ; next four instructions
    rra              ; combine the p10 from first byte (stored in c)
    rrc c            ; with p12_p11 from second byte (in a)
    rra              ; to get the second pixel byte
    ld (de),a        ; save to screen
    inc de
    inc hl
; unpack third pixel
    ld c,(hl)        ; get third byte of package and
    ld a,c           ; keep it in c for later
    and 11111100b    ; mask to isolate bits of the third pixel byte
    ld (de),a        ; save to screen
    inc de
; unpack fourth pixel
    ld a,b           ; load a with the second byte
    rlca             ; rotate four times to swap the nibbles
    rlca             ; of the byte
    rlca
    rlca
    and 11110000b    ; isolate the bits of the fourth pixel (p32_p31)
    rrc c            ; move p30 from third byte
    rra              ; into bits 7 and 6 of the accumulator,
    rrc c            ; pushing p32_p31 down into their final place
    rra              ;
    ld (de),a
    inc de
    inc hl

It's 56 nops (49 if aligned) per package.
This means 14 (12.25) nops per pixel byte.
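A matching offline packer for this layout could look like the following Python sketch. It follows what the routine above reads (first and third pixel bytes stored almost whole, second and fourth split between their neighbour's low pair and the middle byte); the function names are hypothetical.

```python
def pack_31_22_31(p):
    """Pack four 3bpp pixel bytes (bits 1..0 clear) as |3-1|2-2|3-1|."""
    p0, p1, p2, p3 = p
    assert all(byte & 0b11 == 0 for byte in p)
    return bytes([
        p0 | (p1 >> 6),                           # pixel byte 0 + bit 0 pair of byte 1
        ((p1 & 0x3C) << 2) | ((p3 & 0x3C) >> 2),  # bit 2/1 pairs of bytes 1 and 3
        p2 | (p3 >> 6),                           # pixel byte 2 + bit 0 pair of byte 3
    ])


def unpack_31_22_31(package):
    """Python model of the Z80 fragment above."""
    b1, b2, b3 = package
    return bytes([
        b1 & 0xFC,
        ((b1 & 0b11) << 6) | ((b2 & 0xF0) >> 2),
        b3 & 0xFC,
        ((b3 & 0b11) << 6) | ((b2 & 0x0F) << 2),
    ])


sprite_bytes = bytes([0xA4, 0x58, 0x9C, 0xFC])
print(unpack_31_22_31(pack_31_22_31(sprite_bytes)) == sprite_bytes)
```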

;;-------------------------------------
;; 3BPP - STORING METHOD: |3-1|3-1|3-1|
;;-------------------------------------
;; package format:
;;       1st byte           2nd byte           3rd byte
;; | p00_p02_p01_p21 || p10_p12_p11_p22 || p20_p30_p32_p31 |
;;
;; (p20 is extracted with sla/rra, which reverses the two bits
;;  of the pair, so it has to be stored with its bits swapped)
;;
;; hl=^sprite
;; de=^screen
;; b, c used as auxiliary registers
;;
; extract first pixel byte
    ld c,(hl)        ; load c with the first byte of the package
    ld a,c           ; copy to accumulator
    and 11111100b    ; mask to isolate only the bits of the first pixel byte
    ld (de),a        ; save to screen
    inc de           ; next screen position
    inc hl           ; go to next data byte
; extract second pixel byte
    ld b,(hl)        ; load b with the second byte of the package
    ld a,b           ; copy to accumulator
    and 11111100b    ; mask to get only the bits of the second pixel byte
    ld (de),a        ; save to screen
    inc de           ; next screen position
    inc hl           ; go to next data byte
; extract third pixel byte
    xor a
    rrc c: rra: rrc c: rra    ; get the bit 1 pair from the first byte
    rrc b: rra: rrc b: rra    ; get the bit 2 pair from the second byte
    ld b,(hl)        ; load b with the third byte of package
    sla b: rra: sla b: rra    ; get the bit 0 pair from the third byte
                     ; (we use sla to zero the rightmost bit pair of b)
    ld (de),a        ; save to screen
    inc de           ; next screen position
; extract fourth pixel byte
    ld a,b           ; copy the (now) shifted third byte to accumulator
    ld (de),a        ; save to screen
    inc de           ; next screen position
    inc hl           ; go to next data byte

This takes 54 nops (47 if aligned) per package.
This means 13.5 (11.75) nops per byte.

This could be done 2 nops quicker and without using the b register:

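;;-------------------------------------
;; 3BPP - STORING METHOD: |3-1|3-1|3-1|
;;-------------------------------------
;; package format:
;;       1st byte           2nd byte           3rd byte
;; | p21_p00_p02_p01 || p22_p10_p12_p11 || p20_p30_p32_p31 |
;;
;; p21, p22 and p20 sit in the top bit pair of their bytes, so that sla
;; can shift them out and leave the finished pixel byte behind in c
;; (rrc would rotate the pair back into the top of c and spoil the byte);
;; sla/rra delivers the two bits of a pair in reverse order,
;; so these three pairs are stored with their bits swapped
;;
;; hl=^sprite
;; de=^screen
;; c used as an auxiliary register
;;
; unpack first pixel
    ld c,(hl)
    xor a
    sla c
    rra
    sla c
    rra          ; a = p21-0-0-0, c = p00_p02_p01_0
    ex de,hl
    ld (hl),c    ; save 1st pixel byte to screen
    ex de,hl
    inc de
    inc hl
; unpack second pixel
    ld c,(hl)
    sla c
    rra
    sla c
    rra          ; a = p22-p21-0-0, c = p10_p12_p11_0
    ex de,hl     ; from here on hl=^screen, de=^sprite
    ld (hl),c    ; save 2nd pixel byte to screen
    inc de
    inc hl
; unpack third pixel
    ex de,hl     ; back to hl=^sprite, de=^screen
    ld c,(hl)
    sla c
    rra
    sla c
    rra          ; a = p20-p22-p21-0, c = p30_p32_p31_0
    ld (de),a
    inc de
; unpack fourth pixel
    ld a,c
    ld (de),a
    inc de
    inc hl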

This is 52 nops (45 for aligned version) per package.
13 (11.25) nops per pixel byte.

A little later I found that it could be done another 2 nops quicker by holding back both screen writes until the third pixel, with p21 and p22 stored in the top bit pair of the first two package bytes.
However, this time we make use of the b register...

;;-------------------------------------
;; 3BPP - STORING METHOD: |3-1|3-1|3-1|
;;-------------------------------------
;; package format:
;;       1st byte           2nd byte           3rd byte
;; | p21_p00_p02_p01 || p22_p10_p12_p11 || p20_p30_p32_p31 |
;;
;; (sla/rra delivers the two bits of a pair in reverse order,
;;  so p21, p22 and p20 are stored with their bits swapped)
;;
;; hl=^sprite
;; de=^screen
;; b, c used as auxiliary registers
;;
; unpack first pixel
    ld c,(hl)    ; load c with the first byte of package
    xor a        ; zero the accumulator
    sla c        ; ...
    rra          ; shift bit pair p21 into the leftmost bit pair of the accumulator
    sla c        ; (a = p21-0-0-0, c = p00_p02_p01_0)
    rra          ; ...
    inc hl       ; go to next data byte
; unpack second pixel
    ld b,(hl)    ; load b with the second byte of package
    sla b        ; ...
    rra          ; shift bit pair p22 into the leftmost bit pair of the accumulator
    sla b        ; (a = p22-p21-0-0, b = p10_p12_p11_0)
    rra          ; ...
    inc hl       ; go to next data byte
; save first and second pixel, unpack third pixel
    ex de,hl
    ld (hl),c    ; save 1st pixel byte to screen
    inc hl       ; go to next screen position
    ld (hl),b    ; save 2nd pixel byte to screen
    inc hl       ; go to next screen position
    ex de,hl
    ld c,(hl)    ; load c with the last byte of package
    sla c        ; ...
    rra          ; shift bit pair p20 into the leftmost bit pair of the accumulator
    sla c        ; (a = p20-p22-p21-0, c = p30_p32_p31_0)
    rra          ; ...
    ld (de),a    ; save 3rd pixel byte to screen
    inc de       ; next screen position
; unpack fourth pixel
    ld a,c       ;
    ld (de),a    ; save last pixel byte to screen
    inc de       ; next screen position
    inc hl       ; go to next data byte

This version takes 50 nops (43 for aligned version) per package.
12.5 nops per pixel byte (10.75 if aligned).
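One detail of this fastest variant is easy to miss when preparing the data: sla followed by rra hands over the two bits of a pair in reverse order, so the three pairs of the third pixel byte have to be stored bit-swapped. A minimal Python sketch of a packer for the layout the 50-nop routine consumes, with a model of the shift sequence for checking; all names are illustrative.

```python
def swap_bits(pair):
    """Swap the two bits of a 2-bit value."""
    return ((pair & 1) << 1) | (pair >> 1)


def pack_31_31_31(p):
    """Pack four 3bpp pixel bytes; the third one is spread over all three bytes."""
    p0, p1, p2, p3 = p
    assert all(byte & 0b11 == 0 for byte in p)
    pair0, pair2, pair1 = p2 >> 6, (p2 >> 4) & 0b11, (p2 >> 2) & 0b11
    return bytes([
        (swap_bits(pair1) << 6) | (p0 >> 2),
        (swap_bits(pair2) << 6) | (p1 >> 2),
        (swap_bits(pair0) << 6) | (p3 >> 2),
    ])


def unpack_31_31_31(package):
    """Python model of the 50-nop fragment: returns the four screen bytes."""
    a, rest = 0, []
    for byte in package:
        for _ in range(2):
            carry, byte = byte >> 7, (byte << 1) & 0xFF   # sla r
            a = (carry << 7) | (a >> 1)                   # rra (bit 0 of a is 0 here)
        rest.append(byte)                                 # r now holds a finished pixel byte
    return bytes([rest[0], rest[1], a, rest[2]])


sprite_bytes = bytes([0xA4, 0x58, 0x9C, 0xFC])
package = pack_31_31_31(sprite_bytes)
print(package.hex(), unpack_31_31_31(package) == sprite_bytes)
```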

;;--------------------------------------------------
;; 3BPP - STORING METHOD:  |1-1-1-1|1-1-1-1|1-1-1-1|
;;--------------------------------------------------
;;
;; package format:
;;       1st byte           2nd byte           3rd byte
;; | p31_p21_p11_p01 || p32_p22_p12_p02 || p30_p20_p10_p00 |
;;
;; hl'=^sprite
;; hl=^screen
;; b', c' and d' used as auxiliary registers
;;
; load b,c,d with the package bytes
    exx
    ld b,(hl)
    inc hl
    ld c,(hl)
    inc hl
    ld d,(hl)
    inc hl
; extract first pixel byte
    xor a
    rrc b: rra: rrc b: rra    ; bit 1 pair
    rrc c: rra: rrc c: rra    ; bit 2 pair
    rrc d: rra: rrc d: rra    ; bit 0 pair; now a = p00_p02_p01_0
    exx
    ld (hl),a
    inc hl
; extract second pixel byte
    exx
    xor a
    rrc b: rra: rrc b: rra
    rrc c: rra: rrc c: rra
    rrc d: rra: rrc d: rra
    exx
    ld (hl),a
    inc hl
; extract third pixel byte
    exx
    xor a
    rrc b: rra: rrc b: rra
    rrc c: rra: rrc c: rra
    rrc d: rra: rrc d: rra
    exx
    ld (hl),a
    inc hl
; extract fourth pixel byte
    exx
    xor a
    rrc b: rra: rrc b: rra
    rrc c: rra: rrc c: rra
    rrc d: rra: rrc d: rra
    exx
    ld (hl),a
    inc hl

This is very slow - 112 nops (105 if aligned) per package.
This means 28 nops per pixel byte.
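For completeness, a Python sketch of a packer for this layout too, assuming the byte order the routine reads (bit 1 pairs, then bit 2 pairs, then bit 0 pairs, with the first pixel byte in the lowest pair of each). The names are made up for the example.

```python
def pack_1111_x3(p):
    """Pack four 3bpp pixel bytes into three 'one pair kind per byte' bytes."""
    assert len(p) == 4 and all(byte & 0b11 == 0 for byte in p)
    planes = [0, 0, 0]                       # bit 1 pairs, bit 2 pairs, bit 0 pairs
    for i, byte in enumerate(p):
        pairs = ((byte >> 2) & 0b11, (byte >> 4) & 0b11, byte >> 6)
        for k in range(3):
            planes[k] |= pairs[k] << (2 * i)
    return bytes(planes)


def unpack_1111_x3(package):
    """Python model of the Z80 fragment: three 'rrc r / rra' pairs per pixel byte."""
    b, c, d = package
    out = []
    for i in range(4):
        pair1, pair2, pair0 = (b >> 2 * i) & 3, (c >> 2 * i) & 3, (d >> 2 * i) & 3
        out.append((pair0 << 6) | (pair2 << 4) | (pair1 << 2))
    return bytes(out)


sprite_bytes = bytes([0xA4, 0x58, 0x9C, 0xFC])
print(unpack_1111_x3(pack_1111_x3(sprite_bytes)) == sprite_bytes)
```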

;;--------------------------------------------------
;; 2BPP PACKED SPRITES
;;--------------------------------------------------

I more or less covered this in "Dual playfield mode 0 sprites + packing".
The possibility of packing two versions of the same sprite (for flipping/shifting) into one seems especially interesting...

;;--------------------------------------------------
;; 1BPP PACKED SPRITES
;;--------------------------------------------------

Here we limit the number of colors used by the sprite to two.
This way, all we need for one pixel byte is one bit pair, so we save 75% of the memory needed.
The "packages" are one byte each, and each stores the data needed for four pixel bytes.

;;--------------------------------------------------
;; 1BPP 
;;--------------------------------------------------
;; package format:
;; | p00_p10_p20_p30 |
;;
;; (p00, p10 and p20 pass through sla/rra, which reverses the two bits
;;  of a pair, so they are stored with their bits swapped; p30 is stored as is)
;;
;; use inks 0 and 1 for sprite
;;
;; hl=^sprite
;; de=^screen
;; c - used as an auxiliary register
;;
; extract 1st pixel byte
    xor a
    ld c,(hl)
    sla c
    rra
    sla c
    rra          ; a = p00-0-0-0
    ld (de),a
    inc de
; extract 2nd pixel byte
    xor a
    sla c
    rra
    sla c
    rra          ; a = p10-0-0-0
    ld (de),a
    inc de
; extract 3rd pixel byte
    xor a
    sla c
    rra
    sla c
    rra          ; a = p20-0-0-0
    ld (de),a
    inc de
; extract 4th pixel byte
    ld a,c       ; after six shifts c = p30-0-0-0
    ld (de),a
    inc de
    inc hl

42(37) nops per package.
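Preparing the 1bpp data offline could be sketched in Python like this. The packer takes eight two-color pixels (left to right) and accounts for the bit order the routine above produces: the first three pairs come out of sla/rra reversed, the last one is used as it stands. Function names are illustrative only.

```python
def pack_1bpp(pixels):
    """pixels: eight 0/1 values, left to right -> one package byte."""
    assert len(pixels) == 8 and all(px in (0, 1) for px in pixels)
    package = 0
    for i in range(4):
        left, right = pixels[2 * i], pixels[2 * i + 1]
        # pairs 0..2 are bit-reversed by sla/rra, so store them swapped
        pair = (left << 1) | right if i == 3 else (right << 1) | left
        package |= pair << (6 - 2 * i)
    return package


def unpack_1bpp(package):
    """Python model of the Z80 fragment: returns the four screen bytes."""
    out, c = [], package
    for _ in range(3):
        a = 0                                        # xor a
        for _ in range(2):
            carry, c = c >> 7, (c << 1) & 0xFF       # sla c
            a = (carry << 7) | (a >> 1)              # rra
        out.append(a)
    out.append(c)                                    # ld a,c
    return bytes(out)


package = pack_1bpp([1, 0, 0, 1, 1, 1, 0, 1])
print(f"{package:02x}", unpack_1bpp(package).hex())
```

In a screen byte, bit 7 is the left pixel and bit 6 the right one, so the input above should come back as 80, 40, c0, 40.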

We could preload b with a "color offset" and then we could "choose" which of the possible color pairs we want to use for the sprite.
These possible pairs are:
0-1, 2-3, 4-5, 6-7, 8-9, a-b, c-d, e-f
But if we wanted to use transparency then we would have to set 0=2=4=6=8=a=c=e.
So only 8 different colors then.
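Note that the "color offset" has to be expressed as pixel-byte bits, not as an ink number. A small Python sketch of the conversion, assuming the standard mode 0 bit layout (PAIR_BITS, offset_bits and inks are names invented for this example):

```python
# (left pixel bit, right pixel bit) positions of each ink bit in a mode 0 byte
PAIR_BITS = {0: (7, 6), 2: (5, 4), 1: (3, 2), 3: (1, 0)}


def offset_bits(ink):
    """Pixel-byte bits that put BOTH pixels of a byte on an even ink."""
    assert ink % 2 == 0 and 0 <= ink <= 14
    value = 0
    for ink_bit, (left_pos, right_pos) in PAIR_BITS.items():
        if (ink >> ink_bit) & 1:
            value |= (1 << left_pos) | (1 << right_pos)
    return value


def inks(byte):
    """Decode a pixel byte back into its (left, right) inks."""
    left = sum(((byte >> pos[0]) & 1) << bit for bit, pos in PAIR_BITS.items())
    right = sum(((byte >> pos[1]) & 1) << bit for bit, pos in PAIR_BITS.items())
    return left, right


for ink in range(0, 16, 2):
    print(f"inks {ink:2d}-{ink + 1:2d}: offset bits {offset_bits(ink):08b}")
```

Or-ing these bits into a 1bpp pixel byte moves it to the chosen pair: for example, a byte with only the left pixel set plus the offset for ink 6 decodes to inks 7 and 6.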

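;;--------------------------------------------------
;; 1BPP 
;;--------------------------------------------------
;; package format:
;; | p00_p10_p20_p30 |
;;
;; hl=^sprite
;; de=^screen
;; b = "palette chooser": the bits that the color offset (ink 2,4,6,8,10,12 or 14)
;;     sets in a pixel byte (bits 5..0, both pixels), shifted left by two,
;;     because the two rra below move them down again
;;     e.g. inks 2-3: pixel byte bits 00001100b, so b = 00110000b
;; c - used as an auxiliary register
;;
; extract 1st pixel byte
    ld a,b       ; start from the color offset
    ld c,(hl)
    sla c
    rra
    sla c
    rra          ; a = p00 + color offset in bits 5..0
    ld (de),a
    inc de
; extract 2nd pixel byte
    ld a,b
    sla c
    rra
    sla c
    rra
    ld (de),a
    inc de
; extract 3rd pixel byte
    ld a,b
    sla c
    rra
    sla c
    rra
    ld (de),a
    inc de
; extract 4th pixel byte
    and 00111111b    ; keep the color offset, drop the 3rd pixel's bit pair
                     ; ("ld a,c / or b" would merge the unshifted b instead)
    or c             ; c = p30-0-0-0 by now
    ld (de),a
    inc de
    inc hl

44(39) nops per package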

Here's an idea for a 1 nop per package optimization of the first 1bpp example, but we must sacrifice some more colors...

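;;--------------------------------------------------
;; 1BPP 
;;--------------------------------------------------
;; package format:
;; | p00_p10_p20_p30 |
;;
;; (p00 and p20 are stored with their two bits swapped, p10 and p30 as is)
;;
;; hl=^sprite
;; de=^screen
;; b,c - used as auxiliary registers
;;
;; use inks 0 and 1 for sprite, with ink 8 set to the same color as ink 1:
;; the 2nd pixel byte gets its bit pair in bits 1 and 0 (the bit 3 pair),
;; so its set pixels are drawn in ink 8 (one more color sacrificed for this optimization)
;;
; extract 1st pixel byte
    xor a
    ld c,(hl)
    sla c
    rra
    sla c
    rra          ; a = p00-0-0-0
    ld (de),a
    inc de
; extract 2nd pixel byte
    sla c
    adc a,a      ; shift the old pair out of the top of a
    sla c        ; and the new pair into bits 1 and 0
    adc a,a      ; a = 0-0-0-p10, no xor a needed
    ld (de),a
    inc de
; extract 3rd pixel byte
    ld b,a       ; copy of the 2nd pixel byte (b is not used again in this fragment)
    sla c
    rra          ; the two rra push p10 out of the bottom of a,
    sla c        ; so again no xor a is needed
    rra          ; a = p20-0-0-0
    ld (de),a
    inc de
; extract 4th pixel byte
    ld a,c       ; c = p30-0-0-0 by now
    ld (de),a
    inc de
    inc hl

41(36) nops per package.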

All presented examples could be slightly faster by using the stack as data source...

For comparison, standard drawing of four unmasked pixel bytes takes:
- 4*8=32 nops (24 if aligned) for ld version (allows masking etc.)
- 4*5=20 nops for ldi version
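To put the figures side by side, here is a small Python sketch that turns the per-package counts quoted above (unaligned versions) into nops per pixel byte and a slowdown factor relative to the plain ld and ldi copies. Only numbers from this post are used; the dictionary keys are just labels.

```python
PIXEL_BYTES_PER_PACKAGE = 4

# nops per package (unaligned), as quoted in the text above
packed = {
    "3bpp |1-1-1-1|2-2|2-2|": 68,
    "3bpp |3-1|2-2|3-1|": 56,
    "3bpp |3-1|3-1|3-1| (fastest)": 50,
    "3bpp |1-1-1-1| x3": 112,
    "1bpp (inks 0 and 1)": 42,
}
LD_COPY_NOPS_PER_BYTE = 8     # 4*8=32 nops for four bytes
LDI_NOPS_PER_BYTE = 5         # 4*5=20 nops for four bytes


def per_byte(nops_per_package):
    """Average cost of one pixel byte drawn from a package."""
    return nops_per_package / PIXEL_BYTES_PER_PACKAGE


for name, nops in packed.items():
    cost = per_byte(nops)
    print(f"{name:30s} {cost:6.2f} nops/byte"
          f"  x{cost / LD_COPY_NOPS_PER_BYTE:.2f} vs ld"
          f"  x{cost / LDI_NOPS_PER_BYTE:.2f} vs ldi")
```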

-----------------------------

Maybe someone has other ideas for packing sprite data?

sigh · 12:10, 13 December 14

So what is the average NOP speed that is acceptable?

ssr86 · 14:14, 13 December 14

Quote from: sigh on 12:10, 13 December 14
So what is the average NOP speed that is acceptable?

I'm not able to answer your question... I think (maybe this is stupid) that any speed is acceptable if it allows you to do what you wanted within a reasonable number of frames... If what you wanted is feasible on the machine, of course...
So it depends on how many sprites you want to draw and how big they are, what your target framerate is, etc.

The quickest way to draw one opaque pixel byte without background saving is to use compiled sprites:

ld (hl),value
inc hl

So 5(4) nops per byte. You could make it a little faster for the more frequent byte values if you preload them into whatever free registers you are left with:

ld (hl),c
inc hl

It's 4(3) nops. If you use stack-compiled sprites:

ld  hl, pixelbyte_pair
push hl

This makes it 3.5 nops per byte. Using preloading you could get 2 nops per byte (one push) for some of the pixel byte pairs.

If you only have a few smaller sprites on the screen then you don't need to worry about speed that much. And if these sprites have a lot of animation frames then maybe you would have to consider saving some memory... In that case you could consider using packed sprites... But note that you would be able to draw at most half as many of them in a single frame, and you would have to sacrifice some quality of your graphics (fewer colors).

"Packing" can be used with standard and compressed sprites.
