Optimising compressed Plus sprites

redbox · 16:21, 02 October 10

I have written a routine that copies a Plus hardware sprite stored in a compressed format (i.e. instead of &01,&02,&03,&04 etc it's stored as &12, &34) to the ASIC. Can anyone make it any faster...?

Code Select


        ld hl,&2000      ;stored sprite data location
        ld de,&4000      ;address of ASIC sprite location 

copy_spr_asic:    
        ld b,128         ;number of compressed pieces of data for sprite

spr_asic_loop:    
        ld a,(hl)        ;get byte of compressed data
        sra a            ;shift the bits right
        sra a            ;4 times; for example, this
        sra a            ;turns &17 into &01
        sra a

        ld (de),a        ;copy it to the ASIC in uncompressed form
        inc de           ;increase ASIC location to next byte

        ld a,(hl)        ;get same byte of compressed data
        and %00001111    ;and delete bits 7 to 4, e.g. turns &17 into &07

        ld (de),a        ;copy it to the ASIC in uncompressed form
        inc de           ;increase ASIC location to next byte

        inc hl           ;move onto next byte of compressed data

        djnz spr_asic_loop

        ret

And while we're on the subject of optimisation, does anyone know how to convert T-states into microseconds for the CPC?

fano · 16:34, 02 October 10

Due to architecture constraints, all instruction timings are rounded up to a multiple of 4 T-states on the CPC. 4 T-states = 1 µs = 1 NOP = 1 CRTC character width = 1 Mode 1 character width.
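As a worked example of that rule, here is the loop from the first post with each instruction's cost in microseconds (NOPs). The figures are the commonly quoted CPC values rather than something measured for this thread, so treat this as a sketch and verify it with an emulator's cycle counter.

```z80
        ld hl,&2000        ;stored sprite data location
        ld de,&4000        ;address of ASIC sprite location
        ld b,128

spr_asic_loop:
        ld a,(hl)          ;2 us (7 T-states, stretched to 8)
        sra a              ;2 us
        sra a              ;2 us
        sra a              ;2 us
        sra a              ;2 us
        ld (de),a          ;2 us
        inc de             ;2 us (6 T-states, stretched to 8)
        ld a,(hl)          ;2 us
        and %00001111      ;2 us
        ld (de),a          ;2 us
        inc de             ;2 us
        inc hl             ;2 us
        djnz spr_asic_loop ;4 us when the jump is taken, 3 us on exit
                           ;= 28 us per pass, 127 x 28 + 27 = 3583 us in total
        ret
```

The datasheet T-states for one pass add up to 98, which would suggest 24.5 µs if simply divided by 4. The CPC figure is 28 µs because each instruction is rounded up individually.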

Here is a chart of the most-used instructions. WinAPE also has a NOP counter for checking instruction timings (there is one at Quasar too, but it is in French).

http://www.grimware.org/doku.php/documentations/devices/z80

I have written this kind of packed-sprite code too. I'd suggest storing the right pixel on the left and the left pixel on the right of the byte, like this: 0b11110000, 0b33332222 and so on (unlike what I did the first time for RD128+). That way you can write the first pixel directly without shifting. What's more, you will not have to reload the byte to get the second pixel. Another thing: the ASIC does an 'AND 15' when it reads a pixel, so you don't have to care about the upper 4 bits.
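A minimal sketch of a packer that produces this reversed layout, assuming the unpacked sprite (one pixel per byte, 256 bytes) sits at &3000 and the packed result goes to &2000. Both addresses and the label pack_loop are made up for the illustration.

```z80
        ld hl,&3000        ;hypothetical: 256 bytes, one pixel per byte
        ld de,&2000        ;hypothetical: 128 bytes of packed output
        ld b,128

pack_loop:
        ld a,(hl)          ;left pixel
        and %00001111      ;keep only its 4 significant bits
        ld c,a             ;it will become the low nibble
        inc hl
        ld a,(hl)          ;right pixel
        inc hl
        rlca
        rlca
        rlca
        rlca               ;low nibble rotated into the high nibble
        and %11110000
        or c               ;e.g. &01,&02 becomes &21
        ld (de),a
        inc de
        djnz pack_loop
        ret
```

With the data stored this way, the first pixel can be written to the ASIC as-is and only the second one needs moving down.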

redbox · 17:14, 02 October 10

Quote from: fano on 16:34, 02 October 10
Here is a chart of the most-used instructions. WinAPE also has a NOP counter for checking instruction timings (there is one at Quasar too, but it is in French).
http://www.grimware.org/doku.php/documentations/devices/z80

Thanks for the link, I should have thought to look in Grim's website

Where is the NOP counter in WinAPE? The help file doesn't load on my version (I'm annoyingly using Windows Vista on my home laptop and the help file doesn't load on it).

Quote from: fano on 16:34, 02 October 10
I have written this kind of packed-sprite code too. I'd suggest storing the right pixel on the left and the left pixel on the right of the byte, like this: 0b11110000, 0b33332222 and so on (unlike what I did the first time for RD128+). That way you can write the first pixel directly without shifting. What's more, you will not have to reload the byte to get the second pixel. Another thing: the ASIC does an 'AND 15' when it reads a pixel, so you don't have to care about the upper 4 bits.

That's really great Fano, thanks. I did wonder about whether the ASIC bothered with the upper 4 bits! Am glad I wrote my own capture routine now because I can easily adjust it to write the data in the reversed format.

fano · 17:34, 02 October 10

Look at the picture: it is under the registers, where it says "T 0". The red cross resets the counter.

TFM · 22:30, 02 October 10

Ok, guys, nice so far

Now replace these slow "SRA A" commands with the much quicker "RRCA" commands; that saves a lot of time. That's the way I do it in FutureOS.

redbox · 23:01, 02 October 10

Quote from: TFM/FS on 22:30, 02 October 10
Ok, guys, nice so far. Now replace these slow "SRA A" commands with the much quicker "RRCA" commands; that saves a lot of time. That's the way I do it in FutureOS.

Yes, that's a good point!

I was using SRA A because I didn't realise the ASIC performed an AND %00001111 on the number (as pointed out by Fano earlier), but now we know it does, RRCA can be used instead.

So now we have:

Code Select


;Display Plus hardware sprite stored in reverse compressed format
;e.g. &01,&02,&03,&04 etc is stored as &21,&43 etc

        ld hl,&2000      ;stored sprite data location
        ld de,&4000      ;address of ASIC sprite location 

copy_spr_asic:    
        ld b,128         ;number of compressed pieces of data for sprite

spr_asic_loop:    
        ld a,(hl)        ;get byte of reverse compressed data
        ld (de),a        ;copy it to the ASIC (ASIC does a AND %00001111 and ignores upper 4 bits)
        inc de           ;increase ASIC location to next byte

        rrca             ;rotate the bits right
        rrca             ;4 times; for example, this
        rrca             ;turns &37 into &73, which
        rrca             ;the ASIC reads as &03
        ld (de),a        ;copy it to the ASIC
        inc de

        inc hl           ;move onto next byte of compressed data

        djnz spr_asic_loop

        ret

Axelay · 07:06, 03 October 10

You should be able to replace those 3 16bit incs with 8 bit incs as well, or at least 2 of them if you don't want to keep the source data page aligned.
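A sketch of what that looks like in the looped routine, with commonly quoted CPC timings in the comments (not measured here). INC E is always safe because each ASIC sprite is 256 bytes and starts on a page boundary (&4000, &4100 and so on). INC L is only safe on the assumption that the 128 source bytes do not cross a 256-byte page.

```z80
        ld hl,&2000        ;packed data; must not cross a page for INC L
        ld de,&4000        ;ASIC sprite, always starts on a page boundary
        ld b,128

spr_asic_loop:
        ld a,(hl)          ;2 us
        ld (de),a          ;2 us
        inc e              ;1 us (INC DE was 2)
        rrca               ;1 us
        rrca               ;1 us
        rrca               ;1 us
        rrca               ;1 us
        ld (de),a          ;2 us
        inc e              ;1 us (INC DE was 2)
        inc l              ;1 us (INC HL was 2)
        djnz spr_asic_loop ;4 us when the jump is taken, 3 us on exit
        ret
```

That brings one pass down from 20 µs to 17 µs.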

TFM · 19:56, 03 October 10

Quote from: redbox on 23:01, 02 October 10
Yes, that's a good point!
I was using SRA A because I didn't realise the ASIC performed an AND %00001111 on the number (as pointed out by Fano earlier), but now we know it does, RRCA can be used instead.

It's a feature, not a bug, but I'm sure at the beginning they were planning 256 colors, then the "management" told the developers "Use only half of the ASIC RAM, we save the other part". But what managers don't know is that if you take only half of the RAM (4 of 8 bits) then you have only 1/16 of the colors (16 instead of 256). What a pity.

arnoldemu · 09:33, 05 October 10

Unroll the code.
Use a lookup table to get shifted pixel data (256 bytes for the lookup table).
inc de -> inc e
Use the ASIC's "auto and &f".
If hl is aligned you can also use inc l.

Code Select


        ld hl,&2000      ;stored sprite data location
        ld de,&4000      ;address of ASIC sprite location 

rept 128
        ld a,(hl)        ;get byte of compressed data
        ld c,a
        ld a,(bc)   ; use lookup table to get shifted version
        ld (de),a
        inc e           ;increase ASIC location to next byte
        ld (de),a        ;copy it to the ASIC in uncompressed form
        inc e           ;increase ASIC location to next byte

        inc hl           ;move onto next byte of compressed data
endm

redbox · 10:09, 05 October 10

Quote from: arnoldemu on 09:33, 05 October 10
Use a lookup table to get shifted pixel data (256 bytes for the lookup table).
if hl is aligned you can also use inc l

This is fast, but obviously at the expense of size of code.

What would the lookup table be like? I assume at initialization you LD BC,table but then can't work out what format the table would be or why then loading the table into A doesn't wipe out the byte of compressed data we've just loaded into it previously...?

Also, what does 'page-aligned' mean? Axelay mentioned it earlier and I did change the INCs in the old routine to 8-bit but left the last one alone as I didn't understand the page-aligned bit...?

arnoldemu · 10:30, 05 October 10

Quote from: redbox on 10:09, 05 October 10
This is fast, but obviously at the expense of size of code.

What would the lookup table be like? I assume at initialization you LD BC,table but then can't work out what format the table would be or why then loading the table into A doesn't wipe out the byte of compressed data we've just loaded into it previously...?

Also, what does 'page-aligned' mean? Axelay mentioned it earlier and I did change the INCs in the old routine to 8-bit but left the last one alone as I didn't understand the page-aligned bit...?

The table would be 256 bytes long (one value for each of the possible values in the compressed data). It would effectively store the value of the compressed data shifted to the right 4 times, so &a3 would result in &0a, and &a0 would also result in &0a.
This table would be initialised at the beginning of your program and never modified after. It could go into ROM if the game was cartridge based.

The table should be positioned in RAM so that the low byte of its address is 0. It is then aligned to a 256-byte boundary, i.e. its start is a multiple of 256.
Then we only need to do LD B,table/256 to set its location for the code.

When we load the compressed data, this forms the lower 8 bits of the address. We load the data into the C register. Now we have formed the address in the table. We then read from this address (into the A register) to get the shifted pixel. C remains unchanged.
We can write the shifted pixel. Then we can use C itself and write that to RAM, knowing that the ASIC will AND the data.
The key here is that C is an 8-bit value: it can be used to form the address for the table lookup, and it is itself part of the data written to ASIC RAM.

Effectively, aligning something means you position it in RAM so that its start address is a multiple of some value, and this then also means that you can make assumptions about how to access and move through the data.

So, a compressed sprite is 128 bytes. If you located it so that the lowest 8 bits of its address were 0 or &80, you could then use LD HL, to set the initial address and then INC L to move through the data, knowing that L will never wrap past &FF within the sprite and so H never needs to be modified.
So by doing both of these, you can now use an instruction that is twice as fast.

redbox · 12:17, 05 October 10

Quote from: arnoldemu on 10:30, 05 October 10
The key here is that C is an 8-bit value: it can be used to form the address for the table lookup, and it is itself part of the data written to ASIC RAM.
Effectively, aligning something means you position it in RAM so that its start address is a multiple of some value, and this then also means that you can make assumptions about how to access and move through the data.

I understand how the routine and page-aligning works now, many thanks for the explanations

I see in your routine that you have forgotten to LD A,C after the first LD (DE),A : INC E, as otherwise the same value loaded from the look-up table would be copied to the ASIC twice...?

So here are the two routines fully optimised:

Code Select


;Display Plus hardware sprite stored in reverse compressed format
;e.g. &01,&02,&03,&04 etc is stored as &21,&43 etc
;49 x 128 = 6272 T-states = 1568 microseconds

            org &8000

            ld hl,&2000          ;stored sprite data location (page-aligned)
            ld de,&4000          ;address of ASIC sprite location

repeat 128
            ld a,(hl)            ;get byte of reverse compressed data
            ld (de),a            ;copy it to the ASIC (ASIC does a AND %00001111 and ignores upper 4 bits)
            inc e                ;increase ASIC location to next byte

            rrca                 ;rotate the bits right
            rrca                 ;4 times; for example, this
            rrca                 ;turns &37 into &73, which
            rrca                 ;the ASIC reads as &03
            ld (de),a            ;copy it to the ASIC
            inc e

            inc l               ;move onto next byte of compressed data
endm

            ret

Code Select


;Display Plus hardware sprite stored in compressed format
;e.g. &01,&02,&03,&04 etc is stored as &12,&34 etc
;48 x 128 = 6144 T-states = 1536 microseconds

            org &8000

            ld hl,&2000          ;stored sprite data location (page-aligned)
            ld de,&4000          ;address of ASIC sprite location
            ld b,table/256       ;table (page-aligned)

repeat 128
            ld a,(hl)            ;get byte of compressed data
            ld c,a               ;keep a copy in C, the low byte of the table address
            ld a,(bc)            ;use lookup table to get shifted version

            ld (de),a
            inc e                ;increase ASIC location to next byte
            ld a,c
            ld (de),a            ;copy it to the ASIC in uncompressed form
            inc e                ;increase ASIC location to next byte

            inc l                ;move onto next byte of compressed data
endm

            ret

            org &9000

table:        defb &00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00
              defb &01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01
              defb &02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02
              defb &03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03
              defb &04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04
              defb &05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05
              defb &06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06
              defb &07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07
              defb &08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08
              defb &09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09
              defb &0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A
              defb &0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B
              defb &0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C
              defb &0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D
              defb &0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E
              defb &0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F

So the second routine is slightly faster but has the overhead of the table...!
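One caveat on those figures, going by fano's rule earlier in the thread that timings on the CPC come in multiples of 4 T-states: dividing the datasheet total by 4 underestimates the real time. Counted in NOPs, one unrolled step of each routine appears to cost the same. The per-instruction values below are the commonly quoted CPC timings, not measurements, and the table address is only an assumption taken from the listing.

```z80
        ld hl,&2000        ;packed sprite data (page-aligned)
        ld de,&4000        ;ASIC sprite
        ld b,&90           ;assumes the lookup table sits at &9000

;one step of the RRCA routine
        ld a,(hl)          ;2
        ld (de),a          ;2
        inc e              ;1
        rrca               ;1
        rrca               ;1
        rrca               ;1
        rrca               ;1
        ld (de),a          ;2
        inc e              ;1
        inc l              ;1  = 13 us

;one step of the lookup-table routine
        ld a,(hl)          ;2
        ld c,a             ;1
        ld a,(bc)          ;2
        ld (de),a          ;2
        inc e              ;1
        ld a,c             ;1
        ld (de),a          ;2
        inc e              ;1
        inc l              ;1  = 13 us

        ret
```

On those numbers both routines take 13 x 128 = 1664 µs per sprite, so it is worth confirming the difference with WinAPE's counter before paying for the table.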

arnoldemu · 12:20, 05 October 10

Quote from: redbox on 12:17, 05 October 10
I understand how the routine and page-aligning works now, many thanks for the explanations

I see in your routine that you have forgotten to LD A,C after the first LD (DE),A : INC E, as otherwise the same value loaded from the look-up table would be copied to the ASIC twice...?

I was planning to swap the roles of HL and DE then I could use LD (DE),C for the final write.
This would make it slightly faster again because you don't need the ld a,c then.

redbox · 12:40, 05 October 10

Quote from: arnoldemu on 12:20, 05 October 10
I was planning to swap the roles of HL and DE then I could use LD (DE),C for the final write.
This would make it slightly faster again because you don't need the ld a,c then.

You mean you could use LD (HL),C

But yes, this would revise it to 44 x 128 = 5632 T-states = 1408 microseconds, which would make it significantly faster than the other routine (about 10.2% faster).
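A sketch of the swapped version being described, with DE reading the packed data and HL writing to the ASIC. It uses REPEAT/REND as in WinAPE's assembler, assumes the page-aligned table from the earlier listing is at &9000, and the timings in the comments are the commonly quoted CPC values rather than measurements.

```z80
        ld de,&2000        ;packed sprite data (page-aligned), now read via DE
        ld hl,&4000        ;ASIC sprite, now written via HL
        ld b,&90           ;high byte of the table, assumed to be at &9000

repeat 128
        ld a,(de)          ;2 us  get byte of compressed data
        ld c,a             ;1 us  BC = address of the shifted value
        ld a,(bc)          ;2 us  look up the high nibble
        ld (hl),a          ;2 us  write first pixel
        inc l              ;1 us
        ld (hl),c          ;2 us  write second pixel, ASIC ignores upper 4 bits
        inc l              ;1 us
        inc e              ;1 us  next byte of compressed data
rend                       ;= 12 us per byte

        ret
```

That is the same 44 datasheet T-states per byte, and counted in NOPs it comes to 12 x 128 = 1536 µs per sprite.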

Grim · 12:56, 05 October 10

Below is an alternative version using the stack to fetch and process 2 bytes of packed data in one go, which is a little bit faster but comes with constraints on the interrupts (which might be a no-go in some cases).

Code Select


			org &1000
			run $,spriteDepack

asic_sprite0		equ &4000

			; Fill the lookup table
			call spriteDepack_init
			; Depack sprite
			ld hl,data_sprite
			ld de,asic_sprite0
			call spriteDepack
			ret


			; Copy packed sprite into the asic RAM
			;
			; Input
			;  HL = address of packed sprite data
			;  DE = address of ASIC sprite
spriteDepack:
			; disable interrupts and init the stack
			di
			ld (spriteBlit_var_sp),sp
			ld sp,hl
			ex de,hl

			; Init the lookup table pointer
			ld d,lut_depackSprite / 256

			repeat 64
				; fetch 4 packed pixels
				pop bc		;3
		
				; write pixel 1
				ld (hl),c	;2
				inc l		;1
				; lookup pixel 2
				ld e,c		;1
				ld a,(de)	;2
				; write pixel 2
				ld (hl),a	;2
				inc l		;1
				; write pixel 3
				ld (hl),b	;2
				inc l		;1
				; lookup pixel 4
				ld e,b		;1
				ld a,(de)	;2
				; write pixel 4
				ld (hl),a	;2
				inc l		;1 (<- will be overwritten with the ORG adjustment below)
						;= 21us @ 4 pixels
			rend			;= 21*64 = 1344us
						

			org $-1			; move one byte back to overwrite a useless inc l
spriteBlit_var_sp	equ $+1
			ld sp,0
			ei
			ret			; 845 bytes... uuwwh!


			; Fill the 256 bytes lookup table
spriteDepack_init:
			ld hl,lut_depackSprite
spriteDepack_init_loop
			ld a,l
			rrca
			rrca
			rrca
			rrca
			;and %1111 ; would be more meaningful than useful =)
			ld (hl),a
			inc l
			jr nz,spriteDepack_init_loop
			ret


data_sprite		; some packed sprite data here


			align 256
lut_depackSprite		; lookup table here

redbox · 13:57, 05 October 10

That's an interesting way to take it further Grim, but will this affect the DMA if you are using it on the Plus...?

This process has been incredibly interesting for me as I've learnt new techniques and also generally how to optimize code and I'm currently going over lots of routines and giving them the treatment

One other thing it's taught me is it's not possible to update all 16 hardware sprites in one 50hz frame!

Axelay · 14:15, 05 October 10

Quote from: redbox on 13:57, 05 October 10
One other thing it's taught me is it's not possible to update all 16 hardware sprites in one 50hz frame!

Every project I've worked on so far has only required animation at 1/4 of 50hz at most. Unless you have a lot of small incremental frames you should be able to get away with updating, say 4 sprites every frame, moving through them in blocks of 4, and it would look perfectly fine.
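A rough sketch of that round-robin scheme, to be called once per frame. Everything named here is hypothetical: it assumes packed sprite n lives at sprite_data + n*128 with sprite_data page-aligned, that the ASIC registers are paged in, and it reuses the looped RRCA copy from earlier in the thread.

```z80
sprite_data equ &2000      ;hypothetical, page-aligned, 16 x 128 bytes

update_sprites:
        ld a,(first_spr)   ;0, 4, 8 or 12
        ld c,a
        add a,4
        and 15
        ld (first_spr),a   ;next frame starts 4 sprites later
        ld b,4
upd_loop:
        push bc
        ld a,c
        add a,&40
        ld d,a
        ld e,0             ;DE = &4000 + n*&100
        ld a,c
        srl a              ;A = n/2, carry = bit 0 of n
        ld l,0
        rr l               ;L = &80 if n is odd, else &00
        add a,sprite_data/256
        ld h,a             ;HL = sprite_data + n*128
        call copy_spr
        pop bc
        inc c
        djnz upd_loop
        ret

copy_spr:                  ;copy 128 packed bytes from HL to the sprite at DE
        ld b,128
copy_loop:
        ld a,(hl)
        ld (de),a
        inc e
        rrca
        rrca
        rrca
        rrca
        ld (de),a
        inc e
        inc l
        djnz copy_loop
        ret

first_spr:
        defb 0
```

Each sprite's image is then refreshed every fourth frame, which matches the 1/4 of 50Hz animation rate mentioned above.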

redbox · 14:29, 05 October 10

Quote from: Axelay on 14:15, 05 October 10
Every project I've worked on so far has only required animation at 1/4 of 50hz at most. Unless you have a lot of small incremental frames you should be able to get away with updating, say 4 sprites every frame, moving through them in blocks of 4, and it would look perfectly fine.

I will give that a go and let you know!

Grim · 15:58, 05 October 10

Quote from: redbox on 13:57, 05 October 10
will this affect the DMA if you are using it on the Plus...?

It should not interfere in any way with the DMA (except if you're using DMA-interrupts where strange things could happen :).

Quote from: redbox on 13:57, 05 October 10
One other thing it's taught me is it's not possible to update all 16 hardware sprites in one 50hz frame! :o

Indeed, even using raw sprite data and an uber-fast-copy routine, it takes a little bit more than one frame to update all the hardware sprites.

TFM · 21:50, 05 October 10

You can update all hardware sprites (16) in 16 ms, a frame has 20 ms, just use a LDI:LDI:LDI... construction

Grim · 22:44, 05 October 10

Quote from: TFM/FS on 21:50, 05 October 10
You can update all hardware sprites (16) in 16 ms, a frame has 20 ms, just use a LDI:LDI:LDI... construction ;)

Errr... What?

One frame is 312*64 = 19968us with regular CRTC settings.

16 sprites
256 bytes each
LDI = 5us / byte

16*256*5 = 20480us = 20.5ms

Am I missing something? (Or... does LDI run in 4us in your parallel universe? :)
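For reference, the unrolled LDI copy under discussion would look something like this for one sprite. The source address is hypothetical and the data is assumed to be raw (one pixel per byte); REPEAT/REND is WinAPE syntax.

```z80
        ld hl,&2000        ;hypothetical: 256 bytes of raw sprite data
        ld de,&4000        ;ASIC sprite 0

repeat 256
        ldi                ;5 us: copies (HL) to (DE), HL and DE +1, BC -1
rend                       ;256 x 5 = 1280 us per sprite

        ret
```

Sixteen of these give the 20480 µs worked out above, just over the 19968 µs frame.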

TFM · 01:53, 06 October 10

It does

arnoldemu · 09:34, 07 October 10

Quote from: TFM/FS on 01:53, 06 October 10
It does

I remember you've got a 6MHz CPC?
Has anyone tried making a 6MHz Plus yet?

Sykobee (Briggsy) · 14:47, 07 October 10

Quote from: arnoldemu on 09:34, 07 October 10
Has anyone tried making a 6MHz Plus yet?

Why not skip straight to a 16MHz Z80 CPU? I think the CPC used a 16MHz base clock, divided by four for the Z80 so you'd want to bypass that bit of logic.

Wonder if anyone has a Minimig compatible CPC FPGA implementation yet.

steve · 15:42, 07 October 10

Not (yet) CPC compatible, but the V6Z80P+ is available in very small quantities at Retroleum. It costs £85 + p&p and uses a 16MHz Z80 and an FPGA to generate a VGA display with blitter and sound; it has 1152KB of RAM.

I have no idea how to reprogram the FPGA to make it cpc compatible but it is theoretically possible.

The designer of this board is now working on an eZ80 system which I will also get when it becomes available.

The eZ80 runs at up to 50MHz, which would be equivalent to a Z80 running at 200MHz. Who needs a Core i7?

There is a free tcp/ip stack written for this chip so net access should be possible, once a browser has been ported.
