Optimising compressed Plus sprites

Started by redbox, 16:21, 02 October 10


redbox

I have written a routine that copies a Plus hardware sprite stored in a compressed format (i.e. instead of &01,&02,&03,&04 etc it's stored as &12, &34) to the ASIC.  Can anyone make it any faster...?


        ld hl,&2000      ;stored sprite data location
        ld de,&4000      ;address of ASIC sprite location

copy_spr_asic:   
        ld b,128         ;number of compressed pieces of data for sprite

spr_asic_loop:   
        ld a,(hl)        ;get byte of compressed data
        sra a            ;shift the bits right
        sra a            ;4 times; for example,
        sra a            ;this turns &17 into &01
        sra a

        ld (de),a        ;copy it to the ASIC in uncompressed form
        inc de           ;increase ASIC location to next byte

        ld a,(hl)        ;get same byte of compressed data
        and %00001111    ;and delete bits 7 to 4, e.g. turns &17 into &07

        ld (de),a        ;copy it to the ASIC in uncompressed form
        inc de           ;increase ASIC location to next byte

        inc hl           ;move onto next byte of compressed data

        djnz spr_asic_loop

        ret


And while we're on the subject of optimisation, does anyone know how to convert T-states into microseconds for the CPC?

fano

Due to architecture constraints, all instruction timings are multiples of 4 T-states on the CPC. 4 T-states = 1 µs = 1 NOP = 1 CRTC character width = 1 Mode 1 character width.

Here is a chart of the most commonly used instructions. WinAPE also has a NOP counter for checking instruction timings (there is one at Quasar too, but it is in French).

http://www.grimware.org/doku.php/documentations/devices/z80

I also wrote this kind of packed sprite code. I'd suggest storing the right pixel on the left and the left pixel on the right of the byte, like this: 0b11110000, 0b33332222 and so on (unlike what I did the first time for RD128+). This way, you can write the first pixel directly without shifting. Moreover, you will not have to reload the byte to get the second pixel. Another thing: the ASIC does an 'AND 15' when it reads a pixel, so you don't have to take care of the 4 upper bits.
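On the T-states question, here is a rough reference listing. It assumes the commonly quoted CPC figures, which are worth confirming with WinAPE's counter. Because every instruction is stretched to a whole number of microseconds, dividing the Zilog T-state count by 4 can underestimate the real time. It is a listing to read, not a routine to call.

```asm
; Zilog T-states versus time on the CPC (1 NOP = 1 us)
        nop             ;  4 T-states -> 1 us
        inc e           ;  4 T-states -> 1 us
        rrca            ;  4 T-states -> 1 us
        inc de          ;  6 T-states -> 2 us
        ld a,(hl)       ;  7 T-states -> 2 us
        ld (de),a       ;  7 T-states -> 2 us
        sra a           ;  8 T-states -> 2 us
        ldi             ; 16 T-states -> 5 us
```

LDI suggests the stretch is not always a simple round-up: 16 T-states would give 4 us, yet it is normally counted as 5 us on the CPC.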
"NOP" is the perfect program : short , fast and (known) bug free


redbox

Quote from: fano on 16:34, 02 October 10
Here is a chart of the most commonly used instructions. WinAPE also has a NOP counter for checking instruction timings (there is one at Quasar too, but it is in French).
http://www.grimware.org/doku.php/documentations/devices/z80

Thanks for the link, I should have thought to look on Grim's website  ;)   Where is the NOP counter in WinAPE?  I can't check the help file, as it won't load on my home laptop (annoyingly, I'm using Windows Vista).

Quote from: fano on 16:34, 02 October 10
I also wrote this kind of packed sprite code. I'd suggest storing the right pixel on the left and the left pixel on the right of the byte, like this: 0b11110000, 0b33332222 and so on (unlike what I did the first time for RD128+). This way, you can write the first pixel directly without shifting. Moreover, you will not have to reload the byte to get the second pixel. Another thing: the ASIC does an 'AND 15' when it reads a pixel, so you don't have to take care of the 4 upper bits.

That's really great Fano, thanks.  I did wonder about whether the ASIC bothered with the upper 4 bits!  Am glad I wrote my own capture routine now because I can easily adjust it to write the data in the reversed format.
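For reference, a minimal sketch of packing into the reversed format. It assumes the captured sprite is held as 256 unpacked bytes, one pixel per byte; the label pack_sprite and the register usage are illustrative and are not redbox's actual capture routine.

```asm
; Pack 256 one-pixel-per-byte values into 128 bytes, reversed format:
; first (left) pixel in the low nibble, second (right) pixel in the high
; nibble, so &01,&02 becomes &21.
; Input:  HL = unpacked pixels (256 bytes), DE = packed output (128 bytes)
pack_sprite:
        ld b,128                ;number of packed bytes to produce
pack_loop:
        ld a,(hl)               ;left pixel
        and %00001111           ;keep the colour number only
        ld c,a
        inc hl
        ld a,(hl)               ;right pixel
        inc hl
        rlca                    ;rotate left 4 times to move
        rlca                    ;the colour number into the
        rlca                    ;upper nibble
        rlca
        and %11110000           ;drop whatever rotated into the low nibble
        or c                    ;merge in the left pixel
        ld (de),a
        inc de
        djnz pack_loop
        ret
```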

fano

Look at the picture: it is under the registers, where it says "T 0". The red cross resets the counter.

TFM

Ok, guys, nice so far  :)  Now replace these slow "SRA A" instructions with the much quicker "RRCA" instructions; that saves a lot of time. That's the way I do it in FutureOS  ;) 8) :laugh:
TFM of FutureSoft
Also visit the CPC and Plus users favorite OS: FutureOS - The Revolution on CPC6128 and 6128Plus

redbox

Quote from: TFM/FS on 22:30, 02 October 10
Ok, guys, nice so far  :)  Now replace these slow "SRA A" instructions with the much quicker "RRCA" instructions; that saves a lot of time. That's the way I do it in FutureOS  ;) 8) :laugh:

Yes, that's a good point!

I was using SRA A because I didn't realise the ASIC performed an AND %00001111 on the number (as pointed out by Fano earlier), but now that we know it does, RRCA can be used instead.

So now we have:


;Display Plus hardware sprite stored in reverse compressed format
;e.g. &01,&02,&03,&04 etc is stored as &21,&43 etc

        ld hl,&2000      ;stored sprite data location
        ld de,&4000      ;address of ASIC sprite location

copy_spr_asic:   
        ld b,128         ;number of compressed pieces of data for sprite

spr_asic_loop:   
        ld a,(hl)        ;get byte of reverse compressed data
        ld (de),a        ;copy it to the ASIC (ASIC does a AND %00001111 and ignores upper 4 bits)
        inc de           ;increase ASIC location to next byte

        rrca             ;rotate the bits right
        rrca             ;4 times; for example, this
        rrca             ;turns &37 into &73, which
        rrca             ;the ASIC reads as &03
        ld (de),a        ;copy it to the ASIC
        inc de

        inc hl           ;move onto next byte of compressed data

        djnz spr_asic_loop

        ret


Axelay

You should be able to replace those three 16-bit INCs with 8-bit INCs as well, or at least two of them if you don't want to keep the source data page-aligned.

TFM

Quote from: redbox on 23:01, 02 October 10
Yes, that's a good point!
I was using SRA A because I didn't realise the ASIC performed an AND %00001111 on the number (as pointed out by Fano earlier), but now that we know it does, RRCA can be used instead.

It's a feature, not a bug, but I'm sure that at the beginning they were planning 256 colors, then the "management" told the developers "Use only half of the ASIC RAM, we save the other part". But what managers don't know is that if you take only half of the RAM (4 of 8 bits) then you have only 1/16 of the colors (16 instead of 256) - what a pity.


arnoldemu

Unroll the code.
Use a lookup table to get the shifted pixel data (256 bytes for the lookup table ;) )
inc de -> inc e
Use the ASIC's automatic "AND &0F".
If HL is aligned you can also use inc l.



        ld hl,&2000      ;stored sprite data location
        ld de,&4000      ;address of ASIC sprite location

rept 128
        ld a,(hl)        ;get byte of compressed data
        ld c,a
        ld a,(bc)   ; use lookup table to get shifted version
        ld (de),a
        inc e           ;increase ASIC location to next byte
        ld (de),a        ;copy it to the ASIC in uncompressed form
        inc e           ;increase ASIC location to next byte

        inc hl           ;move onto next byte of compressed data
endm

My games. My Games
My website with coding examples: Unofficial Amstrad WWW Resource

redbox

Quote from: arnoldemu on 09:33, 05 October 10
Use a lookup table to get shifted pixel data (256 bytes for lookup table ;) )
if hl is aligned you can also use inc l

This is fast, but obviously at the expense of size of code.

What would the lookup table be like?  I assume at initialization you LD BC,table but then can't work out what format the table would be or why then loading the table into A doesn't wipe out the byte of compressed data we've just loaded into it previously...?

Also, what does 'page-aligned' mean?  Axelay mentioned it earlier and I did change the INCs in the old routine to 8-bit but left the last one alone as I didn't understand the page-aligned bit...?  :(

arnoldemu

Quote from: redbox on 10:09, 05 October 10
This is fast, but obviously at the expense of size of code.

What would the lookup table be like?  I assume at initialization you LD BC,table but then can't work out what format the table would be or why then loading the table into A doesn't wipe out the byte of compressed data we've just loaded into it previously...?

Also, what does 'page-aligned' mean?  Axelay mentioned it earlier and I did change the INCs in the old routine to 8-bit but left the last one alone as I didn't understand the page-aligned bit...?  :(

The table would be 256 bytes long (one entry for each possible value of a compressed byte). Each entry stores the compressed byte shifted right 4 times, so &a3 would give &0a, and &a0 would also give &0a.
This table would be initialised at the beginning of your program and never modified afterwards. It could go into ROM if the game was cartridge based.

The table should be positioned in RAM so that the low byte of its address is 0. It is then aligned to a 256-byte boundary, i.e. its start is a multiple of 256.
Then we only need to do LD B,table/256 to set its location for the code.

When we load the compressed data, it forms the lower 8 bits of the address. We load the data into the C register, and BC now holds the address of the table entry. We then read from this address (into the A register) to get the shifted pixel. C remains unchanged.
We can write the shifted pixel. Then we can take C itself and write that to RAM, knowing that the ASIC will AND the data.
The key here is that C is an 8-bit value: it can be used to form the address for the table lookup, and it is itself part of the data written to ASIC RAM.

Effectively, aligning something means you position it in RAM so that its start address is a multiple of some value, which in turn means you can make assumptions about how to access and move through the data.

So, a compressed sprite is 128 bytes. If you located it so the low 8 bits of its address were 0 or &80, you could then use LD HL,address to set the initial address and INC L to move through the data, knowing that L will never wrap past &FF within the sprite, so H never needs to be modified.
So by doing both of these, you can now use an instruction that is 2 times faster :)
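The layout described above could be sketched as follows. The addresses and the labels sprite_a, sprite_b and lookup_demo are assumptions chosen for illustration; the table is only reserved here and would still need filling at start-up.

```asm
        org &2000               ;low byte &00: page-aligned
sprite_a: defs 128              ;&2000-&207F
sprite_b: defs 128              ;&2080-&20FF, still inside one page

        org &9000               ;low byte &00 again
table:  defs 256                ;entry n is meant to hold n shifted right 4 times

; With B = table/256 = &90, a packed byte of &A3 in C makes BC = &90A3,
; and a filled table holds &0A at that address.
lookup_demo:
        ld b,table/256          ;high byte of the table address, set once
        ld hl,sprite_b          ;L starts at &80
        ld a,(hl)               ;fetch a packed byte, e.g. &A3
        ld c,a                  ;BC now points at the table entry for it
        ld a,(bc)               ;A = table entry, C still holds the packed byte
        inc l                   ;&80 -> &81, H is untouched
        ret
```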

redbox

Quote from: arnoldemu on 10:30, 05 October 10
The key here is that C is 8-bit value, it can be used to form address in table to lookup, and itself is part of the data written to asic ram.
Effectively, aligning something means you position it in RAM so that its start address is a multiple of some value, which in turn means you can make assumptions about how to access and move through the data.

I understand how the routine and page-aligning works now, many thanks for the explanations  :)

I see in your routine that you have forgotten to LD A,C after the first LD (DE),A : INC E, as otherwise the same value loaded from the look-up table would be copied to the ASIC twice...?

So here are the two routines fully optimised:


;Display Plus hardware sprite stored in reverse compressed format
;e.g. &01,&02,&03,&04 etc is stored as &21,&43 etc
;49 x 128 = 6272 T-states = 1568 microseconds

            org &8000

            ld hl,&2000          ;stored sprite data location (page-aligned)
            ld de,&4000          ;address of ASIC sprite location

repeat 128
            ld a,(hl)            ;get byte of reverse compressed data
            ld (de),a            ;copy it to the ASIC (ASIC does a AND %00001111 and ignores upper 4 bits)
            inc e                ;increase ASIC location to next byte

            rrca                 ;rotate the bits right
            rrca                 ;4 times; for example, this
            rrca                 ;turns &37 into &73, which
            rrca                 ;the ASIC reads as &03
            ld (de),a            ;copy it to the ASIC
            inc e

            inc l               ;move onto next byte of compressed data
endm

            ret



;Display Plus hardware sprite stored in compressed format
;e.g. &01,&02,&03,&04 etc is stored as &12,&34 etc
;48 x 128 = 6144 T-states = 1536 microseconds

            org &8000

            ld hl,&2000          ;stored sprite data location (page-aligned)
            ld de,&4000          ;address of ASIC sprite location
            ld b,table/256       ;table (page-aligned)

repeat 128
            ld a,(hl)            ;get byte of compressed data
            ld c,a               ;keep a copy in C: table index now, second pixel later
            ld a,(bc)            ;use lookup table to get shifted version

            ld (de),a            ;copy first pixel to the ASIC
            inc e                ;increase ASIC location to next byte
            ld a,c               ;recover the original byte
            ld (de),a            ;copy it to the ASIC (ASIC does a AND %00001111 and ignores upper 4 bits)
            inc e                ;increase ASIC location to next byte

            inc l                ;move onto next byte of compressed data
endm

            ret

            org &9000

table:        defb &00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00
              defb &01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01
              defb &02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02
              defb &03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03
              defb &04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04
              defb &05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05
              defb &06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06
              defb &07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07
              defb &08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08
              defb &09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09
              defb &0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A
              defb &0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B
              defb &0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C
              defb &0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D
              defb &0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E
              defb &0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F


So the second routine is slightly faster but has the overhead of the table...!
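A caveat on the microsecond figures: they divide T-states by 4, whereas the CPC stretches each instruction to a whole number of NOPs. The listing below annotates one pass of each unrolled body with the commonly quoted CPC NOP counts. These counts are an assumption worth checking with WinAPE's counter, and they ignore any extra delay on ASIC RAM writes. It is a listing to read, not a routine to call.

```asm
; rotate version, one packed byte
        ld a,(hl)       ;2
        ld (de),a       ;2
        inc e           ;1
        rrca            ;1
        rrca            ;1
        rrca            ;1
        rrca            ;1
        ld (de),a       ;2
        inc e           ;1
        inc l           ;1
                        ;= 13us per byte, x128 = 1664us

; table version, one packed byte
        ld a,(hl)       ;2
        ld c,a          ;1
        ld a,(bc)       ;2
        ld (de),a       ;2
        inc e           ;1
        ld a,c          ;1
        ld (de),a       ;2
        inc e           ;1
        inc l           ;1
                        ;= 13us per byte, x128 = 1664us
```

Counted this way the two routines would tie on a real CPC, so the table only pays off once a further instruction can be removed.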

arnoldemu

Quote from: redbox on 12:17, 05 October 10
I understand how the routine and page-aligning works now, many thanks for the explanations  :)

I see in your routine that you have forgotten to LD A,C after the first LD (DE),A : INC E, as otherwise the same value loaded from the look-up table would be copied to the ASIC twice...?
I was planning to swap the roles of HL and DE then I could use LD (DE),C for the final write.
This would make it slightly faster again because you don't need the ld a,c then.


redbox

Quote from: arnoldemu on 12:20, 05 October 10
I was planning to swap the roles of HL and DE then I could use LD (DE),C for the final write.
This would make it slightly faster again because you don't need the ld a,c then.

You mean you could use LD (HL),C  ;)

But yes, this would revise it to 44 x 128 = 5632 T-states = 1408 microseconds, which would make it significantly faster than the other routine (about 10.2% faster)  :)
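A sketch of what that swapped version might look like. It assumes the page-aligned table from the earlier routine and WinAPE-style repeat/rend (adjust for your assembler). The NOP counts in the comments are the commonly quoted CPC figures, not measurements.

```asm
        ld de,&2000          ;packed sprite data (page-aligned), now read via DE
        ld hl,&4000          ;ASIC sprite address, now written via HL
        ld b,table/256       ;page-aligned lookup table

repeat 128
        ld a,(de)            ;2 get byte of compressed data
        ld c,a               ;1 keep it in C as table index and second pixel
        ld a,(bc)            ;2 shifted version from the table
        ld (hl),a            ;2 write first pixel
        inc l                ;1
        ld (hl),c            ;2 write second pixel (ASIC ignores upper 4 bits)
        inc l                ;1
        inc e                ;1 next byte of compressed data
rend                         ;= 12us per byte, x128 = 1536us

        ret
```

The body adds up to 44 T-states, matching the figure quoted above.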

Grim

Below is an alternative version using the stack to fetch and process 2 bytes of packed data in one go, which is a little bit faster but comes with constraints on the interrupts (which might be a no-go in some cases).


org &1000
run $,spriteDepack

asic_sprite0 equ &4000

; Fill the lookup table
call spriteDepack_init
; Depack sprite
ld hl,data_sprite
ld de,asic_sprite0
call spriteDepack
ret


; Copy packed sprite into the asic RAM
;
; Input
;  HL = address of packed sprite data
;  DE = address of ASIC sprite
spriteDepack:
; disable interrupts and init the stack
di
ld (spriteBlit_var_sp),sp
ld sp,hl
ex de,hl

; Init the lookup table pointer
ld d,lut_depackSprite / 256

repeat 64
; fetch 4 packed pixels
pop bc ;3

; write pixel 1
ld (hl),c ;2
inc l ;1
; lookup pixel 2
ld e,c ;1
ld a,(de) ;2
; write pixel 2
ld (hl),a ;2
inc l ;1
; write pixel 3
ld (hl),b ;2
inc l ;1
; lookup pixel 4
ld e,b ;1
ld a,(de) ;2
; write pixel 4
ld (hl),a ;2
inc l ;1 (<- will be overwritten with the ORG adjustment below)
;= 21us @ 4 pixels
rend ;= 21*64 = 1344us


org $-1 ; move one byte back to overwrite an useless inc l
spriteBlit_var_sp equ $+1
ld sp,0
ei
ret ; 845 bytes... uuwwh!


; Fill the 256 bytes lookup table
spriteDepack_init:
ld hl,lut_depackSprite
spriteDepack_init_loop
ld a,l
rrca
rrca
rrca
rrca
;and %1111 ; would be more meaningful than useful =)
ld (hl),a
inc l
jr nz,spriteDepack_init_loop
ret


data_sprite ; some packed sprite data here


align 256
lut_depackSprite ; lookup table here

redbox

That's an interesting way to take it further Grim, but will this affect the DMA if you are using it on the Plus...?

This process has been incredibly interesting for me as I've learnt new techniques and also generally how to optimize code and I'm currently going over lots of routines and giving them the treatment  :)   

One other thing it's taught me is it's not possible to update all 16 hardware sprites in one 50hz frame!  :o

Axelay

Quote from: redbox on 13:57, 05 October 10
One other thing it's taught me is it's not possible to update all 16 hardware sprites in one 50hz frame!  :o

Every project I've worked on so far has only required animation at 1/4 of 50hz at most.  Unless you have a lot of small incremental frames you should be able to get away with updating, say 4 sprites every frame, moving through them in blocks of 4, and it would look perfectly fine.
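One way the four-per-frame idea could be sketched. The names update_4_sprites, next_sprite and sprite_ptrs are hypothetical. It assumes copy_spr_asic is the looped routine from the first post (HL = packed data, DE = ASIC address), that it is called once per frame with the ASIC registers paged in, and that sprite n lives at &4000 + n * &100.

```asm
; Refresh sprites 0-3, then 4-7, 8-11, 12-15 on successive frames.
update_4_sprites:
        ld a,(next_sprite)      ;0, 4, 8 or 12
        ld c,a                  ;C = sprite number to update
        add a,4
        and 15                  ;wrap back to 0 after sprite 15
        ld (next_sprite),a
        ld b,4                  ;four sprites this frame
upd_loop:
        push bc
        ld a,c
        add a,a                 ;2 bytes per pointer
        ld e,a
        ld d,0
        ld hl,sprite_ptrs
        add hl,de
        ld a,(hl)
        inc hl
        ld h,(hl)
        ld l,a                  ;HL = packed data for this sprite
        ld a,c
        or &40
        ld d,a
        ld e,0                  ;DE = &4000 + sprite * &100
        call copy_spr_asic      ;clobbers B, hence the push/pop
        pop bc
        inc c
        djnz upd_loop
        ret

next_sprite: defb 0
sprite_ptrs: defs 32            ;16 pointers, low byte first, set by the game
                                ;(keep these outside &4000-&7FFF)
```

Each sprite is then refreshed every fourth frame, in line with the 1/4 of 50 Hz animation rate mentioned above.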

redbox

Quote from: Axelay on 14:15, 05 October 10
Every project I've worked on so far has only required animation at 1/4 of 50hz at most.  Unless you have a lot of small incremental frames you should be able to get away with updating, say 4 sprites every frame, moving through them in blocks of 4, and it would look perfectly fine.

I will give that a go and let you know!

Grim

Quote from: redbox on 13:57, 05 October 10
will this affect the DMA if you are using it on the Plus...?
It should not interfere in any way with the DMA (except if you're using DMA-interrupts where strange things could happen :).

Quote
One other thing it's taught me is it's not possible to update all 16 hardware sprites in one 50hz frame!  :o
Indeed, even using raw sprite data and an uber-fast-copy routine, it takes a little bit more than one frame to update all the hardware sprites.

TFM

You can update all hardware sprites (16) in 16 ms, a frame has 20 ms, just use a LDI:LDI:LDI... construction  ;)

Grim

Quote from: TFM/FS on 21:50, 05 October 10
You can update all hardware sprites (16) in 16 ms, a frame has 20 ms, just use a LDI:LDI:LDI... construction  ;)
Errr... What?

One frame is 312*64 = 19968us with regular CRTC settings.


  • 16 sprites
  • 256 bytes each
  • LDI = 5us / byte

16*256*5 = 20480us = 20.5ms

Am I missing something? (Or... does LDI run in 4us in your parallel universe? :)

TFM

It does  ;) ;D

arnoldemu

Quote from: TFM/FS on 01:53, 06 October 10
It does  ;) ;D
I remember you've got a 6 MHz CPC?
Has anyone tried making a 6 MHz Plus yet?

Sykobee (Briggsy)

Quote from: arnoldemu on 09:34, 07 October 10
Has anyone tried making a 6 MHz Plus yet?


Why not skip straight to a 16MHz Z80 CPU? I think the CPC used a 16MHz base clock, divided by four for the Z80 so you'd want to bypass that bit of logic.


Wonder if anyone has a Minimig compatible CPC FPGA implementation yet.

steve

Not (yet) CPC compatible, but the V6Z80P+ is available in very small quantities at Retroleum. It costs £85 + p&p and uses a 16 MHz Z80 and an FPGA to generate a VGA display with blitter and sound; it has 1152 KB of RAM.

I have no idea how to reprogram the FPGA to make it cpc compatible but it is theoretically possible.

The designer of this board is now working on an eZ80 system which I will also get when it becomes available.

The eZ80 runs at up to 50 MHz, which would be roughly equivalent to a Z80 running at 200 MHz, who needs a Core i7 :laugh:
There is a free tcp/ip stack written for this chip so net access should be possible, once a browser has been ported.
