I have written a routine that copies a Plus hardware sprite stored in a compressed format (i.e. instead of &01,&02,&03,&04 etc it's stored as &12, &34) to the ASIC. Can anyone make it any faster...?
ld hl,&2000 ;stored sprite data location
ld de,&4000 ;address of ASIC sprite location
copy_spr_asic:
ld b,128 ;number of compressed pieces of data for sprite
spr_asic_loop:
ld a,(hl) ;get byte of compressed data
sra a ;shift the bits right
sra a ;4 times; for example, this
sra a ;turns &17 into &01
sra a
ld (de),a ;copy it to the ASIC in uncompressed form
inc de ;increase ASIC location to next byte
ld a,(hl) ;get same byte of compressed data
and %00001111 ;and delete bits 7 to 4, e.g. turns &17 into &07
ld (de),a ;copy it to the ASIC in uncompressed form
inc de ;increase ASIC location to next byte
inc hl ;move onto next byte of compressed data
djnz spr_asic_loop
ret
And while we're on the subject of optimisation, does anyone know how to convert T-states into microseconds for the CPC?
Due to architecture constraints, all instruction timings are multiples of 4 T-states on the CPC. 4 T-states = 1 µs = 1 NOP = 1 CRTC char width = 1 Mode 1 char width.
This is a chart with the most used instructions. WinAPE also has a NOP counter to check instruction timings (there is one at Quasar too, but it is in French):
http://www.grimware.org/doku.php/documentations/devices/z80
I also wrote this type of packed sprite code. I'd suggest you store the right pixel on the left and the left pixel on the right of the byte, like this: 0b11110000, 0b33332222 and so on (unlike what I did the first time for RD128+). This way, you can write the first one directly without shifting. Also, you will not have to reload the byte to get the second pixel. Another thing is that the ASIC will do an 'AND 15' when fetching the pixel, so you don't have to care about the upper 4 bits.
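A rough Python sketch of the swapped-nibble layout described above, of the kind a PC-side capture or conversion tool might use. The function names are made up for illustration, and the `& 15` stands in for the masking the ASIC is said to apply to sprite data. The second function mirrors a copy loop that writes the byte as-is and then writes it again after four right rotations.

```python
def pack_swapped(pixels):
    """Pack 4-bit pixels two per byte: first pixel in the LOW nibble,
    second pixel in the HIGH nibble (so &01,&02 becomes &21)."""
    assert len(pixels) % 2 == 0
    return bytes((pixels[i] & 15) | ((pixels[i + 1] & 15) << 4)
                 for i in range(0, len(pixels), 2))


def asic_copy(packed):
    """Mimic the copy loop: write the byte as-is, rotate right four
    times (RRCA x4 swaps the nibbles), write again. The '& 15' models
    the ASIC keeping only the low 4 bits of each sprite byte."""
    out = []
    for byte in packed:
        out.append(byte & 15)                          # first write
        rotated = ((byte >> 4) | (byte << 4)) & 0xFF   # RRCA x4
        out.append(rotated & 15)                       # second write
    return out


pixels = [1, 2, 3, 4]
packed = pack_swapped(pixels)
print(packed.hex())          # the two packed bytes, &21 and &43
print(asic_copy(packed))     # what ends up in the sprite
```

With this layout the first pixel needs no shifting at all, which is where the saving comes from.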
Quote from: fano on 16:34, 02 October 10
This is a chart with the most used instructions. WinAPE also has a NOP counter to check instruction timings (there is one at Quasar too, but it is in French):
http://www.grimware.org/doku.php/documentations/devices/z80
Thanks for the link, I should have thought to look in Grim's website ;) Where is the NOP counter in WinAPE? The help file doesn't load on my version (I'm annoyingly using Windows Vista on my home laptop and the help file doesn't load on it).
Quote from: fano on 16:34, 02 October 10
I also wrote this type of packed sprite code. I'd suggest you store the right pixel on the left and the left pixel on the right of the byte, like this: 0b11110000, 0b33332222 and so on (unlike what I did the first time for RD128+). This way, you can write the first one directly without shifting. Also, you will not have to reload the byte to get the second pixel. Another thing is that the ASIC will do an 'AND 15' when fetching the pixel, so you don't have to care about the upper 4 bits.
That's really great Fano, thanks. I did wonder about whether the ASIC bothered with the upper 4 bits! Am glad I wrote my own capture routine now because I can easily adjust it to write the data in the reversed format.
Look at the picture: it is under the registers, where it says "T 0". The red cross resets the counter.
Ok, guys, nice so far :) Now replace these slow "SRA A" commands with the much quicker "RRCA" commands, that saves a lot of time. That's the way I do it in FutureOS ;) 8) :laugh:
Quote from: TFM/FS on 22:30, 02 October 10
Ok, guys, nice so far :) Now replace these slow "SRA A" commands with the much quicker "RRCA" commands, that saves a lot of time. That's the way I do it in FutureOS ;) 8) :laugh:
Yes, that's a good point!
I was using the SRA A because I didn't realise the ASIC performed an AND %00001111 on the number (as pointed out by Fano earlier), but now we know it does, RRCA can be used instead.
So now we have:
;Display Plus hardware sprite stored in reverse compressed format
;e.g. &01,&02,&03,&04 etc is stored as &21,&43 etc
ld hl,&2000 ;stored sprite data location
ld de,&4000 ;address of ASIC sprite location
copy_spr_asic:
ld b,128 ;number of compressed pieces of data for sprite
spr_asic_loop:
ld a,(hl) ;get byte of reverse compressed data
ld (de),a ;copy it to the ASIC (ASIC does a AND %00001111 and ignores upper 4 bits)
inc de ;increase ASIC location to next byte
rrca ;shift the bits right
rrca ;4 times, for example this
rrca ;turns &37 into &03
rrca
ld (de),a ;copy it to the ASIC
inc de
inc hl ;move onto next byte of compressed data
djnz spr_asic_loop
You should be able to replace those 3 16bit incs with 8 bit incs as well, or at least 2 of them if you don't want to keep the source data page aligned.
Quote from: redbox on 23:01, 02 October 10
Yes, that's a good point!
I was using the SRA A because I didn't realise the ASIC performed an AND %00001111 on the number (as pointed out by Fano earlier), but now we know it does, RRCA can be used instead.
It's a feature, not a bug, but I'm sure at the beginning they were planning 256 colors, then the "management" told the developers "Use only half of the ASIC RAM, we save the other part". But what managers don't know is that if you take only half of the RAM (4 of 8 bits) then you have only 1/16 of the colors (16 instead of 256) - what a pity.
Unroll the code.
Use a lookup table to get shifted pixel data (256 bytes for lookup table ;) )
inc de -> inc e
use asics "auto and &f"
if hl is aligned you can also use inc l
ld hl,&2000 ;stored sprite data location
ld de,&4000 ;address of ASIC sprite location
rept 128
ld a,(hl) ;get byte of compressed data
ld c,a
ld a,(bc) ; use lookup table to get shifted version
ld (de),a
inc e ;increase ASIC location to next byte
ld (de),a ;copy it to the ASIC in uncompressed form
inc e ;increase ASIC location to next byte
inc hl ;move onto next byte of compressed data
endm
Quote from: arnoldemu on 09:33, 05 October 10
Use a lookup table to get shifted pixel data (256 bytes for lookup table ;) )
if hl is aligned you can also use inc l
This is fast, but obviously at the expense of size of code.
What would the lookup table be like? I assume at initialization you LD BC,table but then can't work out what format the table would be or why then loading the table into A doesn't wipe out the byte of compressed data we've just loaded into it previously...?
Also, what does 'page-aligned' mean? Axelay mentioned it earlier and I did change the INCs in the old routine to 8-bit but left the last one alone as I didn't understand the page-aligned bit...? :(
Quote from: redbox on 10:09, 05 October 10
This is fast, but obviously at the expense of size of code.
What would the lookup table be like? I assume at initialization you LD BC,table but then can't work out what format the table would be or why then loading the table into A doesn't wipe out the byte of compressed data we've just loaded into it previously...?
Also, what does 'page-aligned' mean? Axelay mentioned it earlier and I did change the INCs in the old routine to 8-bit but left the last one alone as I didn't understand the page-aligned bit...? :(
The table would be 256 bytes long (one value for each of the possible values in the compressed data). It would effectively store the value of the compressed data shifted to the right 4 times. so &a3 would result in &0a, and &a0 would also result in &0a.
This table would be initialised at the beginning of your program and never modified after. It could go into ROM if the game was cartridge based.
The table should be positioned in RAM so that the low byte of its address is 0. It is then aligned to a 256-byte boundary, i.e. its start is a multiple of 256.
Then we only need to do LD B,table/256 to set its location for the code.
When we load the compressed data, this forms the lower 8-bits of the address. We load the data into the C register. Now we have formed the address in the table. We then read from this address (into A register) to get the shifted pixel. C remains unchanged.
We can write the shifted pixel. Then we can use C itself and write that to ram, knowing that the asic will AND the data.
The key here is that C is 8-bit value, it can be used to form address in table to lookup, and itself is part of the data written to asic ram.
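The lookup described above can be modelled in a few lines of Python. This is only a sketch: `TABLE_BASE` is an assumed page-aligned address, and `lookup` is a hypothetical stand-in for `LD A,(BC)` with B holding the table's high byte and C holding the packed byte.

```python
TABLE_BASE = 0x9000   # assumed page-aligned address (low byte is 0)

# One entry per possible packed byte: the byte shifted right four times.
table = [value >> 4 for value in range(256)]


def lookup(b, c, memory_base=TABLE_BASE):
    """Model LD A,(BC): B is the table's high byte, C the packed byte."""
    address = (b << 8) | c
    return table[address - memory_base]


b = TABLE_BASE >> 8              # LD B,table/256
for c in (0xA3, 0xA0, 0x17):
    first = lookup(b, c)         # shifted pixel read from the table
    second = c & 15              # C written as-is, the ASIC masks it
    print(hex(c), first, second)
```

Because the table starts on a 256-byte boundary, the packed byte itself is the whole offset into it, so no addition is needed to form the address.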
Effectively aligning something means you position it in RAM so that its start address is a multiple of some value, and this also then means that you can make assumptions about how to access and move through the data.
So, a compressed sprite is 128 bytes. If you located it so its lowest 8 bits were 0 or &80, you could then use LD HL, to set the initial address and then INC L to move through the data, knowing that when you increment L it will never go past 255 while reading the sprite and so H never needs to be modified.
So by doing both of these, you can now use an instruction that is 2 times faster :)
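A quick way to check whether a given start address is safe for the INC L trick is sketched below in Python. The function name is made up for illustration; it only tests that no byte of the block needs a carry into H.

```python
def inc_l_is_safe(start, length):
    """True if stepping through 'length' bytes from 'start' with INC L
    never carries out of the low byte (so H never needs to change)."""
    return (start & 0xFF) + length - 1 <= 0xFF


for start in (0x2000, 0x2080, 0x20C0):
    print(hex(start), inc_l_is_safe(start, 128))
```

For a 128-byte packed sprite, any start address whose low byte is between &00 and &80 passes; &20C0 does not, because the data would run across a page boundary.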
Quote from: arnoldemu on 10:30, 05 October 10
The key here is that C is 8-bit value, it can be used to form address in table to lookup, and itself is part of the data written to asic ram.
Effectively aligning something means you position it in RAM so that its start address is a multiple of some value, and this also then means that you can make assumptions about how to access and move through the data.
I understand how the routine and page-aligning works now, many thanks for the explanations :)
I see in your routine that you have forgotten to LD A,C after the first LD (DE),A : INC E, as otherwise the same value loaded from the look-up table would be copied to the ASIC twice...?
So here are the two routines fully optimised:
;Display Plus hardware sprite stored in reverse compressed format
;e.g. &01,&02,&03,&04 etc is stored as &21,&43 etc
;49 T-states per byte on a bare Z80, but the CPC stretches each instruction to a whole number of microseconds: 13 x 128 = 1664 microseconds
org &8000
ld hl,&2000 ;stored sprite data location (page-aligned)
ld de,&4000 ;address of ASIC sprite location
repeat 128
ld a,(hl) ;get byte of reverse compressed data
ld (de),a ;copy it to the ASIC (ASIC does a AND %00001111 and ignores upper 4 bits)
inc e ;increase ASIC location to next byte
rrca ;shift the bits right
rrca ;4 times, for example this
rrca ;turns &37 into &03
rrca
ld (de),a ;copy it to the ASIC
inc e
inc l ;move onto next byte of compressed data
endm
ret
;Display Plus hardware sprite stored in compressed format
;e.g. &01,&02,&03,&04 etc is stored as &12,&34 etc
;48 T-states per byte on a bare Z80; with CPC timings this is also 13 x 128 = 1664 microseconds
org &8000
ld hl,&2000 ;stored sprite data location (page-aligned)
ld de,&4000 ;address of ASIC sprite location
ld b,table/256 ;table (page-aligned)
repeat 128
ld a,(hl) ;get byte of compressed data
ld c,a ;keep it in C: low byte of the table address, and the second pixel
ld a,(bc) ;use lookup table to get shifted version
ld (de),a ;copy the shifted pixel to the ASIC
inc e ;increase ASIC location to next byte
ld a,c ;get the original byte back
ld (de),a ;copy it to the ASIC (ASIC does an AND %00001111 and ignores upper 4 bits)
inc e ;increase ASIC location to next byte
inc l ;move onto next byte of compressed data
endm
ret
org &9000
table: defb &00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00,&00
defb &01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01,&01
defb &02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02,&02
defb &03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03,&03
defb &04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04,&04
defb &05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05,&05
defb &06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06,&06
defb &07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07,&07
defb &08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08,&08
defb &09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09,&09
defb &0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A,&0A
defb &0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B,&0B
defb &0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C,&0C
defb &0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D,&0D
defb &0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E,&0E
defb &0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F,&0F
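So the second routine is slightly faster in raw T-states (48 against 49 per byte), but on the CPC, where every instruction takes a whole number of microseconds, both come to 13 µs per byte, and the second has the overhead of the table...!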
Quote from: redbox on 12:17, 05 October 10
I understand how the routine and page-aligning works now, many thanks for the explanations :)
I see in your routine that you have forgotten to LD A,C after the first LD (DE),A : INC E, as otherwise the same value loaded from the look-up table would be copied to the ASIC twice...?
I was planning to swap the roles of HL and DE then I could use LD (DE),C for the final write.
This would make it slightly faster again because you don't need the ld a,c then.
Quote from: arnoldemu on 12:20, 05 October 10
I was planning to swap the roles of HL and DE then I could use LD (DE),C for the final write.
This would make it slightly faster again because you don't need the ld a,c then.
You mean you could use LD (HL),C ;)
But yes, this would revise it to 44 T-states per byte, or 12 µs per byte with CPC timings: 12 x 128 = 1536 microseconds, which would make it faster than the other routine (about 7.7% faster, at 12 µs per byte against 13) :)
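Since the CPC stretches every instruction to a whole number of microseconds, the per-byte cost of each routine is easier to compare in NOPs than in raw T-states. The Python sketch below adds them up. The NOP counts are the commonly quoted CPC figures and are worth checking against a timing chart or WinAPE's counter, and the `swapped` list is only an assumed reading of the HL/DE-swapped routine, not code posted in the thread.

```python
# Microseconds (NOP-equivalents) per instruction on the CPC.
NOPS = {
    "ld a,(hl)": 2, "ld a,(de)": 2, "ld a,(bc)": 2,
    "ld (de),a": 2, "ld (hl),a": 2, "ld (hl),c": 2,
    "ld c,a": 1, "ld a,c": 1, "rrca": 1,
    "inc e": 1, "inc l": 1,
}


def cost(instructions, repeats=128):
    """Return (microseconds per packed byte, microseconds per sprite)."""
    per_byte = sum(NOPS[i] for i in instructions)
    return per_byte, per_byte * repeats


rotate = ["ld a,(hl)", "ld (de),a", "inc e",
          "rrca", "rrca", "rrca", "rrca",
          "ld (de),a", "inc e", "inc l"]
table = ["ld a,(hl)", "ld c,a", "ld a,(bc)", "ld (de),a", "inc e",
         "ld a,c", "ld (de),a", "inc e", "inc l"]
# Assumed form of the swapped version: DE = source, HL = ASIC.
swapped = ["ld a,(de)", "ld c,a", "ld a,(bc)", "ld (hl),a", "inc l",
           "ld (hl),c", "inc l", "inc e"]

for name, routine in (("rotate", rotate), ("table", table),
                      ("swapped", swapped)):
    print(name, cost(routine))
```

Counted this way, the rotate and table versions tie and only the swapped version pulls ahead, because the 7 T-state instructions each cost 2 µs on the CPC.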
Below is an alternative version using the stack to fetch and process 2 bytes of packed data in one go, which is a little bit faster but comes with constraints on the interrupts (which might be a no-go in some cases).
org &1000
run $,spriteDepack
asic_sprite0 equ &4000
; Fill the lookup table
call spriteDepack_init
; Depack sprite
ld hl,data_sprite
ld de,asic_sprite0
call spriteDepack
ret
; Copy packed sprite into the asic RAM
;
; Input
; HL = address of packed sprite data
; DE = address of ASIC sprite
spriteDepack:
; disable interrupts and init the stack
di
ld (spriteBlit_var_sp),sp
ld sp,hl
ex de,hl
; Init the lookup table pointer
ld d,lut_depackSprite / 256
repeat 64
; fetch 4 packed pixels
pop bc ;3
; write pixel 1
ld (hl),c ;2
inc l ;1
; lookup pixel 2
ld e,c ;1
ld a,(de) ;2
; write pixel 2
ld (hl),a ;2
inc l ;1
; write pixel 3
ld (hl),b ;2
inc l ;1
; lookup pixel 4
ld e,b ;1
ld a,(de) ;2
; write pixel 4
ld (hl),a ;2
inc l ;1 (<- will be overwritten with the ORG adjustment below)
;= 21us @ 4 pixels
rend ;= 21*64 = 1344us
org $-1 ; move one byte back to overwrite an useless inc l
spriteBlit_var_sp equ $+1
ld sp,0
ei
ret ; 845 bytes... uuwwh!
; Fill the 256 bytes lookup table
spriteDepack_init:
ld hl,lut_depackSprite
spriteDepack_init_loop
ld a,l
rrca
rrca
rrca
rrca
;and %1111 ; would be more meaningful than useful =)
ld (hl),a
inc l
jr nz,spriteDepack_init_loop
ret
data_sprite ; some packed sprite data here
align 256
lut_depackSprite ; lookup table here
That's an interesting way to take it further Grim, but will this affect the DMA if you are using it on the Plus...?
This process has been incredibly interesting for me as I've learnt new techniques and also generally how to optimize code and I'm currently going over lots of routines and giving them the treatment :)
One other thing it's taught me is it's not possible to update all 16 hardware sprites in one 50hz frame! :o
Quote from: redbox on 13:57, 05 October 10
One other thing it's taught me is it's not possible to update all 16 hardware sprites in one 50hz frame! :o
Every project I've worked on so far has only required animation at 1/4 of 50hz at most. Unless you have a lot of small incremental frames you should be able to get away with updating, say 4 sprites every frame, moving through them in blocks of 4, and it would look perfectly fine.
Quote from: Axelay on 14:15, 05 October 10
Every project I've worked on so far has only required animation at 1/4 of 50hz at most. Unless you have a lot of small incremental frames you should be able to get away with updating, say 4 sprites every frame, moving through them in blocks of 4, and it would look perfectly fine.
I will give that a go and let you know!
Quote from: redbox on 13:57, 05 October 10
will this affect the DMA if you are using it on the Plus...?
It should not interfere in any way with the DMA (except if you're using DMA-interrupts where strange things could happen :).
Quote
One other thing it's taught me is it's not possible to update all 16 hardware sprites in one 50hz frame! :o
Indeed, even using raw sprite data and an uber-fast-copy routine, it takes a little bit more than one frame to update all the hardware sprites.
You can update all hardware sprites (16) in 16 ms, a frame has 20 ms, just use a LDI:LDI:LDi... construction ;)
Quote from: TFM/FS on 21:50, 05 October 10
You can update all hardware sprites (16) in 16 ms, a frame has 20 ms, just use a LDI:LDI:LDi... construction ;)
Errr... What?
One frame is 312*64 = 19968us with regular CRTC settings.
- 16 sprites
- 256 bytes each
- LDI = 5us / byte
16*256*5 = 20480us = 20.5ms
Am I missing something? (or... is that LDI runs in 4us in your parallel universe? :)
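The same sum as a small Python check, using only the figures quoted above (312 lines of 64 µs per frame, 16 sprites of 256 bytes, 5 µs per LDI). The constant names are just for illustration.

```python
LINES_PER_FRAME = 312     # regular CRTC settings
US_PER_LINE = 64
SPRITES = 16
BYTES_PER_SPRITE = 256    # one byte per pixel in ASIC RAM
LDI_US = 5

frame_us = LINES_PER_FRAME * US_PER_LINE
copy_us = SPRITES * BYTES_PER_SPRITE * LDI_US
print(frame_us, copy_us, copy_us - frame_us)
```

The copy alone overruns the frame by about half a millisecond, before counting any loop or setup overhead.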
It does ;) ;D
Quote from: TFM/FS on 01:53, 06 October 10
It does ;) ;D
I remember you've got a 6Mhz CPC?
Has anyone tried making a 6Mhz Plus yet?
Quote from: arnoldemu on 09:34, 07 October 10
Has anyone tried making a 6Mhz Plus yet?
Why not skip straight to a 16MHz Z80 CPU? I think the CPC used a 16MHz base clock, divided by four for the Z80 so you'd want to bypass that bit of logic.
Wonder if anyone has a Minimig compatible CPC FPGA implementation yet.
Not (yet) CPC compatible, but the V6Z80P+ is available in very small quantities at Retroleum. It costs £85 + p&p and uses a 16 MHz Z80 and an FPGA to generate a VGA display with blitter and sound; it has 1152KB of RAM.
I have no idea how to reprogram the FPGA to make it cpc compatible but it is theoretically possible.
The designer of this board is now working on an eZ80 system which I will also get when it becomes available.
The eZ80 runs at up to 50 mhz which would be equivalent to a z80 running at 200mhz, who needs core i7 :laugh:
There is a free tcp/ip stack written for this chip so net access should be possible, once a browser has been ported.
Quote from: arnoldemu on 09:34, 07 October 10
I remember you've got a 6Mhz CPC?
Has anyone tried making a 6Mhz Plus yet?
Not me, it's more complicated in the Plus. The crystal oscillator is different.
In the good old 6128 you basically only have to swap the crystal (24 MHz instead of 16 MHz) and the Z80 (B or H instead of A). BTW: the crystals should be switchable (with switches on all three connections; switching just one wire can also be OK, though not in every CPC).
Edit: Here (just from memory) is an attempt at a page:
http://www.cpcwiki.eu/index.php/6_MHz_CPC
Quote from: Briggsy on 14:47, 07 October 10
Why not skip straight to a 16MHz Z80 CPU? I think the CPC used a 16MHz base clock, divided by four for the Z80 so you'd want to bypass that bit of logic.
Wonder if anyone has a Minimig compatible CPC FPGA implementation yet.
Right. It's a 16 MHz crystal. You can use 6 MHz for the whole system (I only tested a few boards!) but not 8 MHz. 8 definitely doesn't work. If the Z80 is to run faster than 6 MHz then you have to use a separate crystal for the Z80 and one for the rest of the system.
The overclocking of the _WHOLE_ system has the following features:
- Faster CPU
- Better Graphics resolution
- Faster Floppy with formats holding 50% more data
- Sound is also better.
Disadvantages:
- Not all expansions can work with the increased bus speed
(Eprom Boards do, ROM-RAM-Boxes don't do).
Another possible routine:
ld b,xlatcs / 256
repeat 128
ld c,(hl) ;2
ldi ;7
ld a,(bc) ;9
ld (de),a ;11
inc e ;12
endr
align 256
.xlatcs
repeat 256
db $ / 16 and 15
endr
Shame the CPC 6128 didn't come at 6MHz by default, I think the Z80B was out by then. Extra costs were a no-no though for Amstrad.
Quote from: Briggsy on 14:46, 13 October 10
Shame the CPC 6128 didn't come at 6MHz by default, I think the Z80B was out by then. Extra costs were a no-no though for Amstrad.
Why not use 16 MHz? Look at Anne (the PCW Plus ;) ), it had a 16 MHz Z80. But even using a good old Z80H at 8 MHz would be nice.
However, if the Plus ran at a multiple of 4 MHz, it wouldn't be a problem to switch it back to 4 MHz for games (for example, or whenever needed). Amstrad just saved money. Making the ASIC faster could be one problem... ???
Did anybody ever try to replace the crystal of the CPC Plus?
Quote from: TFM/FS on 21:50, 05 October 10
You can update all hardware sprites (16) in 16 ms, a frame has 20 ms, just use a LDI:LDI:LDi... construction ;)
As Grim said, you can't do that, but you can do it with compiled sprites:
ld hl,#0304 ;3
push hl ;8
And sometimes you can get away with 8 bit loads or pushing the same value. Even if you set every pixel individually this way, it's 8 * 128 * 16 = 16384 us plus a bit of overhead for storing the stack pointer.
Well.... and whats about ....
LD HL,&XXXX:PUSH HL
LD HL,&XXXX:PUSH HL
LD HL,&XXXX:PUSH HL
LD HL,&XXXX:PUSH HL ;already 8 bytes transferred.... and counting ;D
...
..
.
... and so on an on and on...... ok, it uses a lot of memory but, it's damn quick... Now recalculate, but keep in mind folks that the PUSH writes two bytes ;D :laugh: ;)
EDIT: To explain this more in detail, the interrupts are off, the SP is saved and at the start of that "load all 16 sprites"-routine the SP points to the upper end of the sprite area (memory mapped I/O activated!) ... And you see you can do it all in one FRAME. TFM frames again... ;)
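Writing these LD HL / PUSH HL pairs by hand is tedious, so a generator on the PC side is the obvious helper. The Python sketch below is one hypothetical way to do it, assuming SP starts just past the end of the sprite so that the last two pixels are pushed first; `run_pushes` is a tiny model of the generated code, used only to confirm the byte order.

```python
def compile_sprite(pixels):
    """Return assembler lines that PUSH 'pixels' into place.
    SP is assumed to start just past the end of the sprite, so the
    last two pixels are pushed first (H lands at the higher address)."""
    assert len(pixels) % 2 == 0
    lines = []
    for i in range(len(pixels) - 2, -1, -2):
        low, high = pixels[i] & 15, pixels[i + 1] & 15
        lines.append("ld hl,&%02X%02X" % (high, low))
        lines.append("push hl")
    return lines


def run_pushes(lines, size):
    """Tiny model of the generated code, to check the byte order."""
    memory, sp, hl = bytearray(size), size, 0
    for line in lines:
        if line.startswith("ld hl,&"):
            hl = int(line[7:], 16)
        else:                           # push hl
            sp -= 2
            memory[sp] = hl & 0xFF      # L goes to the lower address
            memory[sp + 1] = hl >> 8    # H goes to the higher address
    return list(memory)


demo = [1, 2, 3, 4]
code = compile_sprite(demo)
print("\n".join(code))
print(run_pushes(code, len(demo)))
```

A real generator would also emit the DI, the SP save and restore, and could reuse a register when consecutive pairs hold the same value.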
BTW: I use similar techniques, but way more advanced, for FilmeMacher / MovieMaker
Quote from: Executioner on 01:57, 13 October 10
ld b,xlatcs / 256
repeat 128
ld c,(hl) ;2
ldi ;7
ld a,(bc) ;9
ld (de),a ;11
inc e ;12
endr
If the byte fetched by the ld c,(hl) is &00, the following LDI will decrement B. This will screw up the LUT pointer and half of the remaining pixels in the sprite (those fetched from the lookup table pointed by BC). Also, when C=&x0, the LDI will decrement x (which is used for the lookup) and corrupt another pixel. Unless the sprite graphics are not using transparency at all, it seems you're running into some problems with that.
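The effect is easy to reproduce with a small Python model of one pass through LD C,(HL) / LDI / LD A,(BC). This is only a sketch: `TABLE_PAGE` is an assumed table address, and a lookup that has wandered off the table page is reported as `None` rather than as whatever happens to be in memory there.

```python
TABLE_PAGE = 0x90   # assumed high byte of the page-aligned table


def table_entry(index):
    return (index >> 4) & 15        # same as 'db $ / 16 and 15'


def step(b, packed):
    """One pass of: LD C,(HL) / LDI / LD A,(BC).
    Returns (looked-up pixel or None if B left the table, new B)."""
    bc = (b << 8) | packed          # LD C,(HL)
    bc = (bc - 1) & 0xFFFF          # LDI decrements BC after copying
    b, c = bc >> 8, bc & 0xFF
    pixel = table_entry(c) if b == TABLE_PAGE else None
    return pixel, b


bad = [v for v in range(256) if step(TABLE_PAGE, v)[0] != v >> 4]
print([hex(v) for v in bad])        # packed bytes that look up wrongly
print(step(TABLE_PAGE, 0x00))       # B ends up one page too low
```

In this model the bytes that go wrong are exactly those with a zero low nibble, and a &00 byte additionally leaves B pointing at the wrong page for the rest of the sprite.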
Compiled hardware sprites are fast, but that's quite a big, heavy and sad machinery for a handful of pixels imo. Thank you Amstrad! (should have bought an Amiga sooner... :)
Quote from: TFM/FS on 06:06, 14 October 10
Well.... and whats about ....
LD HL,&XXXX:PUSH HL
LD HL,&XXXX:PUSH HL
Isn't that exactly what I had in my post?
Quote from: Grim on 07:28, 14 October 10
If the byte fetched by the ld c,(hl) is &00, the following LDI will decrement B. This will screw up the LUT pointer ...
Yes, that's a slight problem, but you can overcome it by restricting it to 15 colours in every second pixel, which should be enough in 99.99% of cases (i.e. for pixels A (0..14) and B (0..14), the value stored is (A + 1) * 16 + B, and the lookup table translates it back by storing (C / 16) - 1).
The compressed version of the PUSH mechanism can actually be nearly as memory efficient as compressed sprites for less complex sprites. A lot of the time you can simply do something like:
LD H,L
PUSH HL
PUSH HL
PUSH HL
PUSH HL
PUSH HL
PUSH HL
PUSH HL
PUSH HL
for a whole row of one solid colour.
I'm writing something with 3 colour (plus transparent) sprites (like Frogger has) and it just moves values around between registers HL, DE, BC and A and pushes whichever register it needs.
Quote from: Executioner on 12:57, 14 October 10
Isn't that exactly what I had in my post?
I don't know... which one was it? Ok... I start reading this looooong thread from the beginning.
But I'm glad that I'm not the only one with such ideas ;D
Quote
And sometimes you can get away with 8 bit loads or pushing the same value. Even if you set every pixel individually this way, it's 8 * 128 * 16 = 16384 us plus a bit of overhead for storing the stack pointer.
Little error :
LD HL,xxxx = 3 us
PUSH HL = 4 us
7 x 128 x 16 = 14.336 ms
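For comparison with the frame length, here is the corrected sum as a Python check. It uses only figures from the thread (3 µs for LD HL,nn, 4 µs for PUSH HL, 312 x 64 µs per frame); the constant names are illustrative.

```python
FRAME_US = 312 * 64            # one frame with regular CRTC settings
LD_HL_NN_US = 3
PUSH_HL_US = 4
PAIRS_PER_SPRITE = 128         # 256 bytes, two per PUSH
SPRITES = 16

total_us = (LD_HL_NN_US + PUSH_HL_US) * PAIRS_PER_SPRITE * SPRITES
print(total_us, FRAME_US - total_us)
```

That leaves a little over 5.6 ms of the frame for everything else, before the overhead of saving and restoring the stack pointer.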
Hey Longshot, good to see you here :) Right, push takes longer than pop if I remember right ;)
Quote from: Longshot on 15:05, 29 October 10
PUSH HL = 4 us
Yeah, for some reason I had it in mind it took 5us. So it's even faster than I thought :)