Fastest cycles/byte data copy rate ? Better than 3µs/byte (sprites), 6µs/byte ?

cpcitor · 14:26, 18 January 13

Hi,

This new topic builds on the approach started in the previous one, "[Closed] Fastest cycles/byte memory write rate : answer 2µs per byte with PUSH." The idea is: "let's tackle programming situations one by one, starting from the simplest". I believe this can collectively improve the performance of prods on the CPC. Think of it as making a map of the abstract space of methods for programming the Z80. Once we have a map, everyone can more easily take the most efficient route.

The previous topic focuses on just filling a contiguous memory area with a constant or a periodic pattern. In that scenario we don't need to fetch new data along the way. I believe nothing can beat PUSH (for long enough runs) with 2µs/byte bursts.

This topic focuses on a different question: say you need to write a long, non-repeating pattern of bytes to a particular area.

I would start from the cheat sheet at Craving for speed ? A visual cheat sheet to help optimizing your code to death. and TFM/FS's remark :

Quote from: TFM/FS on 21:53, 16 January 13
- right, inc l only (fine in 256*256 screens without leaving borders)
- preload registers b, c, d, e before plotting sprites
- use ld (hl),&nn when other bytes are needed.

This remark is optimized for sprites (small runs of writes, perhaps repeating data).

Best solution for fixed data (sprites):

Code:

LD (HL),r ; 2µs, writes 1 byte from a preloaded register
LD (HL),n ; 3µs, writes 1 byte from an 8-bit immediate
INC L ; 1µs
; total 3 or 4µs/byte

Quote from: TFM/FS on 21:53, 16 January 13
Oh... and you don't load from stack or memory. YOU create a single routine for every sprite itself! That makes it quick (else I would POP data from stack).

Yes, a dedicated routine for each graphic lets us unroll all loops and take all shortcuts.
Moreover, since the dedicated routine always works with the same data, we can use immediate values. What do we gain? (1) We don't have to maintain a source pointer (it's PC), and (2) in particular there is no need to increment or decrement it: our source pointer is the self-incrementing PC!

But what about situations where the data is variable?
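To make this concrete, here is a minimal sketch of what such a dedicated routine could look like for one 4-byte sprite row. The label `draw_row` and the byte values are made up for illustration, and the timings are the ones quoted above plus 2µs for `LD r,n`. It assumes HL already points at the row in screen memory and that L does not wrap within the row.

```asm
; Hypothetical compiled sprite row: bytes &C0, &C0, &3F, &C0
; Assumes HL = screen address of the row and L does not wrap within it.
draw_row:
        ld b,&C0        ; 2µs, preload the value that appears three times
        ld (hl),b       ; 2µs, byte 0
        inc l           ; 1µs
        ld (hl),b       ; 2µs, byte 1
        inc l           ; 1µs
        ld (hl),&3F     ; 3µs, byte 2 (used once, so an immediate is fine)
        inc l           ; 1µs
        ld (hl),b       ; 2µs, byte 3
        ; 14µs for 4 bytes, versus 15µs with four LD (HL),n
```

Preloading costs 2µs and saves 1µs per use, so it only pays off when a value is used more than twice (or when the register already holds it from an earlier row).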

Code:

LD A, (HL) ; 2µs, reads 1 byte
INC L ; 1µs, assumes no carry, be careful about source memory alignment !
LD (DE), A ; 2µs, writes 1 byte
INC E ; 1µs, assumes no carry, ok if screen width aligned.
; total 6µs/byte

With TFM/FS's hint, the fixed-data routine takes between 1/2 and 2/3 of the time of this generic copy (3 or 4µs/byte versus 6µs/byte).
That means between +50% and +100% more sprite area covered within a given time constraint.

Do you think we can do better ?

Sykobee (Briggsy) · 14:57, 18 January 13

The software sprite method is also good in that you can directly skip transparent bytes.
However it doesn't handle transparent pixels, as would be required for many sprites (although you can get away with it in MODE 0 often).
Maybe you could also illustrate the memory use for a 4x8 (MODE 0) or 8x8 (MODE 1) sprite / block for each method.

db6128 · 14:58, 18 January 13

Yes, we can get faster, because that's exactly what LDI is for. It takes 5 NOPs and does LD A,(HL):INC HL:LD (DE),A:INC DE:DEC BC.

So, you save one NOP, and you don't have to worry about keeping the pointers within a single 8-bit page each. The downside is that BC is altered, but for the sake of 1 NOP per byte, this may well be an acceptable trade-off.

fano · 15:16, 18 January 13

Nothing more about this "micro" approach, just a thought about sprite compiling.
I think you could improve the fixed-data approach with good analysis of the sprite data. It should be possible to reload a register that costs 3 NOPs when the potential gain exceeds the cost of the operation.

cpcitor · 16:24, 18 January 13

Quote from: db6128 on 14:58, 18 January 13
Yes, we can get faster, because that's exactly what LDI is for. It takes 5 NOPs and does LD A,(HL):INC HL:LD (DE),A:INC DE:DEC BC.

So, you save one NOP, and you don't have to worry about keeping the pointers within a single 8-bit page each. The downside is that BC is altered, but for the sake of 1 NOP per byte, this may well be an acceptable trade-off.

Oh, you're right! So LDI/LDIR are not so bad after all. So the power of unrolled PUSH compared with unrolled LDI comes from the "no data to load" hypothesis (and what a gain! 2 instead of 5).
And unrolled LDI is 5µs/byte instead of 6 for LDIR, which is still interesting.
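As a rough illustration, an unrolled LDI copy of 8 bytes might look like the sketch below. The label `copy8` is hypothetical; the code assumes HL holds the source address and DE the destination, and that the caller does not mind BC being decremented by 8.

```asm
; Hypothetical unrolled copy of 8 bytes.
; Assumes HL = source, DE = destination. BC is decremented once per LDI.
copy8:
        ldi             ; 5µs: (DE) <- (HL), then HL+1, DE+1, BC-1
        ldi
        ldi
        ldi
        ldi
        ldi
        ldi
        ldi             ; 8 x 5µs = 40µs in total
        ret
```

For comparison, LDIR with BC=8 would take 7 x 6 + 5 = 47µs, since its last iteration does not loop back. The price of unrolling is code size: two bytes of code per byte copied.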

TFM · 18:42, 18 January 13

Well, what I did was to create a kind of 'compiler' which analyses the data of a sprite and outputs the source-code of a routine to plot that sprite on screen.

However, if you use sprites running over a background then a more generic approach is beneficial. (But time consuming: Read screen RAM byte, (store if not buffering the whole screen), mask it, add sprite data, write back to screen RAM).
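A sketch of the read/mask/add/write-back step for a single byte, under the assumption that the mask and sprite data are compiled in as immediates (the values &AA and &41 are made up, and the optional "store the background" step is left out):

```asm
; Hypothetical masked plot of one byte.
; Assumes HL = screen address and L does not wrap.
        ld a,(hl)       ; 2µs, read the screen RAM byte
        and &AA         ; 2µs, keep only the background bits (mask)
        or &41          ; 2µs, add sprite bits (must lie outside the mask)
        ld (hl),a       ; 2µs, write back to screen RAM
        inc l           ; 1µs
        ; total 9µs/byte
```

That is noticeably slower than the 3 or 4µs/byte of the opaque fixed-data routine, which matches the "time consuming" remark.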
