Topic: Fastest cycles/byte data copy rate? Better than 3µs/byte (sprites), 6µs/byte?

Offline cpcitor

  • The user previously known as FindYWay
  • 464 Plus
  • *****
  • Posts: 351
  • Country: fr
  • My heart still runs on traditional CPC.
    • My code for the CPC.
  • Liked: 192
  • Likes Given: 449
Hi,

This new topic builds on the approach started in the previous one, "[Closed] Fastest cycles/byte memory write rate : answer 2µs per byte with PUSH". The idea is to tackle programming situations one by one, starting from the simplest. I believe this can collectively improve the performance of productions on CPC. Think of it as drawing a map of the abstract space of methods for programming the Z80: once we have a map, everyone can more easily take the most efficient route.

The previous topic focused on just filling a contiguous memory area with a constant or a periodic pattern. In that scenario we don't need to fetch new data along the way. I believe nothing can beat PUSH (for long enough runs), with bursts at 2µs/byte.
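For reference, the PUSH burst mentioned above could be sketched as follows. This is only a minimal sketch under assumptions: the names `fill_end`, `PATTERN` and `saved_sp`, the addresses and the 16-byte run are made up for illustration. Interrupts are disabled because SP is borrowed as the write pointer.

```z80
; Minimal sketch, hypothetical names and addresses.
; Fills the 16 bytes from &C000 to &C00F with a 2-byte pattern.
fill_end equ &C010              ; assumption: first address AFTER the area
PATTERN  equ &F0F0              ; assumption: made-up 16-bit pattern

fill16:
        di                      ; an interrupt would push its return address into our data
        ld (saved_sp),sp        ; remember the real stack pointer
        ld sp,fill_end          ; PUSH writes downwards, so start at the end
        ld hl,PATTERN           ; H lands at the higher address, L at the lower
        push hl                 ; 4µs, writes 2 bytes
        push hl
        push hl
        push hl
        push hl
        push hl
        push hl
        push hl                 ; 8 x 4µs = 32µs for 16 bytes, i.e. 2µs/byte
        ld sp,(saved_sp)        ; restore the real stack before returning
        ei
        ret
saved_sp:
        defw 0
```

The setup and teardown around the PUSH run are a fixed cost, which is why the 2µs/byte figure only holds for long enough runs.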


This topic focuses on a different question: say you need to write a long non-repeating pattern of bytes to a particular area.


I would start from the cheat sheet at Craving for speed ? A visual cheat sheet to help optimizing your code to death. and TFM/FS's remark :

- right, inc l only (fine in 256*256 screens without leaving borders)
- preload registers b, c, d, e before plotting sprites
- use ld (hl),&nn when other bytes are needed.

This remark is optimized for sprites (small runs of writes, perhaps repeating data).

Best solution for fixed data (sprites).

Code: [Select]
LD (HL),r ; 2µs, writes 1 byte from a preloaded register
LD (HL),n ; 3µs, writes 1 byte from an 8-bit immediate
INC L ; 1µs, assumes no carry into H
; total 3 or 4µs/byte

Oh... and you don't load from stack or memory. YOU create a single routine for every sprite itself! That makes it quick (else I would POP data from stack).

Yes, a dedicated routine for each graphic allows us to unroll all loops and take every shortcut.
What's more, since the dedicated routine always works with the same data, we can use immediate values. What do we gain? (1) We don't have to maintain a source pointer (it's PC), and (2) in particular there is no need to increment or decrement it: our source pointer is the self-incrementing PC!
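To make this concrete, here is what such a dedicated routine might look like for a single row of a made-up sprite. It is only a sketch: the label `plot_row` and the byte values are assumptions, and the four bytes are assumed not to cross a 256-byte page so that INC L is enough.

```z80
; Hypothetical sprite row, 4 bytes wide: &F0, &F0, &3C, &F0 (made-up values).
; Entry: HL = screen address of the leftmost byte of the row.
plot_row:
        ld b,&F0                ; 2µs, preload the value that is used three times
        ld (hl),b               ; 2µs, byte 0
        inc l                   ; 1µs
        ld (hl),b               ; 2µs, byte 1
        inc l                   ; 1µs
        ld (hl),&3C             ; 3µs, byte 2, used only once so an immediate is enough
        inc l                   ; 1µs
        ld (hl),b               ; 2µs, byte 3
        ret                     ; 3µs
; 14µs for 4 bytes without the RET, i.e. 3.5µs/byte
```

With these timings, preloading a register (2µs) pays off only from the third use of the same value: two uses cost 2 + 2 + 2 = 6µs either way, which is the same as two immediates at 3µs each.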

But what about situations where the data is variable?

Code: [Select]
LD A, (HL) ; 2µs, reads 1 byte
INC L ; 1µs, assumes no carry, be careful about source memory alignment !
LD (DE), A ; 2µs, writes 1 byte
INC E ; 1µs, assumes no carry, ok if screen width aligned.
; total 6µs/byte

With TFM/FS's hint, the time per byte drops to between 1/2 and 2/3 of the generic copy (3 or 4µs instead of 6µs).
That means between +50% and +100% sprite area covered within a given time budget.

Do you think we can do better ?
Had a CPC since 1985, currently software dev professional, including embedded systems.

In 2013 I made the first CPC cross-dev environment that auto-installs a C compiler and tools: cpc-dev-tool-chain: a portable toolchain for C/ASM development targeting CPC, later forked into CPCTelera.

Offline Sykobee (Briggsy)

  • 6128 Plus
  • ******
  • Posts: 828
  • Country: gb
  • Liked: 313
  • Likes Given: 490
The software sprite method is also good in that you can directly skip transparent bytes.
However it doesn't handle transparent pixels, as would be required for many sprites (although you can get away with it in MODE 0 often).
Maybe you could also illustrate the memory use for a 4x8 (MODE 0) or 8x8 (MODE 1) sprite / block for each method.

Offline db6128

  • 464 Plus
  • *****
  • Posts: 316
  • Country: gb
  • We don’t speak 8080 in this house.
  • Liked: 72
  • Likes Given: 44
Yes, we can get faster, because that’s exactly what LDI is for. It takes 5 NOPs and does LD A,(HL):INC HL:LD (DE),A:INC DE:DEC BC.
 
So, you save one NOP, and you don’t have to worry about keeping the pointers within a single 8-bit page each. The downside is that BC is altered, but for the sake of 1 NOP per byte, this may well be an acceptable trade-off.
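A minimal sketch of the unrolled form, assuming made-up labels `copy4`, `src` and `dst` and a 4-byte run. Note that LDI performs the transfer internally, so A itself is not modified, while BC goes down by one per LDI.

```z80
; Minimal sketch, hypothetical labels: copy 4 bytes from src to dst.
copy4:
        ld hl,src               ; source pointer
        ld de,dst               ; destination pointer
        ldi                     ; 5µs: (DE) <- (HL), then HL+1, DE+1, BC-1
        ldi
        ldi
        ldi                     ; 4 x 5µs = 20µs for 4 bytes
        ret                     ; BC is now 4 less than on entry, A is unchanged
src:
        defb 1,2,3,4            ; made-up source data
dst:
        defs 4                  ; destination buffer
```

If BC holds something useful, it has to be saved around the run or recomputed afterwards, which is the trade-off described above.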
[The owner of one of the few existing cartridges of Chase HQ 2] mentioned to me that unless someone could find a way to guarantee the code wouldn't be duplicated to anyone else, he wouldn't be interested.
Did he also say things like "My treasureeeeee" and is he a little grey guy?

Offline fano

  • Supporter
  • 6128 Plus
  • *
  • Posts: 835
  • Country: fr
  • Easter Egg Programmer
    • Easter Egg
  • Liked: 278
  • Likes Given: 614
Nothing more to add about this "micro" approach, just a thought about sprite compiling.
I think you could improve the fixed data approach with a good analysis of the sprite data. It should be possible to reload a register, which costs 3 NOPs, whenever the potential gain outweighs the cost of the operation.
"NOP" is the perfect program : short , fast and (known) bug free

Follow Easter Egg products on Facebook !

Offline cpcitor

Quote from: db6128
Yes, we can get faster, because that’s exactly what LDI is for. It takes 5 NOPs and does LD A,(HL):INC HL:LD (DE),A:INC DE:DEC BC.
 
So, you save one NOP, and you don’t have to worry about keeping the pointers within a single 8-bit page each. The downside is that BC is altered, but for the sake of 1 NOP per byte, this may well be an acceptable trade-off.

Oh, you're right! So LDI/LDIR are not so bad after all. So the power of unrolled PUSH compared with unrolled LDI comes from the "no data to load" hypothesis (and what a gain! 2 instead of 5).
And unrolled LDI is 5µs/byte instead of 6 for LDIR, which is still interesting.
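For runs too long to unroll completely, a middle ground is to unroll a small block of LDI and loop on it. The sketch below is one possible way to do it, not something proposed in the thread: the names `copy_block`, `copy8`, `src`, `dst` and `COUNT` and the addresses are made up, and COUNT is assumed to be a non-zero multiple of 8. It relies on LDI leaving the P/V flag set as long as BC has not reached zero.

```z80
; Sketch, hypothetical names and addresses.
src   equ &4000                 ; assumption: source address
dst   equ &C000                 ; assumption: destination address
COUNT equ 64                    ; assumption: non-zero multiple of 8

copy_block:
        ld hl,src
        ld de,dst
        ld bc,COUNT
copy8:
        ldi                     ; 5µs each
        ldi
        ldi
        ldi
        ldi
        ldi
        ldi
        ldi
        jp pe,copy8             ; 3µs, loops while BC is not zero (P/V set by LDI)
        ret
; 8 x 5µs + 3µs = 43µs per 8 bytes, about 5.4µs/byte
```

A bigger block brings the cost closer to 5µs/byte at the price of more code, whereas LDIR stays at 6µs/byte for every byte but the last.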

Offline TFM

  • Visit the mysteries of the CPC at www.futureos.de
  • Supporter
  • 6128 Plus
  • *
  • Posts: 9.899
  • Country: aq
  • Space Chicken for FutureOS is free!
    • index.php?action=treasury
    • FutureOS - The revolution on CPC!
  • Liked: 1985
  • Likes Given: 4650
Well, what I did was to create a kind of 'compiler' which analyses the data of a sprite and outputs the source-code of a routine to plot that sprite on screen.
 
However, if you use sprites running over a background, then a more generic approach is beneficial. (But it is time consuming: read the screen RAM byte, store it if you are not buffering the whole screen, mask it, add the sprite data, write back to screen RAM.)
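As an illustration of the per-byte cost of such a generic approach, here is a sketch of the read/mask/add/write sequence for one byte. The data layout is an assumption: sprite data stored as (mask, pixels) pairs, with neither the pair table nor the screen row crossing a 256-byte page. Saving the background is left out.

```z80
; Sketch for ONE byte, to be repeated (unrolled) for each byte of the row.
; Entry: DE = screen address, HL = sprite data as (mask, pixels) pairs (assumed layout).
        ld a,(de)               ; 2µs, read the background byte from screen RAM
        and (hl)                ; 2µs, mask: keep background where mask bits are 1
        inc l                   ; 1µs, move on to the pixels byte
        or (hl)                 ; 2µs, add the sprite pixels
        inc l                   ; 1µs, move on to the next pair
        ld (de),a               ; 2µs, write back to screen RAM
        inc e                   ; 1µs, next screen byte
; total 11µs/byte
```

At 11µs/byte under these assumptions, this is roughly three times slower than a dedicated routine at 3 or 4µs/byte.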
TFM of FutureSoft
Also visit the CPC and Plus users favorite OS: FutureOS - The Revolution on CPC6128 and 6128Plus