Hello,
The CPCWiki page "Programming:Filling memory with a byte" documents the long-known fact that the PUSH instruction writes more bytes per cycle than e.g. LDIR.
I remember (from 20 years ago) that the plain Z80 documentation read "PUSH takes 10 cycles, POP takes 11" (it stuck with me because I found it strange that they took different times, and wondered why POP would be slower than PUSH). Around the same time, I read that LDIR took 21 cycles per byte.
Actual CPC timings do indeed seem to be rounded up to the next multiple of 4 cycles, if we believe documentations:devices:z80 [Grimware].
There I read that LDIR takes 6 nops per byte.
Question 1 : What is the actual speed of a PUSH-based memory fill ? (partial answer given)
Example code is given in the section "Using the stack".
push hl -> 3 nops
djnz PUSHLOOP -> 4 nops while in loop (3 at end)
That makes 7 nops per 2 bytes.
It is already almost twice as fast as LDIR !
This estimate is good for big areas to fill. For small ones, the setup cost is not negligible.
Has anyone made a complete analysis ?
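To make the comparison above easier to play with, here is a minimal sketch of the arithmetic in Python. The per-instruction costs are simply the figures quoted in this post, passed as parameters so that values from another timing table can be substituted. The function names and the `setup_nops` allowance are hypothetical, only there for illustration.

```python
# Rough cost model for the two fill methods, in NOPs (1 NOP = 4 cycles on the CPC).
# All per-instruction costs are the figures quoted in this post, not measurements.


def ldir_fill_nops(n_bytes, ldir_nops_per_byte=6):
    """Approximate LDIR cost, ignoring register setup."""
    return n_bytes * ldir_nops_per_byte


def push_loop_nops(n_bytes, push_nops=3, djnz_taken_nops=4,
                   djnz_exit_nops=3, setup_nops=0):
    """Cost of a 'PUSH HL / DJNZ' loop that writes n_bytes.

    setup_nops is a hypothetical allowance for saving SP, loading HL and B, etc.
    """
    if n_bytes <= 0 or n_bytes % 2:
        raise ValueError("n_bytes must be a positive even number")
    iterations = n_bytes // 2
    if iterations > 256:
        raise ValueError("a single DJNZ loop runs at most 256 times")
    return (setup_nops
            + iterations * push_nops
            + (iterations - 1) * djnz_taken_nops   # DJNZ jumps back
            + djnz_exit_nops)                      # last DJNZ falls through


if __name__ == "__main__":
    # Compare both methods for a few sizes, with an assumed 10-NOP setup.
    for n in (2, 16, 64, 512):
        print(n, ldir_fill_nops(n), push_loop_nops(n, setup_nops=10))
```

With a non-zero `setup_nops`, the small sizes show how the fixed setup cost eats into the advantage of the PUSH loop.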
Question 2 : Can we do better ?
Yes. We can partially unroll the loop.
There are variants. The actual speed will be closer to 3 nops per 2 bytes, again about twice as fast as before.
This is about 4 times faster than LDIR !
* If the length to write is always the same, just put many PUSH in a row.
* If that takes too much memory, just unroll partially, e.g. 256 PUSH in a row. The loop branch is then executed only once per 256 PUSH, so its cost becomes negligible.
* If the length is variable and not too long, you can jump into the middle of the list of PUSH. For example, in pseudo-code :
If nbytes_to_write is odd, write one extra byte, then DEC nbytes_to_write.
Compute the address to jump to (end of the PUSH area - nbytes_to_write/2, since each PUSH HL is a 1-byte opcode that writes 2 bytes)
Jump
PUSH HL ;area with many PUSH HL
Figure out if we should continue
* If the length is variable and long, these techniques can be combined.
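The variants above can be sketched numerically. This is only an illustration in Python, not code for the CPC: `entry_address` computes where to jump inside a run of PUSH HL so that exactly the wanted number of bytes gets written, and `unrolled_fill_nops` estimates the cost of a partially unrolled loop. Both function names are made up, the loop is assumed to be closed by a DJNZ, and the NOP costs are again the figures quoted in this post.

```python
PUSH_HL_SIZE = 1  # PUSH HL assembles to the single byte 0xE5


def entry_address(list_end, n_bytes):
    """Address to jump to so that n_bytes // 2 PUSH HL run before list_end.

    list_end is the address just after the last PUSH HL of the unrolled area.
    An odd byte count must be handled before calling (write one byte, DEC count).
    """
    if n_bytes < 0 or n_bytes % 2:
        raise ValueError("n_bytes must be even and non-negative")
    return list_end - (n_bytes // 2) * PUSH_HL_SIZE


def unrolled_fill_nops(n_bytes, unroll=256, push_nops=3,
                       djnz_taken_nops=4, djnz_exit_nops=3):
    """Estimated cost when `unroll` PUSH HL sit between two loop branches."""
    if n_bytes <= 0 or n_bytes % 2:
        raise ValueError("n_bytes must be a positive even number")
    pushes = n_bytes // 2
    passes = -(-pushes // unroll)  # ceiling division: passes through the area
    return (pushes * push_nops
            + (passes - 1) * djnz_taken_nops
            + djnz_exit_nops)


if __name__ == "__main__":
    # Hypothetical PUSH area ending at 0x8100; write 10 bytes = 5 PUSH HL.
    print(hex(entry_address(0x8100, 10)))
    print(unrolled_fill_nops(16000) / 8000, "nops per 2 bytes")
```

The estimate ignores setup and the entry computation itself, so it is a lower bound on the real cost.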
Wiping the CPC screen requires 16000 bytes (if we optimize by not writing the invisible scrolling area). That would take 8000 PUSH, or 24000 nops. A full screen retrace takes about 20000 nops, see "Frame flyback and interrupts".
That means that, if we reduce the screen area a little, a CPC *can* write the full screen memory in a single frame.
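For anyone who wants to check the frame budget, here is a small Python sketch of the arithmetic. It assumes a 50 Hz frame of 312 lines of 64 nops each (which gives the "about 20000" figure), a standard screen of 200 lines of 80 bytes, and 3 nops per PUSH as quoted above. Loop overhead, setup and interrupts are ignored, so the real budget is somewhat tighter.

```python
# Assumed frame geometry: 312 scanlines of 64 NOPs each (50 Hz display).
LINES_PER_FRAME = 312
NOPS_PER_LINE = 64
FRAME_NOPS = LINES_PER_FRAME * NOPS_PER_LINE

# Assumed standard screen layout: 200 visible lines of 80 bytes = 16000 bytes.
BYTES_PER_SCREEN_LINE = 80


def wipe_nops(n_bytes, push_nops=3):
    """NOPs needed to write n_bytes with back-to-back PUSH HL (2 bytes each)."""
    return (n_bytes // 2) * push_nops


def max_bytes_in(budget_nops, push_nops=3):
    """Largest number of bytes that back-to-back PUSH HL can write in the budget."""
    return (budget_nops // push_nops) * 2


if __name__ == "__main__":
    print("full wipe:", wipe_nops(16000), "nops, frame:", FRAME_NOPS, "nops")
    fit = max_bytes_in(FRAME_NOPS)
    print("bytes per frame:", fit,
          "= about", fit // BYTES_PER_SCREEN_LINE, "screen lines")
```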
Question 3 : Does an instruction exist on the CPC that can write at an average speed better than 3 nops per 2 bytes ?
EDIT : Changed title to close the topic.