[Closed] Fastest cycles/byte memory write rate : answer 2µs per byte with PUSH.

Started by cpcitor, 12:41, 14 January 13


cpcitor

Hello,

The CPCWiki page "Programming:Filling memory with a byte" documents the long-known fact that the PUSH instruction writes more bytes per cycle than e.g. LDIR.

I remember (from 20 years ago) that the plain Z80 documentation read "PUSH takes 10 cycles, POP takes 11" (it stuck with me because I found it strange that they took different times, and wondered why POP was slower than PUSH). Around the same time, I read that LDIR took 21 cycles per byte.

Actual CPC timings indeed seem to be rounded up to the next multiple of 4 cycles, if we believe documentations:devices:z80 [Grimware].

There I read that LDIR runs at 6 NOPs per byte.

Question 1 : What is the actual speed of a PUSH-based memory fill ? (partial answer given)

Example code is given in the section "Using the stack".


push hl -> 3 nops
djnz PUSHLOOP -> 4 nops while in loop (3 at end)


That makes 7 NOPs per 2 bytes, i.e. 3.5 NOPs per byte.

It is already almost twice as fast as LDIR (3.5 versus 6 NOPs per byte) !
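The per-byte figure can be checked with a few lines of arithmetic. This is only a sketch: the NOP counts are the ones quoted in this post (the 3 NOPs for PUSH is itself disputed later in the thread), and the function and constant names are made up for illustration.

```python
from fractions import Fraction

# NOP counts as quoted in the post above (assumptions, not measurements).
PUSH_NOPS = 3            # PUSH HL, writes 2 bytes
DJNZ_TAKEN_NOPS = 4      # DJNZ while the loop is still running
LDIR_NOPS_PER_BYTE = 6   # LDIR, per byte copied


def nops_per_byte(nops_per_iteration, bytes_per_iteration):
    """Average cost of one written byte, as an exact fraction."""
    return Fraction(nops_per_iteration, bytes_per_iteration)


# One loop iteration is PUSH HL + DJNZ and writes 2 bytes.
push_loop = nops_per_byte(PUSH_NOPS + DJNZ_TAKEN_NOPS, 2)
speedup = LDIR_NOPS_PER_BYTE / push_loop

print(f"PUSH + DJNZ loop: {float(push_loop)} NOPs per byte")
print(f"Speed-up over LDIR: {float(speedup):.2f}x")
```

Changing PUSH_NOPS to 4 shows how much the conclusion depends on that single timing.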

This estimate is good for big areas to fill. For small ones, the setup cost is not negligible.

Has anyone made a complete analysis ?

Question 2 : Can we do better ?

Yes. We can partially unroll the loop.

There are variants. The actual speed will be closer to 3 nops per 2 bytes, again twice as fast as before.

This is about 4 times faster than LDIR !
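To see how quickly partial unrolling approaches the 3 NOPs per 2 bytes limit, here is a small sketch. It assumes the timings used in this post (3 NOPs per PUSH, 4 NOPs for the loop instruction); the function name and its parameters are hypothetical.

```python
from fractions import Fraction


def unrolled_nops_per_byte(unroll, push_nops=3, loop_nops=4):
    """Average NOPs per byte when `unroll` PUSHes share one loop instruction.

    Each PUSH writes 2 bytes; the loop instruction (e.g. DJNZ) is paid once
    per pass over the unrolled block. Default timings are assumptions taken
    from the post, not measurements.
    """
    if unroll < 1:
        raise ValueError("unroll must be at least 1")
    return Fraction(unroll * push_nops + loop_nops, 2 * unroll)


for unroll in (1, 4, 16, 256):
    cost = unrolled_nops_per_byte(unroll)
    print(f"{unroll:3d} PUSH per pass: {float(cost):.3f} NOPs per byte")
```

With 256 PUSH per pass the loop overhead is already below 0.01 NOP per byte, so unrolling further buys almost nothing.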

* If the length to write is always the same, just put many PUSH in a row.
* If that takes too much memory, just unroll partially, like 256 PUSH in a row. The loop overhead is then paid only once per 256 PUSH, which is not often.

* If length is to be variable and not too long, you can jump in the middle of the list of PUSH. For example, in pseudo-code :


If nbytes_to_write is odd, write the extra byte and DEC nbytes_to_write.
Compute the address to jump to (end of the PUSH area - nbytes_to_write / 2, since each 1-byte PUSH HL writes 2 bytes)
Jump
PUSH HL ; area with many PUSH HL
Figure out if we should continue


* If length is to be variable and long, things can be combined.
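The "jump into the middle of the PUSH run" idea boils down to a small address calculation, sketched here in Python rather than Z80. It relies on PUSH HL being a one-byte opcode that writes two bytes; the function name, the tuple it returns and the run length of 256 are illustrative assumptions.

```python
PUSH_HL_OPCODE_SIZE = 1   # PUSH HL assembles to a single byte (&E5)
BYTES_PER_PUSH = 2        # each PUSH writes one 16-bit register pair


def plan_fill(nbytes, run_length):
    """Return (extra_byte, entry_offset) for filling `nbytes` bytes.

    extra_byte   -- True if one byte must be written separately (odd length)
    entry_offset -- offset in bytes from the start of the PUSH run at which
                    to jump in, so that only the needed PUSHes execute
    """
    if nbytes < 0:
        raise ValueError("nbytes must not be negative")
    extra_byte = nbytes % 2 == 1
    pushes = nbytes // BYTES_PER_PUSH
    if pushes > run_length:
        raise ValueError("length exceeds what one pass over the run can write")
    # Skipping the first (run_length - pushes) opcodes leaves exactly
    # `pushes` PUSH HL instructions before the end of the run.
    entry_offset = (run_length - pushes) * PUSH_HL_OPCODE_SIZE
    return extra_byte, entry_offset


print(plan_fill(11, 256))   # odd length: one extra byte, then 5 PUSHes
print(plan_fill(512, 256))  # the whole run is executed
```

For lengths longer than one pass, the same calculation would apply to the remainder, with full passes over the run handling the rest.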

Wiping the CPC screen requires 16000 bytes (if optimizing to not write the invisible scrolling area). That would take 8000 PUSH or 24000 NOPs. A full screen retrace is about 20000 NOPs, see Frame flyback and interrupts.

That means, if we reduce the screen area a little, a CPC *can* write the full screen memory in a single frame.

Question 3 : Does an instruction exist on the CPC that can write at an average speed better than 3 NOPs per 2 bytes ?

EDIT : Changed title to close the topic.
Had a CPC since 1985, currently software dev professional, including embedded systems.

I made in 2013 the first CPC cross-dev environment that auto-installs C compiler and tools: cpc-dev-tool-chain: a portable toolchain for C/ASM development targetting CPC, later forked into CPCTelera.

Optimus

Iirc PUSH takes 4 NOPs. It's POP that takes 3 NOPs.


There are 19968 NOPs per VBL. PUSH writes 2 bytes in 4 NOPs, so 2 NOPs per byte. That's 9984 bytes per frame at best, i.e. 62.4% of the CPC full screen.
Still, 25 fps is not bad for the Sub Hunter parallax scrolling, where this technique is used.
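The frame budget above can be reproduced for both PUSH timings discussed in this thread. This is a rough upper bound only: it assumes back-to-back PUSHes with no loop, setup or interrupt overhead, and the constant and function names are illustrative.

```python
NOPS_PER_FRAME = 19968     # figure quoted above for one frame
SCREEN_BYTES = 16000       # visible screen bytes, as in the first post


def bytes_per_frame(push_nops, nops_available=NOPS_PER_FRAME):
    """Upper bound on bytes written in one frame by back-to-back PUSHes."""
    return (nops_available // push_nops) * 2


for push_nops in (3, 4):
    written = bytes_per_frame(push_nops)
    share = 100 * written / SCREEN_BYTES
    print(f"PUSH = {push_nops} NOPs: {written} bytes per frame, "
          f"{share:.1f}% of the screen")
```

At 4 NOPs per PUSH this gives the 9984 bytes and 62.4% quoted above; at 3 NOPs it would be 13312 bytes, still short of a full 16000-byte screen.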

And the problem with PUSH is that you can't copy arbitrary graphics as you can with LDI. You have to preload as many registers as you can with your graphics beforehand. Usually, it's just the same repeating pattern on the screen (like the repeating clouds in Sub Hunter) or just the same color, useful maybe for filling flat polygons (I tried it recently in my rasterizer, 4 mode 0 pixels in one go, but I also had to write the edge pixels at the left and right of each scanline with some not very optimal code, and lost the cycles I was hoping to gain unless the polygon was very wide :P ).

cpcitor

Thank you for the information. How do you know they used that PUSH trick ? Which variant ? Unrolled loops ?

Writing even a constant value may be useful to clear an area in preparation for the next frame, for example in space simulations (Starion, Starglider, Elite, 3D Starstrike), flight simulators, or similar (e.g. Stunt Car Racer; that last one is more impressive technically than pleasing to the eye).

CPC GAME REVIEWS - S

TFM

Yes, you can be quicker than 1.5 µs. (BTW Optimus is surely right about the 2 µs/byte when using PUSH).

Now, how to get quicker... Take a look at the CPCWiki and search for the 6 MHz CPC. I created such a machine in the '90s and it runs like hell :) That way you may achieve about 1.3 µs.
TFM of FutureSoft
Also visit the CPC and Plus users favorite OS: FutureOS - The Revolution on CPC6128 and 6128Plus

Axelay

Quote from: FindYWay on 21:27, 14 January 13
Thank you for the information. How do you know they used that PUSH trick ? Which variant ? Unrolled loops ?



See for yourself. ;) It starts from the label ".PrintScroll", about halfway down the source file.




TFM

Well, that way works perfectly for smaller screens. But for bigger screens, real scrolling using the CRTC saves a lot of CPU time.

cpcitor

Quote from: TFM/FS on 21:43, 14 January 13
Yes, you can be quicker than 1.5 µs.

Technically you're right. I only used µs because grimware does.
You know I meant NOPs.

Quote from: TFM/FS on 21:43, 14 January 13
(BTW Optimus is surely right about the 2 µs/byte when using PUSH).

So http://www.grimware.org/doku.php/documentations/devices/z80#stack.operations is wrong ?
There, only the IX/IY-related PUSH/POP are 4 NOPs; the normal ones are 3 NOPs.

Quote
Now, how to get quicker... Take a look at the CPCWiki and search for the 6 MHz CPC. I created such a machine in the '90s and it runs like hell :) That way you may achieve about 1.3 µs.

I've seen that already, long before registering here. Is the MTBF impacted ?
Too bad it can't read normal disks... But it can read tapes, can't it ?

Quote from: Axelay on 08:44, 15 January 13

See for yourself. ;) It starts from the label ".PrintScroll", about halfway down the source file.

Smart code indeed.
Let's see which lines appear at least 10 times in that source code :

$ sort <SubHunter_Scroll.asm  | uniq -c | sort -rn | grep -v '^      '
    224     push de
    192     push bc
    104     inc l <-- cheaper than inc HL, yes
     88     exx  <-- wow, yes allows 4 different 16-bit values for texture. Smart !
     61
     31     dec a <-- primary counter
     18     ld h,a
     18     ld e,(hl)
     18     ld d,(hl)
     16     ld c,(hl)
     16     ld b,(hl)
     15 ;
     14     ld a,2
     12     ex af,af'


TFM

Quote from: FindYWay on 19:03, 15 January 13
So http://www.grimware.org/doku.php/documentations/devices/z80#stack.operations is wrong ?

Yes, that's wrong. POP is 3 µs and PUSH is 4 µs (for AF, BC, DE, HL).

Just write a test program and run it on a REAL CPC.

db6128

No need:
Unofficial Amstrad WWW Resource
Tested and verified on real hardware by Kevin (arnoldemu) and Richard (Executioner).
Quote from: Devilmarkus on 13:04, 27 February 12
Quote from: ukmarkh on 11:38, 27 February 12[The owner of one of the few existing cartridges of Chase HQ 2] mentioned to me that unless someone could find a way to guarantee the code wouldn't be duplicated to anyone else, he wouldn't be interested.
Did he also say things like "My treasureeeeee" and is he a little grey guy?

cpcitor

Ok, so it seems more or less proven that PUSH is the instruction with the highest throughput (lowest number of cycles per byte written) on the CPC.
Thanks for your hints.

TFM

Haha! In theory, yes... but for drawing my sprites I use a method that also needs only 2 µs per byte, and I don't have to deal with the problems associated with PUSH (DI, resetting SP etc.). It's the good old LD (HL),R instruction, where R is A, B, C, D, E, H or L.

db6128

Do you load those registers from memory or from the stack?

And I guess you make sure HL doesn't overflow onto another page, and use INC L instead of INC HL. :D

TFM

- right, inc l only (fine in 256*256 screens without leaving borders)
- preload registers b, c, d, e before plotting sprites
- use ld (hl),&nn when other bytes are needed.

But it's amazing how much you can print with only preloading b, c, d and e.


Oh... and you don't load from the stack or from memory. YOU create a separate routine for every sprite itself! That makes it quick (otherwise I would POP the data from the stack).
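A "one routine per sprite" generator could look roughly like the sketch below. It is not TFM's actual tool: the function name, the `preloaded` mapping and the choice to emit one INC L after every byte are all assumptions, and the output uses the `&nn` hex notation seen in the post above.

```python
def compile_sprite_row(row_bytes, preloaded):
    """Emit Z80 source lines that plot one row of sprite bytes.

    row_bytes -- iterable of byte values (0..255) for the row
    preloaded -- dict mapping a register name ("b", "c", "d", "e") to the
                 byte value it is assumed to hold before the routine runs
    """
    register_for = {value: reg for reg, value in preloaded.items()}
    lines = []
    for value in row_bytes:
        if not 0 <= value <= 255:
            raise ValueError(f"not a byte: {value}")
        if value in register_for:
            # Byte is already in a register: use the short LD (HL),r form.
            lines.append(f"ld (hl),{register_for[value]}")
        else:
            # Otherwise fall back to an immediate store.
            lines.append(f"ld (hl),&{value:02X}")
        # INC L only: assumes the row never crosses a 256-byte page.
        lines.append("inc l")
    return lines


for line in compile_sprite_row([0xF0, 0x12, 0xF0], {"b": 0xF0, "c": 0x0F}):
    print(line)
```

The more of a sprite's bytes match the preloaded registers, the more stores take the short register form, which is presumably why picking good values for b, c, d and e matters.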

cpcitor

Hi everyone!

Remember the chart of Z80 instructions visually aligned by timing (quicker on the left, slower on the right)?

Still alive after 10 years, I made a revised one.

* V1.1 2013-10-19
* V1.2 2022-03-06 clarified special cases of rp
* V1.3 2023-07-09 replace rp notation with explicit register list (only omitting IY). Benefit: use your viewer's search feature for e.g. SP and see all instructions that involve SP at a glance!

Now even the beginner scrutinizing the document can notice that there are instructions to load or save BC from a fixed address.

6 NOPs
LD (nnnn),BC
LD BC,(nnnn)

Sure, they're relatively slow, but even relatively slow operations sometimes have their use in a time-constrained situation. Loading/saving a 16-bit register at a known address (possibly computed elsewhere via self-modifying code) without impacting any other register may be useful sometimes (yes I know about EXX ;) ).

Anyway, here is the document attached!

(Should this create expectations of some new prod one day? Mmh, it might well be, but don't hold your breath. ::) )

norecess464

Quote from: cpcitor on 12:24, 09 July 23
Remember the chart of Z80 instructions visually aligned by timing (quicker on the left, slower on the right)?

Ohhhhh yes of course I remember it, I relied so heavily on it while creating the phX demo, it was very useful to me, thanks !!!!

Now I know by heart the timings for most of the Z80 opcodes so I only open that doc on rare occasions ;)

Thanks for the update!
My personal website: https://norecess.cpcscene.net
My current project is Sonic GX, a remake of Sonic the Hedgehog for the awesome Amstrad GX-4000 game console!

cpcitor

Quote from: norecess on 13:46, 09 July 23
Ohhhhh yes of course I remember it, I relied so heavily on it while creating the phX demo, it was very useful to me, thanks !!!!

Great!

Just watched phX demo again. Impressive feat, varied parts in quick successions!

I'm happy my doc was useful in creating it.

Quote from: norecess on 13:46, 09 July 23
Now I know by heart the timings for most of the Z80 opcodes so I only open that doc on rare occasions ;)

Thanks! You pinpointed it exactly: gathering information and presenting it in a convenient manner helps see patterns and memorize them.

I experienced the exact same effect: making this document helped me learn the timings and I generally know them by heart. Which is handy when sorting out different ways to code something right in your head.

Quote from: norecess on 13:46, 09 July 23
Thanks for the update!

Have a nice day!

GUNHED

Well, reading is even quicker, using POP BC/DE/HL for example: only 1.5 µs per byte. :laugh:
http://futureos.de --> Get the revolutionary FutureOS (Update: 2023.11.30)
http://futureos.cpc-live.com/files/LambdaSpeak_RSX_by_TFM.zip --> Get the RSX-ROM for LambdaSpeak :-) (Updated: 2021.12.26)

zhulien

Can you make a CLS ROM which takes no RAM but is as unrolled as possible, to clear the screen as fast as possible? Could a similar screen-copy ROM be made which is totally unrolled, to copy a screen from one area of RAM to C000?

I know we can point the CRTC to 4000, 8000 etc., but sometimes our software isn't ideal there.
