LDIR timing

litwr · 08:34, 21 May 23

I am curious how fast we can copy memory using LDIR. I assumed that LDIR needs 24 ticks per byte on the Amstrad,
so I used the formula 3.2*10^6/24 and got 133.33 KB/s. I have also done a practical test:

>5000 80
.  5001  21 00 50    LD    HL, 5000
.  5004  11 20 50    LD    DE, 5020
.  5007  01 00 40    LD    BC, 4000
.  500A  ED B0        LDIR
.  500C  3A 00 50    LD    A, (5000)
.  500F  3D          DEC  A
.  5010  32 00 50    LD    (5000), A
.  5013  C2 01 50    JP    NZ, 5001

This test gives an unexpected result: about 14 s, which means 16*1024*128/14, or about 150 KB/s. I use ep128emu.
Could anybody help me find out how fast LDIR really is on the CPC? Many thanks in advance.
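
For anyone wanting to redo the arithmetic in this post, here is a minimal Python sketch. The numbers come straight from the listing and the text (4000h bytes per LDIR, 80h passes, roughly 14 s). The function names are invented for illustration, the 14 s figure is only an approximate reading, and the few instructions around LDIR are ignored as negligible.

```python
# Rough check of the figures in the post above.
BLOCK_BYTES = 0x4000   # BC = 4000h bytes copied per LDIR
PASSES = 0x80          # counter byte poked into 5000h
ELAPSED_S = 14         # approximate run time reported for ep128emu


def measured_rate(block_bytes, passes, seconds):
    """Bytes per second actually observed in the test."""
    return block_bytes * passes / seconds


def predicted_rate(clock_hz, ticks_per_byte):
    """Bytes per second if every byte costs a fixed number of ticks."""
    return clock_hz / ticks_per_byte


if __name__ == "__main__":
    print("measured :", round(measured_rate(BLOCK_BYTES, PASSES, ELAPSED_S)))
    print("predicted:", round(predicted_rate(3.2e6, 24)))
```

The measured figure comes out higher than the 3.2 MHz prediction, which is the discrepancy the question is about.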

andycadley · 08:37, 21 May 23

Where did you get 3.2 from? The Z80 in the CPC is clocked at 4MHz (or close as)

litwr · 08:49, 21 May 23

Thank you. IMHO it is common to say that the effective frequency on the CPC is 3.2 MHz...
I have disabled interrupts during my test and got about 166 KB/s, which corresponds to 4*10^6/24. So why do people talk about 3.2 MHz?

roudoudou · 09:17, 21 May 23

LDIR on the CPC takes (BC-1)x6+5 NOPs (1 NOP = 4 ticks).
If executed with BC=0, compute the NOPs with BC=65536.
At 4 MHz on the CPC the theoretical transfer rate is 166,400 bytes per second.

You can go faster by unrolling LDI, as described here: https://map.grauw.nl/articles/fast_loops.php
Unrolled 16x as in the example, you can reach 192,462 bytes per second, and even 196,000 bytes per second when unrolling 32x.
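
A rough way to estimate the unrolled figures is sketched below in Python. It assumes the loop is N LDI instructions followed by one conditional jump back (as in the linked article), that LDI costs 5 NOPs on the CPC (the figure used later in this thread), and that the conditional JP costs 3 NOPs. That last value and the function name are assumptions, not something stated in the post.

```python
# Throughput estimate for a loop of N unrolled LDI plus one JP cc.
NOPS_PER_SECOND = 1_000_000  # 1 NOP = 4 ticks at 4 MHz = 1 microsecond
LDI_NOPS = 5                 # per LDI on the CPC
JP_CC_NOPS = 3               # assumed cost of the conditional jump


def unrolled_ldi_rate(unroll):
    """Bytes per second for a loop body of `unroll` LDI instructions."""
    nops_per_pass = unroll * LDI_NOPS + JP_CC_NOPS
    return unroll * NOPS_PER_SECOND / nops_per_pass


if __name__ == "__main__":
    for n in (1, 16, 32):
        print(n, round(unrolled_ldi_rate(n)))
```

Under these assumptions the 16x and 32x results land in the same range as the figures quoted above, though not exactly on them. The exact value depends on how the loop overhead is counted.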

andycadley · 10:37, 21 May 23

Quote from: litwr on 08:49, 21 May 23
Thank you. IMHO it is common to say that the effective frequency on the CPC is 3.2 MHz...
I have disabled interrupts during my test and got about 166 KB/s, which corresponds to 4*10^6/24. So why do people talk about 3.2 MHz?

The CPC introduces wait states periodically that slow instructions down, this is why LDIR takes 20/24 cycles rather than 16/21. So a lazy way of estimating the speed of the CPU is just to pick a slightly slower speed and claim that it's roughly equivalent.

But you're already taking that into account by using the actual CPC instruction length, so you have to use the proper CPU clock speed or you're effectively double counting the slowdown.
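
The double counting can be made concrete with a small Python sketch. It uses the 16/21 (plain Z80) and 20/24 (CPC) tick counts from this post. Turning their ratio into an "effective clock" is only an illustration, and the function names are made up.

```python
# "Effective clock" versus real clock, using the tick counts quoted above.
CLOCK_HZ = 4_000_000


def effective_clock(nominal_ticks, cpc_ticks):
    """Clock a wait-free Z80 would need to match the CPC on this instruction."""
    return CLOCK_HZ * nominal_ticks / cpc_ticks


def rate(clock_hz, ticks_per_byte):
    """Bytes per second at a fixed cost per byte."""
    return clock_hz / ticks_per_byte


if __name__ == "__main__":
    print(effective_clock(16, 20))   # last LDIR step: 16 ticks stretched to 20
    print(effective_clock(21, 24))   # repeating LDIR step: 21 stretched to 24
    print(round(rate(CLOCK_HZ, 24)))     # real clock, CPC tick count
    print(round(rate(3_200_000, 24)))    # slowdown counted twice
```

Either pair (real clock with CPC ticks, or a reduced clock with the plain Z80 ticks) gives a sensible estimate. Mixing the reduced clock with the CPC tick count applies the slowdown twice.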

litwr · 10:47, 21 May 23

Quote from: roudoudou on 09:17, 21 May 23
LDIR on the CPC takes (BC-1)x6+5 NOPs (1 NOP = 4 ticks).
If executed with BC=0, compute the NOPs with BC=65536.
At 4 MHz on the CPC the theoretical transfer rate is 166,400 bytes per second.

You can go faster by unrolling LDI, as described here: https://map.grauw.nl/articles/fast_loops.php
Unrolled 16x as in the example, you can reach 192,462 bytes per second, and even 196,000 bytes per second when unrolling 32x.

Why 166,400? IMHO it should rather be approximately 166,667. For 65536 bytes we get 4000000/(65535*6+5)*65536/4 ≈ 166667.1 bytes per second.
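
The same calculation as a minimal Python sketch, using the (BC-1)x6+5 NOP formula quoted above with BC=0 treated as 65536 and 1 NOP taken as 1 µs. The function names are hypothetical.

```python
# LDIR cost on the CPC from the (BC-1)*6+5 NOP formula.
NOPS_PER_SECOND = 1_000_000  # 4 MHz / 4 ticks per NOP


def ldir_nops(bc):
    """NOPs taken by one LDIR; BC = 0 means 65536 bytes."""
    count = bc if bc != 0 else 65536
    return (count - 1) * 6 + 5


def ldir_rate(bc):
    """Bytes per second for one LDIR of the given length."""
    count = bc if bc != 0 else 65536
    return count * NOPS_PER_SECOND / ldir_nops(bc)


if __name__ == "__main__":
    print(ldir_nops(0), ldir_rate(0))
    print(ldir_nops(0x4000), ldir_rate(0x4000))
```
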
BTW, some crazy people use code like the following to copy bytes:

    pop hl
    ld (addr),hl
    pop hl
    ld (addr+2),hl
    ...

This sequence is 15% faster than LDI/LDD.

EDIT.
Actually it is about 43% faster, not 15. I assume that the LD+POP sequence uses 3.5 NOPS per byte, while LDI uses 5 NOPS.

litwr · 10:56, 21 May 23

Quote from: andycadley on 10:37, 21 May 23
The CPC introduces wait states periodically that slow instructions down, this is why LDIR takes 20/24 cycles rather than 16/21. So a lazy way of estimating the speed of the CPU is just to pick a slightly slower speed and claim that it's roughly equivalent.

But you're already taking that into account by using the actual CPC instruction length, so you have to use the proper CPU clock speed or you're effectively double counting the slowdown.

So, LDI/LDD requires 5 NOPs (20 ticks), doesn't it? I thought it was only 4 NOPs. I am curious why LDI needs one more NOP while LD r,r doesn't. It is also very interesting that the 3.2 MHz approximation of the average CPU frequency is so accurate in many cases. Why is that?

andycadley · 11:05, 21 May 23

Quote from: litwr on 10:56, 21 May 23
So, LDI/LDD requires 5 NOPs (20 ticks), doesn't it? I thought it was only 4 NOPs. I am curious why LDI needs one more NOP while LD r,r doesn't. It is also very interesting that the 3.2 MHz approximation of the average CPU frequency is so accurate in many cases. Why is that?

It's because of how the gate array arbitrates bus cycles. Essentially it holds WAIT such that the Z80 can only access memory on specific cycles, this has the effect of stretching the internal M-cycles of the CPU to be a multiple of 4 T-states. So whether or not an instruction gets extended (and by how much) is all related to how many times it accesses memory (and precisely when).

3.2 is accurate for many cases because it's not just a number plucked out of the air, but based on real-world comparisons. People timed common routines on different machines, and then came up with rough estimates of how much slowdown each suffered based on the run time (using something like a routine running in uncontended memory on a Spectrum as a baseline). So it matches up because that's how it was derived in the first place.

abalore · 13:11, 21 May 23

Quote from: litwr on 10:47, 21 May 23
    pop hl
    ld (addr),hl
    pop hl
    ld (addr+2),hl
    ...

This sequence is 15% faster than LDI/LDD.
EDIT.
Actually it is about 43% faster, not 15. I assume that the LD+POP sequence uses 3.5 NOPS per byte, while LDI uses 5 NOPS.

That sequence is only useful for writing to fixed memory locations, while LDI/LDD lets you set the destination address. So they are not functionally equivalent.

GUNHED · 13:49, 21 May 23

Well, under FutureOS you can use the OS function 'F_MOVE' to copy blocks. It needs 5 µs/byte because it uses a batch of either LDI or LDD instructions. AFAIK only custom code is quicker, stuff like: POP BC:LD (HL),C:INC L:LD (HL),B ... (4 µs/byte).

MaV · 16:28, 21 May 23

"Effective frequency" is complete BS. The CPC's effective speed is perhaps close to 3.2 MHz, but its clock frequency is and has always been 4 MHz.

litwr · 17:25, 21 May 23

Thanks. It seems I picked the wrong section; the Programming section would have been more appropriate.

litwr · 17:29, 21 May 23

Yes, LDI/LDD is more general, but if you need the maximum transfer rate, the code mentioned above may be the right tool.

McArti0 · 19:43, 21 May 23

NOP, INC r ... run at 4 MHz

POP BC:LD (HL),C:INC L:LD (HL),B:INC L ... (4.5 µs/byte)
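
The per-byte cost of this sequence can be tallied as in the Python sketch below. The NOP counts in the table (POP BC = 3, LD (HL),r = 2, INC L = 1, INC HL = 2) are the commonly quoted CPC timings and are an assumption here, since the post does not list them. The names are made up for illustration.

```python
# Per-byte cost of the POP / LD (HL),r copy sequence, in NOPs (= microseconds).
NOPS = {
    "POP BC": 3,
    "LD (HL),C": 2,
    "LD (HL),B": 2,
    "INC L": 1,
    "INC HL": 2,   # needed instead of INC L when L wraps around
}

# One repeating unit copies two bytes.
PAIR = ["POP BC", "LD (HL),C", "INC L", "LD (HL),B", "INC L"]


def us_per_byte(sequence, n_bytes):
    """Average microseconds per byte for a sequence copying n_bytes."""
    return sum(NOPS[op] for op in sequence) / n_bytes


if __name__ == "__main__":
    print(us_per_byte(PAIR, 2))        # with both INC L counted
    print(us_per_byte(PAIR[:-1], 2))   # if the trailing INC L is left out
```

Counting the second INC L gives 4.5 µs per byte. Leaving it out gives 4 µs, but then the pointer is not ready for the next pair.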

GUNHED · 14:54, 22 May 23

Quote from: MaV on 16:28, 21 May 23
"Effective frequency" is complete BS. The CPC's effective speed is perhaps close to 3.2 MHz, but its clock frequency is and has always been 4 MHz.

Yes, 4 MHz. And as long as you use for example only Z80 opcodes with no GA-induced delay, then the 'effective frequency' can be 4 MHz in real life. For example: LD A,B:EX DE,HL:ADD A,H ... and all that.

All depends how one does the coding.

GUNHED · 14:57, 22 May 23

Quote from: McArti0 on 19:43, 21 May 23
NOP, INC r ... run at 4 MHz

POP BC:LD (HL),C:INC L:LD (HL),B:INC L ... (4.5 µs/byte)

Like I already wrote in my previous post, but be careful. Once in a while it needs an INC HL (one extra µs!).

Longshot · 01:25, 23 May 23

Quote from: litwr on 10:47, 21 May 23
Actually it is about 43% faster, not 15. I assume that the LD+POP sequence uses 3.5 NOPS per byte, while LDI uses 5 NOPS.

"POP HL" lasts 3 µsec and "LD (adr),HL" lasts 5 µsec.
4 µsec per byte, but with the drawback of an absolute pointer.
You can find additional information here, chapter 4.4.4 and chapter 25.
http://logonsystem.fr/down/ACCC1.5-EN.pdf
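
As a quick check of these numbers, here is a minimal Python sketch. It uses the timings given in this post (POP HL = 3 µs, LD (adr),HL = 5 µs) and the 5 µs per LDI mentioned earlier in the thread. The names are invented for illustration.

```python
# POP HL + LD (adr),HL versus LDI, using the timings from the thread.
POP_HL_US = 3
LD_ADR_HL_US = 5
LDI_US = 5             # one byte per LDI
BYTES_PER_PAIR = 2     # POP HL + LD (adr),HL moves two bytes


def pop_ld_us_per_byte():
    """Microseconds per byte for the POP HL / LD (adr),HL pair."""
    return (POP_HL_US + LD_ADR_HL_US) / BYTES_PER_PAIR


def speedup_percent(slow_us, fast_us):
    """How much faster the second method is, in percent."""
    return (slow_us / fast_us - 1) * 100


if __name__ == "__main__":
    per_byte = pop_ld_us_per_byte()
    print(per_byte, "us per byte")
    print(1_000_000 / per_byte, "bytes per second")
    print(speedup_percent(LDI_US, per_byte), "% faster than LDI")
```

With these timings the gain over LDI is 25%, rather than the 43% that follows from assuming 3.5 NOPs per byte.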

litwr · 11:15, 23 May 23

Quote from: Longshot on 01:25, 23 May 23
"POP HL" lasts 3 µsec and "LD (adr),HL" lasts 5 µsec.
4 µsec per byte, but with the drawback of an absolute pointer.
You can find additional information here, chapter 4.4.4 and chapter 25.
http://logonsystem.fr/down/ACCC1.5-EN.pdf

My gosh! I missed this great compendium. Thank you very much. I was sure that the CPC stretches instructions very simply, rounding the number of ticks up to the next multiple of four whenever it is not already one. So I was sure that LD (addr),HL takes 4 NOPs.

Optimus · 17:13, 23 May 23

<rant on>

Not a fan of this factoid being spread around that the CPC's Z80 runs not at 4 MHz but at something like 3.2 or 3.3 or 3.5.
It has even spread to well-known sites. For example, https://www.chibiakumas.com/z80/AmstradCPC.php has a well-presented table stating that the CPC has a 3.5 MHz Z80.
That's absolutely wrong, and the way it's presented in various places, you'd think it's the truth, as if it truly had that many cycles per second.

We know it's because the CPC hardware rounds up its instruction cycles, sure.
But it's the same as if I spread the rumour that an Amiga 500 has a 68000 running at 2.5 MHz instead of 7.16 MHz because of slow chip RAM, or that a 386DX at 40 MHz is really clocked at 27 MHz because of all kinds of bottlenecks, and without even mentioning the reason, just describing it as a fact that these platforms come underclocked like this. It's also arbitrary, so even that statement is as wrong as saying Doom is 2.5D (which means nothing to me).

</rant off>

Optimus · 17:47, 23 May 23

Oops, others mentioned the same thing. I think I sometimes run on "AKCZUAHLY" energy.

p.s. Isn't it nice that this loss (even if we lost that cycle) means we end up counting NOP cycles instead of actual T-states? It helped me a lot to easily sketch in my mind which instructions take from 1 to 6 NOP cycles, and to be aware of what's fast and what's to be avoided if possible, for optimizations.

Prodatron · 18:29, 23 May 23

Quote from: Optimus on 17:47, 23 May 23
p.s. Isn't it nice that this loss (even if we lost that cycle) means we end up counting NOP cycles instead of actual T-states? It helped me a lot to easily sketch in my mind which instructions take from 1 to 6 NOP cycles, and to be aware of what's fast and what's to be avoided if possible, for optimizations.

That's true

I don't begrudge the MSX guys for calculating their cycles in a more complicated way.
Best are the Enterprise 64/128 guys, which have "slow" (first 64K, shared with the video chip) and "fast" ram (>64K, Z80 runs at full speed) (like the Amiga).

andycadley · 19:03, 23 May 23

Quote from: Prodatron on 18:29, 23 May 23
That's true

I don't begrudge the MSX guys for calculating their cycles in a more complicated way.
Best are the Enterprise 64/128 guys, which have "slow" (first 64K, shared with the video chip) and "fast" ram (>64K, Z80 runs at full speed) (like the Amiga).

As does the Speccy, though understanding its contention behaviour (for the RAM that is contended) is absolute black magic. And different on different machines. And even different pages of memory on different 128s.

And it depends on how warm the machine is.
