I am curious how fast memory can be copied with LDIR. I assumed that LDIR needs 24 ticks per byte on the Amstrad CPC.
So I used the formula 3.2*10^6/24 and got 133.33 KB/s. I have also done a practical test:
>5000 80
. 5001 21 00 50 LD HL, 5000
. 5004 11 20 50 LD DE, 5020
. 5007 01 00 40 LD BC, 4000
. 500A ED B0 LDIR
. 500C 3A 00 50 LD A, (5000)
. 500F 3D DEC A
. 5010 32 00 50 LD (5000), A
. 5013 C2 01 50 JP NZ, 5001
This test gives an unexpected result of about 14 s, which means 16*1024*128/14, or about 150 KB/s. I used ep128emu.
Could anybody help me find out the real speed of LDIR on the CPC? Many thanks in advance.
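The two figures in this post can be reproduced with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and "KB" is taken as 1000 bytes.

```python
def estimated_rate(clock_hz, ticks_per_byte=24):
    """Bytes per second if every copied byte costs ticks_per_byte clock ticks."""
    return clock_hz / ticks_per_byte


def measured_rate(block_bytes, repeats, seconds):
    """Bytes per second observed when block_bytes is copied repeats times."""
    return block_bytes * repeats / seconds


if __name__ == "__main__":
    # Estimate from the post: 3.2 MHz "effective" clock, 24 ticks per byte.
    print(round(estimated_rate(3.2e6)))
    # Measurement from the post: 16 KB copied 128 times in about 14 s.
    print(round(measured_rate(16 * 1024, 128, 14)))
```

The same `estimated_rate` with a 4 MHz clock gives the figure that comes up later in the thread.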
Where did you get 3.2 from? The Z80 in the CPC is clocked at 4MHz (or close as)
Thank you. IMHO it is common to say that the effective frequency of the CPC is 3.2 MHz...
I disabled interrupts during my test and got about 166 KB/s, which corresponds to 4*10^6/24. So why do people talk about 3.2 MHz?
LDIR on the CPC takes (BC-1)x6+5 NOPs (1 NOP = 4 ticks).
If executed with BC=0, compute the NOPs with BC=65536.
At 4 MHz on the CPC the theoretical transfer rate is 166,400 bytes per second.
You can go faster by unrolling LDI like this: https://map.grauw.nl/articles/fast_loops.php
Unrolled 16x as in that example, you can reach 192,462 bytes per second, and even 196,000 bytes per second when unrolling 32x.
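As a rough cross-check of the unrolling figures, here is a sketch of the arithmetic. It assumes 5 NOPs per LDI and a 3-NOP conditional JP closing each pass of the loop; both values and the function name are my assumptions, and loop setup is ignored, so the results only approximate the numbers quoted above.

```python
NOPS_PER_SECOND = 1_000_000  # 1 NOP = 4 ticks at 4 MHz = 1 microsecond


def unrolled_ldi_rate(unroll, ldi_nops=5, jump_nops=3):
    """Bytes per second for a loop of `unroll` LDIs followed by one jump.

    ldi_nops and jump_nops are assumed CPC timings, not measured values.
    """
    nops_per_pass = unroll * ldi_nops + jump_nops
    return unroll * NOPS_PER_SECOND / nops_per_pass


if __name__ == "__main__":
    for unroll in (16, 32):
        print(unroll, round(unrolled_ldi_rate(unroll)))
```

The jump overhead is spread over more bytes as the unroll factor grows, which is why the rate creeps towards the 200,000 bytes per second limit of a bare LDI.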
Quote from: litwr on 08:49, 21 May 23
Thank you. IMHO It is common to say that the effective frequency on the CPC is 3.2 MHz...
I have disabled interrupts during my test and got about 166 KB/s that corresponds to 4*10^6/24. So why do people say about 3.2 MHz?
The CPC introduces wait states periodically that slow instructions down, this is why LDIR takes 20/24 cycles rather than 16/21. So a lazy way of estimating the speed of the CPU is just to pick a slightly slower speed and claim that it's roughly equivalent.
But you're already taking that into account by using the actual CPC instruction length, so you have to use the proper CPU clock speed or you're effectively double counting the slowdown.
Quote from: roudoudou on 09:17, 21 May 23
LDIR on CPC is (BC-1)x6+5 NOPS (1 NOP = 4 ticks)
If executed with BC=0 then compute NOPS with BC=65536
At 4MHz on CPC the theorical transfer rate is 166.400 bytes per second
You can run faster by unrolling LDI like this https://map.grauw.nl/articles/fast_loops.php
Unrolled 16x like example, you can reach 192.462 bytes per second and even 196.000 bytes per second unrolling 32x
Why 166,400? IMHO it should rather be approximately 166,667. For 65536 bytes we get 4000000/(65535*6+5)*65536/4 ≈ 166667.1 bytes per second.
BTW, some crazy people use code like the following to copy bytes:
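The formula quoted above, (BC-1)x6+5 NOPs with 1 NOP = 1 µs at 4 MHz, can be checked with a short sketch. The function names are only illustrative.

```python
def ldir_nops(bc):
    """NOPs taken by LDIR for a given BC, using the (BC-1)*6+5 rule.

    BC = 0 is treated as 65536 bytes, as stated in the thread.
    """
    count = bc if bc != 0 else 65536
    return (count - 1) * 6 + 5


def ldir_rate(bc):
    """Bytes per second for one LDIR, with 1 NOP = 1 microsecond."""
    count = bc if bc != 0 else 65536
    return count * 1_000_000 / ldir_nops(bc)


if __name__ == "__main__":
    print(ldir_nops(0), ldir_rate(0))
    print(ldir_nops(0x4000), ldir_rate(0x4000))
```

For large blocks the rate settles just above 1,000,000/6 bytes per second, because only the final iteration saves a NOP.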
pop hl
ld (addr),hl
pop hl
ld (addr+2),hl
...
This sequence is 15% faster than LDI/LDD. :o
EDIT.
Actually it is about 43% faster, not 15. I assume that the LD+POP sequence uses 3.5 NOPS per byte, while LDI uses 5 NOPS.
Quote from: andycadley on 10:37, 21 May 23
The CPC introduces wait states periodically that slow instructions down, this is why LDIR takes 20/24 cycles rather than 16/21. So a lazy way of estimating the speed of the CPU is just to pick a slightly slower speed and claim that it's roughly equivalent.
But you're already taking that into account by using the actual CPC instruction length, so you have to use the proper CPU clock speed or you're effectively double counting the slowdown.
So, LDI/LDD requires 5 NOPs (20 ticks), doesn't it? I thought it was only 4 NOPs. I am curious why LDI needs one more NOP while LD r,r doesn't. It is also very interesting why the 3.2 MHz approximation of the average CPU frequency is so accurate in many cases.
Quote from: litwr on 10:56, 21 May 23
So, LDI/LDD requires 5 NOPs (20 ticks), doesn't it? I thought it was only 4 NOPs. I am curious why LDI needs one more NOP while LD r,r doesn't. It is also very interesting why the 3.2 MHz approximation of the average CPU frequency is so accurate in many cases.
It's because of how the gate array arbitrates bus cycles. Essentially it holds WAIT such that the Z80 can only access memory on specific cycles, which has the effect of stretching the internal M-cycles of the CPU to a multiple of 4 T-states. So whether or not an instruction gets extended (and by how much) depends on how many times it accesses memory (and precisely when).
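The stretching described above can be played with in a toy model. This is only a sketch under a loud assumption of mine, not a description of the real gate array: every M-cycle that accesses memory is made to start on a multiple of 4 T-states, M-cycles without memory access are left alone, and the next opcode fetch is aligned as well. The M-cycle lengths are the ones from the standard Z80 timing tables; `cpc_nops` is a hypothetical helper name, and I/O instructions are outside the model.

```python
def align4(t):
    """Round t up to the next multiple of 4 T-states."""
    return -(-t // 4) * 4


def cpc_nops(m_cycles):
    """Estimate CPC NOPs from a list of (t_states, accesses_memory) M-cycles.

    Assumption: memory-access cycles may only start on a 4 T-state boundary.
    """
    t = 0
    for t_states, accesses_memory in m_cycles:
        if accesses_memory:
            t = align4(t)
        t += t_states
    return align4(t) // 4  # the following opcode fetch is aligned too


MEM, INTERNAL = True, False
LD_R_R = [(4, MEM)]
POP_HL = [(4, MEM), (3, MEM), (3, MEM)]
LD_NN_HL = [(4, MEM), (3, MEM), (3, MEM), (3, MEM), (3, MEM)]
LDI = [(4, MEM), (4, MEM), (3, MEM), (5, MEM)]
LDIR_REPEAT = LDI + [(5, INTERNAL)]

if __name__ == "__main__":
    for name, cycles in [("LD r,r", LD_R_R), ("POP HL", POP_HL),
                         ("LD (nn),HL", LD_NN_HL), ("LDI", LDI),
                         ("LDIR (repeating)", LDIR_REPEAT)]:
        print(name, cpc_nops(cycles))
```

Under this assumption the model reproduces the NOP counts mentioned in the thread for these instructions, which is all it is meant to show.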
3.2 is accurate for many cases because it's not just a number plucked out of the air, but based on real-world comparisons. People timed common routines on different machines, and then came up with rough estimates of how much slowdown each suffered based on the run time (using something like a routine running in uncontended memory on a Spectrum as a baseline). So it matches up because that's how it was derived in the first place.
Quote from: litwr on 10:47, 21 May 23
pop hl
ld (addr),hl
pop hl
ld (addr+2),hl
...
This sequence is 15% faster than LDI/LDD. :o
EDIT.
Actually it is about 43% faster, not 15. I assume that the LD+POP sequence uses 3.5 NOPS per byte, while LDI uses 5 NOPS.
That sequence is only useful for writing to fixed memory locations, while LDI/LDD lets you set the destination address. So they are not functionally equivalent.
Well, under FutureOS you can use the OS function 'F_MOVE' to copy blocks. It needs 5 us / byte because it uses a batch of either LDI or LDD instructions. AFAIK only custom code is quicker, stuff like: POP BC:LD (HL),C:INC L:LD (HL),B ... (4 us / byte).
"Effective frequency" is complete BS. The CPC's effective speed is perhaps close to 3.2 MHz, but its clock frequency is and has always been 4 MHz.
Thanks. It seems I chose the wrong section; the Programming section would have been more appropriate.
Yes, LDI/LDD is more general, but if you need the maximum transfer rate, the code mentioned above may be the right tool.
NOP, INC r ... run at the full 4 MHz
POP BC:LD (HL),C:INC L:LD (HL),B:INC L... (4.5 us / byte)
Quote from: MaV on 16:28, 21 May 23
"Effective frequency" is complete BS. The CPC's effective speed is perhaps close to 3.2 MHz, but its clock frequency is and has always been 4 MHz.
Yes, 4 MHz. And as long as you use for example only Z80 opcodes with no GA-induced delay, then the 'effective frequency' can be 4 MHz in real life. For example: LD A,B:EX DE,HL:ADD A,H ... and all that. :) All depends how one does the coding. :)
Quote from: McArti0 on 19:43, 21 May 23
NOP, INC r ... is in 4MHz
POP BC:LD (HL),C:INC L:LD (HL),B : INC L... (4,5 us / Byte)
Like I already wrote in my previous post, but be careful. Once in a while it needs an INC HL (one extra us!).
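To see how little the occasional INC HL costs, here is a sketch of the per-byte arithmetic for the POP BC sequence. The NOP values (POP BC = 3, LD (HL),r = 2, INC L = 1, INC HL = 2) are the usual CPC timings as far as I know, and the assumption that exactly one INC HL is needed per 256-byte page is mine; the function name is illustrative.

```python
# Assumed CPC timings in NOPs (1 NOP = 1 microsecond).
NOPS = {"POP BC": 3, "LD (HL),r": 2, "INC L": 1, "INC HL": 2}


def pop_store_nops_per_byte(page_bytes=256):
    """Average NOPs per byte for POP BC:LD (HL),C:INC L:LD (HL),B:INC L.

    Assumes one INC L per page is replaced by INC HL to carry into H.
    """
    pairs = page_bytes // 2
    per_pair = NOPS["POP BC"] + 2 * NOPS["LD (HL),r"] + 2 * NOPS["INC L"]
    total = pairs * per_pair + (NOPS["INC HL"] - NOPS["INC L"])
    return total / page_bytes


if __name__ == "__main__":
    print(pop_store_nops_per_byte())
```

So the sequence stays at roughly 4.5 us per byte; the page-crossing increment adds well under one percent.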
Quote from: litwr on 10:47, 21 May 23
Actually it is about 43% faster, not 15. I assume that the LD+POP sequence uses 3.5 NOPS per byte, while LDI uses 5 NOPS.
"POP HL" lasts 3 µsec and "LD (adr),HL" lasts 5 µsec.
4 µsec per byte, but with the drawback of an absolute pointer.
You can find additional information here, chapter 4.4.4 and chapter 25.
http://logonsystem.fr/down/ACCC1.5-EN.pdf ;)
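The difference between the 43% guess and the timings given here is easy to check. A minimal sketch, using only the numbers from this thread (POP HL = 3 µs, LD (adr),HL = 5 µs, LDI = 5 µs per byte); the function names are made up.

```python
LDI_NOPS_PER_BYTE = 5


def pop_ld_nops_per_byte(pop_nops=3, ld_nops=5):
    """POP HL plus LD (adr),HL moves two bytes, so halve the total."""
    return (pop_nops + ld_nops) / 2


def speedup_percent(slow_nops, fast_nops):
    """How much faster the second timing is than the first, in percent."""
    return (slow_nops / fast_nops - 1) * 100


if __name__ == "__main__":
    per_byte = pop_ld_nops_per_byte()
    print(per_byte, speedup_percent(LDI_NOPS_PER_BYTE, per_byte))
    # The earlier guess of 3.5 NOPs per byte is what gave the 43% figure.
    print(round(speedup_percent(LDI_NOPS_PER_BYTE, 3.5)))
```

With 4 µs per byte the gain over LDI is 25%, not 43%.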
Quote from: Longshot on 01:25, 23 May 23
"POP HL" lasts 3 µsec and "LD (adr),HL" lasts 5 µsec.
4 µsec per byte, but with the drawback of an absolute pointer.
You can find additional information here, chapter 4.4.4 and chapter 25.
http://logonsystem.fr/down/ACCC1.5-EN.pdf ;)
My gosh! I missed this great compendium. Thank you very much. I was sure that the CPC extends instruction timings very simply, rounding up to the next multiple of four whenever the number of ticks is not a multiple of four. So I was sure that LD (addr),HL takes 4 NOPs.
<rant on>
Not a fan of this factoid being spread around that the CPC's Z80 runs not at 4 MHz but at something like 3.2, 3.3 or 3.5.
It has even made its way onto well-known sites; for example, https://www.chibiakumas.com/z80/AmstradCPC.php has a well-presented table stating that the CPC has a 3.5 MHz Z80.
That's absolutely wrong, and the way it's presented in various places, you'd think it's the truth, as if it truly had that many cycles per second.
We know it's because the CPC hardware rounds up its instruction cycles, sure.
But it's the same as if I spread the rumour that an Amiga 500 has a 68000 running at 2.5 MHz instead of 7.16 MHz because of slow chip RAM, or that a 386DX at 40 MHz is really clocked at 27 MHz because of all kinds of bottlenecks. And that without even mentioning the reason, just stating as a fact that these platforms come underclocked like this. It's also arbitrary, so even that statement is as wrong as saying Doom is 2.5D (which means nothing to me).
</rant off>
Oops, others mentioned the same thing. I think sometimes I run on "AKCZUAHLY" energy :P
p.s. Isn't it nice (even if that cycle is lost) that this loss means we end up counting NOP cycles instead of actual ones? It helped me a lot to easily sketch instructions in my mind as taking 1 to 6 NOP cycles, and to be aware of what's fast and what's to be avoided if possible when optimizing.
Quote from: Optimus on 17:47, 23 May 23
p.s. Isn't it nice (even if lost that cycle) that this loss, ended up having to count NOP cycles instead of actual? Helped me a lot to easily sketch in my mind instructions from 1 to 6 NOP cycles,. be aware what's fast and what's to be avoided if possible, for optimizations.
That's true :D I don't begrudge the MSX guys for calculating their cycles in a more complicated way.
Best are the Enterprise 64/128 guys, which have "slow" (first 64K, shared with the video chip) and "fast" ram (>64K, Z80 runs at full speed) (like the Amiga).
Quote from: Prodatron on 18:29, 23 May 23
That's true :D I don't begrudge the MSX guys for calculating their cycles in a more complicated way.
Best are the Enterprise 64/128 guys, which have "slow" (first 64K, shared with the video chip) and "fast" ram (>64K, Z80 runs at full speed) (like the Amiga).
As does the Speccy, though understanding its contention behaviour (for the RAM that is contended) is absolute black magic. And different on different machines. And even different pages of memory on different 128s.
And it depends on how warm the machine is. :laugh: