CPC Z80 Commands and how long they take...

endangermice · 09:46, 28 October 16

Yes, you're quite right - thanks for pointing that out. Hmmm I guess my career as a CPU designer has been cut tragically short!

Docent · 18:19, 30 October 16

Quote from: endangermice on 23:00, 27 October 16
Sorry about the random post earlier - it seems that you can't delete posts on this forum - or I just can't work out how!

So I've been through DJNZ in your Java CPC Z80 class, and I believe the timings can be derived as follows:

DJNZ = 0
======
- 5 T-states. 4 T-states to fetch the opcode including 1 T-state to check wait, 1 T-state in the DJNZ method - for the decrement
- 3 T-states for Fetchbyte including 1 T-state for checking wait

DJNZ != 0
======
- 5 T-states. 4 T-states to fetch the opcode including 1 T-state to check wait, 1 T-state in the DJNZ method - for the decrement
- 3 T-states for Fetchbyte including 1 T-state for checking wait
- 5 T-states for the relative jump (jre)

Checking for wait doesn't take any tstates. Let me explain what's going on here:
A standard z80 instruction's execution consists of up to 3 machine cycles. A machine cycle takes up to 6 clock cycles (tstates) for M1 and 3-5 tstates for the other Mx cycles.
The first machine cycle (M1) fetches the opcode and can take up to 6 tstates:
t1 - pc to address bus,
t2 - reading memory addressed by address bus, pc is incremented.
t3 - transfer opcode read in t2 into instruction register, refresh memory
t4 - decode, execute, refresh memory
t5 (optional) - execute
After the opcode fetch there are optional memory read machine cycles - depending on the instruction size there can be 1-3 of them (M2, M3, M4).
An M2, M3 or M4 cycle usually takes 3 tstates:
t1 - pc to address bus,
t2 - reading memory addressed by address bus, pc is incremented.
t3 - transfer byte read in t2 into temporary internal register

The difference between the M1 and the M2, M3 machine cycles is the timing of the MREQ signal - in M1 it goes inactive at the beginning of t3, while in M2 and M3 it goes inactive during t3.
The important part is that checking for wait doesn't take any tstates - if an external device asserts wait on the cpu pin during a memory read/write or an i/o access, execution of the next tstate is held until the device stops signaling.
On the cpc, the gate array asserts wait on the cpu so that it can read the screen memory. As mentioned earlier in this thread, the GA generates waits during 3 of every 4 tstates.
The wait states affect every machine cycle of an instruction that requires memory access.
So, in the case of djnz, you get:

for b=0

clk	1234567890123456
Z80	12345123
GA	-___-___-___
CPC 	1222234512ww3

for b<>0 

clk	1234567890123456
Z80	1234512312345
GA	-___-___-___-___
CPC 	1222234512ww312ww345

To fit the well-known timings of the cpc, it seems that the GA wait states are applied differently, i.e. _-__ and not -___, so the real timing looks a bit different:

for b=0

clk	1234567890123456
Z80	12345123
GA	_-___-___-__
CPC 	1234512www3

for b<>0 

clk	1234567890123456
Z80	1234512312345
GA	_-___-___-___-__
CPC 	1234512www312345

Why are single machine cycle instructions (like di, xor a etc.) not affected by the wait states generated by the GA? Their execution consists of only the M1 cycle, where the address bus is used during the t1 and t2 states - if the GA wait state is applied starting from the t2 state, the cpu doesn't need to wait for memory in t3 and continues execution until the next memory access (the next opcode fetch or M2).

T	1234
GA	_-__
CPC 	1234
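
As a rough check of the same point, the sketch below counts the wait states inserted over a run of 4-tstate, M1-only instructions. It assumes the _-__ pattern (release on clock 1 of every 4, counting from 0) and that the first fetch starts aligned; `count_waits` and the instruction encoding are hypothetical names, not taken from any emulator.

```python
# Each instruction is a list of (t_states, uses_bus) machine cycles (illustrative).
DI = [(4, True)]
XOR_A = [(4, True)]
NOP = [(4, True)]


def count_waits(instructions, release_phase=1):
    """Return (wait states inserted, total clocks) for a run of instructions."""
    clk = 0
    waits = 0
    for machine_cycles in instructions:
        for t_states, uses_bus in machine_cycles:
            if uses_bus:
                t2_clk = clk + 1                          # WAIT is sampled in t2
                stall = (release_phase - t2_clk) % 4      # clocks until release
                waits += stall
                clk += stall
            clk += t_states
    return waits, clk


print(count_waits([DI, XOR_A, NOP]))  # (0, 12)
```

Every t2 lands on the release clock, so no wait states are added and each instruction stays at 4 clocks.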

Quote from: endangermice on 23:00, 27 October 16
It's interesting that the processor fetches the relative jump address before checking whether b is 0. You could potentially save 3 T-states by not doing this until the value of b has been determined. In this scenario, if b were 0, 3 less T-states would be required.

The timing for djnz suggests otherwise - b is checked before fetching the jump offset. And you would not save 3 tstates by skipping the offset fetch - if b=0, you would still need to increment pc, which would take at least 1 tstate.
I believe the approach you suggested could complicate the cpu pipeline design.

endangermice · 13:19, 01 November 16

Hi Docent,

Firstly, thank you for a very detailed explanation of how the cycle timing works - I think it will help me to understand the process a lot more. However I don't quite understand how your diagrams relate to your explanation.

From your explanation of MREQ for M1 and M2,M3,M4 cycles etc. I understand MREQ to work in the following way

M1
==
t1 = MREQ (Memory request)
t2 = MREQ (Memory request)
t3
t4

M2
==
t1 = MREQ (Memory request)
t2 = MREQ (Memory request)
t3 = MREQ (Memory request)

Am I right in assuming that when MREQ is active the CPU intends to read memory during those t-states? I.e. in the case of M1 - t1 and t2 are memory read cycles which could be affected by wait? In the case of M2 - t1, t2 and t3 are all t-states where the CPU expects to read memory and again they could be affected by wait?

If this is the case, I am confused by the CPC part of the diagrams you've drawn for me. If you refer to the b=0 diagram, I believe I understand the part for M1. When the wait is applied, it isn't responded to by the CPU until the t-state that follows. So if wait is applied in t2, the CPU won't respond until t3 and only if t3 is a memory read, which for M1 it is not. That explains why M1 in your diagram executes the 5 t-states without any wait being applied.

I am confused by the M2 section of the CPC part of the diagram. It appears that in the case of M2, t3 doesn't require memory access. If that is the case, why is there a wait inserted between t2 and t3 (2w3)? I can understand t2 running when wait is applied because the wait was applied during t2, but why is there a single wait before t3 (described as the transfer of the byte read in t2 into a temporary internal register) if t3 is not accessing memory? Should the CPC section be 123451www23 rather than 123451www2w3, and similarly for b!=0 should it be 123451www2312345 rather than 123451www2w312345, which would also coincide with the instruction taking 16 cycles rather than 17, bringing it in line with the 4us of execution time expected by the WinAPE timing tests?

I hope this makes sense - I just can't see how the CPU is responding to the Gate Array wait signal, it doesn't seem to be stopping at the right time. I think I may not be understanding your diagram properly so if you could explain a bit more I would be very grateful.

Thank you for all your help so far,

Damien

Docent · 20:35, 02 November 16

Quote from: endangermice on 13:19, 01 November 16
Hi Docent,

Firstly, thank you for a very detailed explanation of how the cycle timing works - I think it will help me to understand the process a lot more. However I don't quite understand how your diagrams relate to your explanation.

From your explanation of MREQ for M1 and M2,M3,M4 cycles etc. I understand MREQ to work in the following way

M1
==
t1 = MREQ (Memory request)
t2 = MREQ (Memory request)
t3
t4

M2
==
t1 = MREQ (Memory request)
t2 = MREQ (Memory request)
t3 = MREQ (Memory request)

It is generally ok - in M1 cycle MREQ goes off at the beginning of t3, while in M2, M3 it goes off in the middle of t3

Quote from: endangermice on 13:19, 01 November 16
Am I right in assuming that when MREQ is active the CPU intends to read memory during those t-states? I.e. in the case of M1 - t1 and t2 are memory read cycles which could be affected by wait? In the case of M2 - t1, t2 and t3 are all t-states where the CPU expects to read memory and again they could be affected by wait?

If this is the case, I am confused by the CPC part of the diagrams you've drawn for me. If you refer to the b=0 diagram. If MREQ is active for t1 and t2 of M1, why doesn't your diagram show a wait between t1 and t2 of M1? By the time M1 reaches the start of t2, wait is being held low by the Gate Array. t2 is a memory read so the CPU should be waiting, but your diagram shows it reading memory? Why is this?

Similarly, t1 for M2 should be a memory read, yet in your diagram, the CPU is not waiting for the read despite wait being held low by the gate array. Also, you show t2 and t3 for M2 executing during a period where the wait signal is also being held low. How is this happening and why is there a wait between t2 and t3?

I hope this makes sense - I just can't see how the CPU is responding to the Gate Array wait signal, it doesn't seem to be stopping at the right time. I think I may not be understanding your diagram properly so if you could explain a bit more I would be very grateful.

Everything will become clear if you take into consideration that:
a. the z80 does not test the wait signal during the t1 state
b. wait states are accounted for in the following cycles
So, when the GA signals the cpu to wait in the t2 state, the cpu detects it during t2 and holds execution of the next t3 state until wait is no longer active.
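
Rules a and b can be turned into a tiny Python sketch that counts how many wait states a single memory access picks up, depending on which clock its t1 falls on. The pattern string follows the _-__ notation used in this thread ('-' is the clock where wait is released); the function name and the 0-based clock numbering are assumptions made for the example.

```python
GA_WAIT = "_-__"  # '-' marks the clock (0-based, mod 4) where WAIT is released


def wait_states_for_access(t1_clk, pattern=GA_WAIT):
    """Count the wait states inserted before t3 of one memory access."""
    clk = t1_clk + 1  # rule a: nothing is tested in t1, so start looking at t2
    waits = 0
    # rule b: while WAIT is still asserted, the following t3 is held back
    while pattern[clk % len(pattern)] != "-":
        waits += 1
        clk += 1
    return waits


print([wait_states_for_access(t1) for t1 in range(4)])  # [0, 3, 2, 1]
```

An access whose t1 starts on clock 0 sees wait released in t2 and runs at full speed; any other starting clock is stretched until the next release.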

endangermice · 21:08, 02 November 16

Hi Docent,

Thank you for the further explanation, it really helps. I read back through this thread again and after reading ralferoo's post a few times it suddenly clicked and I realised that adding the wait states is actually very simple. Simply put as Richard (Executioner) has said many times, all that's happening is that memory / io accesses are aligned with the wait states imposed by the Gate Array. This has the effect of stretching the instruction lengths in some circumstances (if the instruction's execution involves the access of memory or io.)

I was struggling because my impression was that the Gate Array holds the 1st cycle of every 4 high for io / memory access. I think as ralferoo states, it's actually the 2nd cycle in every 4 that is held high. If you work on that assumption, DJNZ b<>0 becomes 16 cycles not 17 which fits the 4us instruction timing perfectly. The wait state timing is therefore w.ww w.ww w.ww w.ww w.ww (wait low, wait high, wait low, wait low).

I have begun to implement this in my Z80 emulation and most of the timing tests are now passing. The ones that aren't are likely due to coding errors on my part and it's also possible I still have a few issues with the interrupt timing.

If I can at least get the instruction lengths correct and the Gate Array wait states working then I only have to work on the interrupt timing and it should all come good!

I'm sure I will have some further updates and likely more questions.

Thanks for all the help!

Damien.

endangermice · 20:05, 03 November 16

So a bit of an update. After reworking my Z80 emulation, I now have most timings passing the test - hooray! However, I'm having trouble with the DD codes. Seeing as DD is a prefix instruction, two fetches have to be made which is where the 2us timing is derived for instructions such as nop (1us per fetch). Say we're looking at a prefixed nop so 0xDD followed by 0x00. If we go by ralferoo's description from the Z80 user's guide that a fetch can be represented as -*-- where cycle 1 is the memory read, this is only going to take the expected 2us when the first fetch is exactly aligned to the first cycle, cycle 0 in the 4. e.g.

-*-- -*--
w.ww w.ww
0123 0123

if the first fetch starts on say cycle 3 (due to the previous instructions ending before cycle 3), we get a situation like the following:

- _*-- -*--
w w.ww w.ww
3 0123 0123

The first fetch has been stretched by 1 cycle so now the whole instruction takes 9 cycles to complete, which rounds up to 3us. This is exactly what is happening when I run the DD codes through my emulator. This also means that for instructions with tight timings like this i.e. anything that's an exact multiple of 4 cycles, it's not possible to predict the instruction timing. If 0xDD was 3us I think this would fit no matter what the starting cycle.

I obviously have something wrong here, but I'm just trying to apply ralferoo's workings to this situation and in practice, during the PlusTest on my emulator, the DD codes seem to start execution on cycle 3. To get consistent readings the start cycle shouldn't matter so what's going on here?

Docent · 17:00, 04 November 16

Quote from: endangermice on 21:08, 02 November 16
Hi Docent,

Thank you for the further explanation, it really helps. I read back through this thread again and after reading ralferoo's post a few times it suddenly clicked and I realised that adding the wait states is actually very simple. Simply put as Richard (Executioner) has said many times, all that's happening is that memory / io accesses are aligned with the wait states imposed by the Gate Array. This has the effect of stretching the instruction lengths in some circumstances (if the instruction's execution involves the access of memory or io.)

I was struggling because my impression was that the Gate Array holds the 1st cycle of every 4 high for io / memory access. I think as ralferoo states, it's actually the 2nd cycle in every 4 that is held high. If you work on that assumption, DJNZ b<>0 becomes 16 cycles not 17 which fits the 4us instruction timing perfectly. The wait state timing is therefore w.ww w.ww w.ww w.ww w.ww (wait low, wait high, wait low, wait low).

Hi Damien,
Good catch! The timing example I described earlier was indeed based on the assumption that the GA sets wait states starting from the second cycle, but only the _-__ GA wait state cycle seems to work for all timings.
btw: I also made a mistake in applying wait states to the M2, M3 cycles - somehow I assumed wait states are checked from the t1 state and not from t2 like they are in reality, even though I wrote exactly that later.

Anyway, it seems that you got this working.

Docent · 18:35, 04 November 16

Quote from: endangermice on 20:05, 03 November 16
So a bit of an update. After reworking my Z80 emulation, I now have most timings passing the test - hooray! However, I'm having trouble with the DD codes. Seeing as DD is a prefix instruction, two fetches have to be made which is where the 2us timing is derived for instructions such as nop (1us per fetch). Say we're looking at a prefixed nop so 0xDD followed by 0x00. If we go by ralferoo's description from the Z80 user's guide that a fetch can be represented as -*-- where cycle 1 is the memory read, this is only going to take the expected 2us when the first fetch is exactly aligned to the first cycle, cycle 0 in the 4. e.g.

-*-- -*--
w.ww w.ww
0123 0123

if the first fetch starts on say cycle 3 (due to the previous instructions ending before cycle 3), we get a situation like the following:

- _*-- -*--
w w.ww w.ww
3 0123 0123

The first fetch has been stretched by 1 cycle so now the whole instruction takes 9 cycles to complete, which rounds up to 3us. This is exactly what is happening when I run the DD codes through my emulator. This also means that for instructions with tight timings like this i.e. anything that's an exact multiple of 4 cycles, it's not possible to predict the instruction timing. If 0xDD was 3us I think this would fit no matter what the starting cycle.

I obviously have something wrong here, but I'm just trying to apply ralferoo's workings to this situation and in practice, during the PlusTest on my emulator, the DD codes seem to start execution on cycle 3. To get consistent readings the start cycle shouldn't matter so what's going on here?

You need to account for the remaining t states from the last machine cycle of the previous instruction. Let's see what happens in such a case:

clk	1234567890123456
GA	_-___-___-___-__
T	12341234
T+1	 12www341234
T+2	  12ww341234
T+3	   12w341234

T is the time to execute an instruction like DD00 (8 tstates) when the total number of tstates taken by the previous instruction is 4-aligned, i.e. 4, 8 etc. tstates (generally tstates mod 4 = 0).
When there is a remainder from the previous instruction's timing, the first opcode fetch is stretched by wait states in such a way that it aligns correctly to the beginning of the next machine cycle.
Please notice that no matter how many unused tstates are left over from the previous instruction, execution of the following instruction starts correctly aligned to the GA wait states.
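
The T, T+1, T+2 and T+3 rows can be checked with a short Python sketch. It treats DD 00 as two 4-tstate opcode fetches, starts the first fetch 0 to 3 clocks late, and reports the wait states added plus the clock on which the instruction finishes. Clocks are counted from 0 and the _-__ pattern is assumed (release on clock 1 of every 4); `run` and `DD_NOP` are names invented for the example.

```python
DD_NOP = [(4, True), (4, True)]  # two opcode fetches: the DD prefix, then 00


def run(machine_cycles, start_clk, release_phase=1):
    """Return (wait states inserted, clock after the last T-state)."""
    clk = start_clk
    waits = 0
    for t_states, uses_bus in machine_cycles:
        if uses_bus:
            # t2 is on clock clk + 1; stall until that lines up with the release
            stall = (release_phase - (clk + 1)) % 4
            waits += stall
            clk += stall
        clk += t_states
    return waits, clk


for leftover in range(4):
    print(leftover, run(DD_NOP, start_clk=leftover))
# 0 (0, 8)
# 1 (3, 12)
# 2 (2, 12)
# 3 (1, 12)
```

In every case the finishing clock is a multiple of 4, so the next instruction starts aligned to the GA wait states.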

endangermice · 19:35, 04 November 16

Hey, thanks for the replies - I think we're getting closer, a lot of very useful info. The only problem I have with

clk   1234567890123456
GA   _-___-___-___-__
T   12341234
T+1    12www341234
T+2     12ww341234
T+3      12w341234

Is that in all but T, the instruction is taking more than 8 clock cycles to complete (T + 1 = 11 cycles, T + 2 = 10 cycles, T + 3 = 9 cycles). Therefore, for T+1, T+2 and T+3, the time of execution is going to round up to 3us. When I run the DD code tests through Richard's utility, it always expects DD:00 to be 2us in length. I can't see how this can be possible unless DD is perfectly aligned each time which with previous instructions of different lengths is in my mind far from certain. This is why I'm sure there's something else happening here that I'm not appreciating.
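
The rounding described here is just a ceiling division by 4, since the CPC's Z80 runs at 4 MHz (4 clocks per microsecond). A minimal sketch, with a made-up function name and the cycle counts taken from the diagram above:

```python
def microseconds(cycles, clocks_per_us=4):
    """Round a clock count up to whole microseconds (ceiling division)."""
    return -(-cycles // clocks_per_us)


# Clock counts read off the T, T+1, T+2 and T+3 rows of the diagram.
for label, cycles in [("T", 8), ("T+1", 11), ("T+2", 10), ("T+3", 9)]:
    print(label, cycles, microseconds(cycles))
# T 8 2
# T+1 11 3
# T+2 10 3
# T+3 9 3
```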

It may be that this is expected behaviour and my problem lies elsewhere e.g. timing with the interrupts. It would be useful to know whether DD:00 for example should have a different time of execution depending on the starting cycle. It is interesting that it is the DD, FD and FD CB tests that are mostly failing. All of the normal instructions, the ED and CB prefixes are passing so I feel that I'm very close. I'm not passing many of the interrupt timing instruction tests which suggests my interrupt timing is out (I've been trying to fix it but need to study the mechanism in more detail). This might be having a knock-on effect for the DD tests.

This is all because I want to make the Z80 cycle exact. I could probably get away with us timing, but I really want to get the cycle level emulation right!

Executioner · 23:20, 15 February 17

Sorry I haven't been around for quite a while. Busy moving and other stuff. It really doesn't matter which cycle the WAIT signal is released (0, 1, 2 or 3) as the Z80 will adjust itself to be aligned so that T3 on each fetch, mem and IO request is aligned to the point where /WAIT goes high, so it actually doesn't matter if you use -___, _-__, __-_ or ___-.
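
To illustrate that the choice of pattern doesn't matter, here is a small Python sketch that times DJNZ under each of the four patterns. It assumes the CPU is already aligned when the instruction starts (its first T2 lands on the release clock) and counts up to the next aligned start. The function name and the (t_states, uses_bus) encoding are hypothetical, and the model only applies the "hold T3 until /WAIT goes high" rule described in this thread.

```python
DJNZ_B_ZERO = [(5, True), (3, True)]
DJNZ_B_NONZERO = [(5, True), (3, True), (5, False)]


def elapsed(machine_cycles, pattern):
    """Clocks from an aligned start until the next aligned instruction start."""
    release = pattern.index("-")   # clock (mod 4) where /WAIT goes high
    start = (release - 1) % 4      # aligned: the first T2 hits the release clock
    clk = start
    for t_states, uses_bus in machine_cycles:
        if uses_bus:
            clk += (release - (clk + 1)) % 4  # wait states before T3
        clk += t_states
    clk += (start - clk) % 4       # the next fetch is stretched up to alignment
    return clk - start


for pattern in ("-___", "_-__", "__-_", "___-"):
    print(pattern, elapsed(DJNZ_B_ZERO, pattern), elapsed(DJNZ_B_NONZERO, pattern))
# -___ 12 16
# _-__ 12 16
# __-_ 12 16
# ___- 12 16
```

All four patterns give 12 clocks (3us) for b=0 and 16 clocks (4us) for b<>0, because the model only depends on the distance between the current clock and the release clock.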

The current (and it hasn't changed in at least 12 months) JEMU code passes all the timing tests, but doesn't currently have the SCF/CCF flag test implemented properly. WinAPE doesn't use the same T-State level emulation and instead has a flag to determine which T-State the previous instruction would have ended on which affects the timing for Interrupt requests (they have a 2 T-State time window before a fetch).

The unreleased WinAPE 2.0B3 passes all the tests, including ZEXALL and SCF/CCF tests. I'll try and release the current 2.0B3 version of WinAPE this next week or so as I need to get on with implementing new features that I don't have time to put in this release.

roudoudou · 23:57, 15 February 17

can't wait to read the whatsnew.txt file

m_dr_m · 16:52, 11 January 21

Regarding the CPC timing of Z80 instructions, all the documents linked here (and all others I've seen) contain errors.
I've made a doc (as an Orgams source Z80.O) free of those errors. You can download it here.

It also comes with Z80tests.O, which lets you visually check the timing of any instruction and demonstrates some undocumented flag behaviours.

GUNHED · 22:38, 12 January 21

Sorry, link doesn't work

EDIT: Works again! Must have accessed at the wrong moment in time twice.

m_dr_m · 22:48, 12 January 21

Hum?!? Works for me! What do you get?

You can try this instead: RC2 DSK

Targhan · 23:27, 12 January 21

The link Gunhed complains about indeed doesn't work: it points to the late push'n'pop website.

m_dr_m · 00:26, 13 January 21

Thanks for the feedback! I don't understand, though.

http://orgams.wikidot.com/working

pelrun · 03:50, 13 January 21

The forum is mixing up links again; for me it's currently pointing to a linkedin post.

@Gryzor, your assistance is needed

m_dr_m · 04:58, 13 January 21

Maybe the link in my signature works, and then go to working section.

Anyway, this link should already be in the bookmarks and the hearts of everyone!

Gryzor · 08:45, 13 January 21

...now it appears to be fixed?

pelrun · 11:31, 13 January 21

It's pointing to www.z88dk.org for me now

Gryzor · 11:35, 13 January 21

What is pointing there though?

pelrun · 11:47, 13 January 21

Sorry, it's the "here" link on https://www.cpcwiki.eu/forum/programming/cpc-z80-commands-and-how-long-they-take/msg196635/#msg196635.

Which periodically changes its address, not with every page load but maybe when a new comment forces the cache to be updated? And just now it was pointing to the right place again - maybe it only screws around when it knows you're not looking at it.

Gryzor · 11:58, 13 January 21

Oh damn it seems I forgot the Heisenberg option set to "on"!

Ok, let me force a cache cleaning...

GUNHED · 12:41, 13 January 21

Quote from: m_dr_m on 04:58, 13 January 21
Maybe the link in my signature works, and then go to working section.
Anyway, this link should already be in the bookmarks and the hearts of everyone!

Indeed and indeed. My problem was probably a mixing up of links; since your post didn't show the link in detail I couldn't see where it was intended to go. It was a forum problem indeed. Please keep your signature though.

m_dr_m · 16:12, 14 January 21

Thanks Gryzor for fixing this!

Thanks Toms for this superb layout.
