CPC Z80 Commands and how long they take...

Started by Bryce, 10:50, 07 January 15


Optimus

Dirty code is for dirty minds :)

TFM

Quote from: Bryce on 23:32, 08 January 15
Good code is neat, very efficient and well documented. My code manages 1 out of those 3 :D

Bryce.

Ok, I like this discussion! :) 

I understand efficient! And I totally agree: that is for me 100% "good" code.[nb]Personally the only thing which is important for me in coding is efficiency. Stability is a prerequisite. Anything else ... I leave it to the teachers.  :) [/nb]

Documentation, strictly speaking, is not part of the code, because it's not part of the compiled program itself. But it's good manners to document well. However, being able to write efficient code usually lowers the amount of documentation needed, because the better you get, the more the code becomes self-explaining.  ;)

Now what is "neat"?

TFM of FutureSoft
Also visit the CPC and Plus users favorite OS: FutureOS - The Revolution on CPC6128 and 6128Plus

rpalmer

Bryce,

I have PDFs about the Zilog chips: CPU, DMA, PIO, etc.

Attached is the Z80 CPU timings PDF. The document shows the number of cycles, which are the clock cycles, while the number of T-states reflects the internal mechanism of execution.

The CPC affects these timings in that the VGA chip causes the CPU to WAIT periodically. I understand this to be due to the 6845 chip needing memory access and the management of the RST instructions.

rpalmer

Bryce

Thanks. Now I have all the documents I need, I just need to find time to use them :)

@TFM: We learnt that writing good code is more than just the program that runs when it's finished:

Efficient: Speaks for itself as you said.
Neat: Means using proper indentation for labels, commands, loops etc so that the structure is easy to read.
Documentation: If documentation is done well, you should be able to hand code over to someone else and they can continue where you left off without having to decipher the code for hours.

Many very good programmers were lousy at documenting, so although the code is efficient, it takes hours to work out exactly how it works and making changes is a really hard job.

Bryce.

TFM

You are of course right in your last sentence, for the reasons I suggested before. However, IMHO the final result is all that counts.
Ok, for teaching purposes it may be interesting to show all that structured crap. But for the best solution, all the stuff you learn will not help you that much.
An easy-to-read structure is usually not the quickest.


About code reuse: Of course it makes sense from a business POV. But not that much from an efficiency POV.


I remember Odiesoft's sources. Like a lot of people, he used MAXAM in ROM (I still prefer it that way), and his source was a collection of lines with dozens of Z80 instructions on every line. And of course no comments at all. He even complained about it himself once.  :laugh:
TFM of FutureSoft
Also visit the CPC and Plus users favorite OS: FutureOS - The Revolution on CPC6128 and 6128Plus

ralferoo

I guess you already have an answer, but this is the reference I always used for CPU timings: Unofficial Amstrad WWW Resource

Actually, all the information you need is usually in the Z80 manual anyway... Taking a random example: ADD IX,pp: T-states: 15 (4, 4, 4, 3)

Each of those logical groups fits neatly into the CPC's usual 4-cycle cadence; the last one gets stretched, giving 4, 4, 4, 4. So, 4us.

If you look at, e.g. INC ss: T-states: 6, this gets stretched to 8, so 2us.
Another: RLC (IX+d): 4,4,3,5,4,3. The 3, 5 and 3 all get stretched, so 4,4,4,8,4,4, so 7us.
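
For a rough sanity check, this rule is small enough to put in a few lines of Python. This is only a sketch of the simple rounding described above (each machine-cycle group from the Zilog manual rounded up to the next multiple of 4 T-states); the helper name is mine, and the exceptions covered next are deliberately ignored:

def rough_cpc_us(m_cycles):
    """Rough CPC estimate: round each machine cycle up to a multiple of 4 T-states.
    m_cycles is the list of T-state counts per machine cycle from the Zilog manual."""
    stretched = sum(-(-t // 4) * 4 for t in m_cycles)  # ceil to a multiple of 4
    return stretched // 4                              # 4 T-states = 1us on the CPC

print(rough_cpc_us([4, 4, 4, 3]))        # ADD IX,pp  -> 4 us
print(rough_cpc_us([6]))                 # INC ss     -> 2 us
print(rough_cpc_us([4, 4, 3, 5, 4, 3]))  # RLC (IX+d) -> 7 us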

One of the few exceptions to this rule is e.g. JR cc,e. This is 4,3,5 (taken), 4,3 (not taken). In this case, the 3 and the 5 of the taken path aren't stretched when you'd expect. Why not? Well, this reveals what is actually happening and why things get stretched in the first case...

Every time the Z80 needs to do a memory access, it is delayed using the WAIT signal (which is asserted on 3 of the 4 cycles). In the usual case where one of the M-states gets extended it's because the next M cycle is a memory access (M cycle actually means memory cycle!). In the case of JR, the 3rd M cycle isn't actually a memory access, but a signed addition to adjust the program counter. As it's not accessing memory, the previous M state isn't stretched after all.

So, it's more correct to not think of it stretching the M cycle, but instead not starting the next one that requires a memory access until the 4th cycle.
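
As a sketch of that refined rule (my own, not from any document): only delay the start of a machine cycle that actually needs a memory access until the next 4 T-state boundary, let internal machine cycles run unstretched, and let the following instruction's fetch realign at the end. I/O is treated like a plain memory access here, which, as it turns out further down the thread, is not quite the whole story:

def cpc_us(m_cycles):
    """m_cycles: list of (t_states, accesses_memory) per machine cycle,
    assuming the instruction starts on a 4 T-state boundary."""
    t = 0
    for length, accesses_memory in m_cycles:
        if accesses_memory:
            t = -(-t // 4) * 4         # don't start it until the free cycle
        t += length
    return (-(-t // 4) * 4) // 4       # the next opcode fetch realigns anyway

print(cpc_us([(4, True), (3, True), (5, False)]))             # JR cc,e taken     -> 3 us
print(cpc_us([(4, True), (3, True)]))                         # JR cc,e not taken -> 2 us
print(cpc_us([(4, True), (4, True), (4, True), (3, False)]))  # ADD IX,pp         -> 4 us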

If you think of it like this, it also explains the weird exception that happens for interrupt handling. Normally, responding to an interrupt adds 1us on the CPC. That's because it actually just adds 2 T-states for the interrupt acknowledge before the next instruction fetch. However, in cases where the last M cycle takes 6 T-states, the interrupt acknowledge doesn't delay the instruction prefetch, and so the usual 1us delay doesn't occur.

"Simples!"  8)

opqa

#31
Not so simple, I'm afraid. What happens with LD (nn),HL or OUT (c),r, for instance?...

I still haven't found a 100% satisfactory explanation of how the GA 'drives' the Z80 without interfering with its normal behaviour. The WAIT pin alone can't be the only mechanism, for instance. This signal increases the duration of the read/write and input/output cycles, but it doesn't prevent the Z80 from driving the address bus, nor the data bus (if it is a write/output). So either the GA is multiplexing both buses to isolate them from the Z80 when it needs them, or it is also issuing BUSRQ signals together with the WAITs at the appropriate moments. Does anybody know what's really going on?

Overflow

#32
Very interesting reading.
I didn't know about the exception when interrupting some instructions,
and I had never gone so deep into M-cycles and T-cycles to try to understand them.
Tonight I had to find a detailed doc about the Z80 to check your numbers.
Here are the last numbers, so that other guys can get on board:
LD (nn),HL is (4, 3, 3, 3) -> so why 5 nop-cycles?
LD (nn),HL is (4, 3, 3, 3, 3) -> it's fine
OUT (C),r is (4, 4, 4) -> so why 4 nop-cycles?

EDIT: fixed LD (nn),HL thanks to opqa, a missing 3;
some of the messages below refer to this error (strikethrough)
Unregistered from CPCwiki forum.

opqa

Mmm, sorry, it seems that I made a mistake with LD (nn),HL. It is really (4,3,3,3,3), so it fits the pattern. It seems that OUT (c),r is the only "anomaly". I have my own theory about the reason, but I'm not sure it's the right one. Still thinking about it...

rpalmer

The M-cycles are the clocks fed to the CPU, while the T-states are simply key points in the Z80 CPU micro-code that executes an instruction.

So for LD (nn),HL the micro-code would be something like:

a. Read the Z80 instruction from memory.
b. Read the address to save HL to (i.e. nn).
c. Write the address (nn) to the address bus and write a byte of data.
d. Increment the address nn, write the new address to the bus and write the next byte of data.

and for OUT (C),r would be something like:

a. Read the Z80 instruction from memory.
b. Write the I/O address in BC to the address bus
c. Write the data onto the data lines and execute the I/O command sequences.

So you can see that LD (nn),HL has 4 main execution steps while OUT (c),r has 3.

The WAIT line can only affect the CPU when the address/data bus does not yet have valid data; otherwise a bus conflict would result and the CPC would likely not function properly. The state of the M1 line is also used to know when the CPU can be paused.

The GA does use the WAIT line to "PAUSE" the CPU to allow other functions, like the 6845 and RST instructions, to work, and it makes use of the M1 line to ensure that there are no conflicts.

With this information you can see that the Z80 CPU can be "PAUSED" on ANY instruction by the GA, and it is this that makes it difficult to get timings for CPU execution, and why the Z80 DMA is almost impossible to get working on the classic CPC.

rpalmer

rpalmer

I also forgot that the IORQ and MREQ lines also help determine when the GA can PAUSE the Z80 CPU.

Attached is the Classic CPC 6128 schematic to show which lines the GA is using.

Executioner

Quote from: opqa on 19:32, 10 January 15
I still haven't found a 100% satisfactory explanation of how the GA 'drives' the Z80 without interfering with its normal behaviour. The WAIT pin alone can't be the only mechanism, for instance.

Actually, the WAIT signal alone is what stretches the instructions. The /WAIT signal is held low for 3 out of every 4 clock cycles. The Z80 ignores the WAIT signal for most of its internal processing, but checks it at the point where it needs to do a memory or IO read or write. My latest Z80 core implementation for JEMU doesn't have any CPC-specific timing; the CPC simply holds the WAIT line low for 3 out of every 4 cycles, and the emulation passes all timing tests, including interrupt timing tests (without any special interrupt timing handling like what was there before).

opqa

#37
@rpalmer
I've found this document, which specifies what you're talking about; it breaks down what the Z80 is doing on each machine cycle.
http://z80.info/z80ins.txt

Forget about LD (nn),HL; it was a mistake to think it was an anomaly. It isn't. It's actually 5 machine cycles that get stretched to 5 nops as usual:
1) Fetch opcode (4 t-states)
2) Fetch parameter (high byte) (3 t-states)
3) Fetch parameter (low byte) (3 t-states)
4) Write memory (high byte in H register) (3 t-states)
5) Write memory (low byte in L register) (3 t-states)

OUT (C),r, on the contrary, is made of three 4 t-state machine cycles, but it gets stretched to 4 nops:
1) Fetch prefix (it's an extended instruction) (4 t-states)
2) Fetch opcode (4 t-states)
3) Port Write (4 t-states)

The mechanism that explains this anomaly is much simpler than what you think. As Executioner points out, the GA always asserts the WAIT pin 3 cycles out of every 4, and I believe that they are always the same 3 cycles. To understand why OUT (C),r doesn't fit in just 3 nops we must look a little bit deeper, into the t-state level.
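
To put a number on the anomaly: if you only round whole machine cycles up to the 4 t-state grid, LD (nn),HL already comes out right, but OUT (C),r comes out one NOP short. A minimal Python check (the rounding helper is mine; the machine-cycle lists are the ones above):

ceil4 = lambda t: -(-t // 4) * 4        # round T-states up to a multiple of 4

ld_nn_hl = [4, 3, 3, 3, 3]   # fetch, two parameter fetches, two memory writes
out_c_r  = [4, 4, 4]         # prefix fetch, opcode fetch, port write

print(sum(map(ceil4, ld_nn_hl)) // 4)   # 5 nops - matches the real machine
print(sum(map(ceil4, out_c_r)) // 4)    # 3 nops - but the real machine takes 4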

@Executioner
Yes, you're right about the use of the WAIT pin. I think I now understand what's going on. But in my comment I was talking about something else. The GA needs to access memory 2 out of every 4 cycles, and while doing so the Z80 might be in a wait state. This doesn't mean it is inactive; while in this state it's still trying to drive the address bus, and if it is a write/output operation that is being held up, it will also be driving the data bus with the output value.

What I'm saying is that there must be a parallel mechanism to isolate both buses from the memory when the GA is accessing them. And looking at the 6128 schematic, it seems that there is. As I suspected, the address bus is multiplexed with the CRTC, and the GA controls those multiplexers, so it decides when the memory sees the Z80 address and when it sees the CRTC address through them. Also, the data bus is latched/buffered between the Z80 and the memory/GA, and again it is the GA that controls those, so it decides when to isolate the Z80 bus from the memory for its own operations.

I will explain in more detail how I think it works in a later post. But to anticipate: I think that the WAIT pin is held active 3 out of every 4 cycles, but the Z80 is allowed to read/write memory during 2 out of those 4 cycles.

Executioner

On every microsecond (nop) the Gate Array has to do either two reads from memory (2 bytes for 16 bits of screen data) or a single 16-bit read; I don't know which, but I suspect it only reads 8 bits at a time. The /WAIT signal is held low for 3 out of 4 cycles, and this causes each memory or IO read/write to be aligned to occur just after WAIT goes high. The document you refer to (z80ins.txt) doesn't actually show exactly what happens during each of those cycles, but the official Zilog Z80 user manual does, in the TIMING section starting on page 11. It shows exactly when during each cycle the Z80 tests the /WAIT signal (usually T2 of each cycle, or TW in the case of I/O). The stretching of each cycle is actually more of an alignment of T2/TW with the single one of the 4 clock cycles where /WAIT is high.
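
Seen that way, the whole "stretch" reduces to one small calculation: how many clock cycles the WAIT-sampling T-state has to wait before it lands on the one cycle in four where /WAIT is high. A tiny sketch (the names, and the choice of which of the 4 cycles is the free one, are mine, purely for illustration):

FREE_SLOT = 1  # the one cycle (mod 4) where /WAIT is high; the actual phase is unknown here

def stall(check_cycle):
    """Wait states inserted when the /WAIT-sampling T-state (T2, or TW for I/O)
    would otherwise fall on clock cycle check_cycle (mod 4)."""
    return (FREE_SLOT - check_cycle) % 4

print(stall(FREE_SLOT))      # lands on the free cycle -> 0 extra T-states
print(stall(FREE_SLOT + 1))  # one cycle too late      -> 3 extra T-states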

opqa

#39
I know that, but this isn't incompatible with what I'm saying. I'll explain my theory now. First, I'm going to establish a terminology.

Based on the documents available about the Z80 (including the Zilog manual itself), such as this one:
http://www.piclist.com/techref/mem/dram/slide4.htm
we know that the timings for opcode fetches, memory accesses and I/O operations are different. Let's show how:

For opcode fetches:

The complete M-cycle lasts 4 t-states (4 clocks), but the actual opcode fetch takes place during the first 2, which I'll call T1 and T2. The other two are used to refresh memory in machines that use this feature of the Z80 (the CPC doesn't); I'll call them R1 and R2.
As you say, the WAIT line is sampled only during T2, and if active, the Z80 adds a wait state after it. The actual opcode fetch (the memory access itself) only takes place at the end of the last wait state, or at the end of T2 if there is none. Note that even after some inserted wait states, R1 and R2 must still take place for the machine cycle to complete.

So the cycle without wait states would be:


T-State T1 T2 R1 R2
Read        |


And with two inserted wait states (for instance):


T-State T1 T2 Tw Tw R1 R2
Wait        |  |         
Read              |


For memory accesses:

The complete M-cycle of a memory access lasts only 3 t-states: T1, T2 and T3.
But this is only because there are no RAM refresh states; in a sense it is actually a slower memory access. As before, the WAIT line is sampled at T2, and if active it adds a wait state just after it. More than one wait state can be added, but the actual memory read is not performed until the end of T3.

For write operations, where the actual memory write takes place depends upon the hardware; it can happen as early as T2, because the address, the data bus and all the signals are set and stable from the middle of T1, and they are kept until about the middle of T3. Probably the CPC samples the bus during T2.

So the cycle with no wait states would be:


T-State T1 T2 T3
Read           |
Write      |?


And with inserted wait states:


T-State T1 T2 Tw Tw T3
Wait        |  |   
Read                 |
Write      |?


For I/O operations:

Last but not least, the port I/O operations are the slowest of them all. They last 4 full t-states, without RAM refreshing. The documents available talk about one "automatically inserted" wait state after T2, and so they call these states T1, T2, Tw, T3. Let's follow this terminology.

Now the WAIT line is sampled during Tw (it might also be sampled during T2, but that would have no effect, as the processor is going to insert Tw anyway).

As before, reads are sampled at the end of T3, and writes will be sampled whenever the hardware is ready; signals are present from the middle of T1 onward.

So without inserted "extra" wait states we have:


T-state T1 T2 Tw T3
Read              |
Write      |? |?


And with a couple of extra wait states:


T-state T1 T2 Tw Tw Tw T3
Wait           |  |     
Read                    |
Write      |? |?


Well, that's all about how the Z80 works. In a second post I will explain how this "engages" with the WAIT states inserted by the GA, which will explain the "irregular" timing of OUT (c),r.
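
Summarising the three cases as data (a small Python sketch; the type and the names are mine, the numbers are just the ones described above, counting the automatically inserted Tw as the third t-state of an I/O cycle):

from collections import namedtuple

# total T-states without GA wait states, and the 1-based T-state in which
# the WAIT line is sampled
MCycle = namedtuple("MCycle", "t_states wait_check")

OPCODE_FETCH = MCycle(4, 2)   # T1 T2 R1 R2, WAIT sampled in T2
MEM_ACCESS   = MCycle(3, 2)   # T1 T2 T3,    WAIT sampled in T2
IO_ACCESS    = MCycle(4, 3)   # T1 T2 Tw T3, WAIT sampled in the automatic Tw

(Purely internal machine cycles never sample WAIT at all, which is what made JR cc,e come out shorter than expected earlier in the thread.)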

opqa

#40
Well now, the second part: how I think all the above "engages" with the GA. In this part some of the things I'm going to say are just hypotheses and assumptions.

The GA holds the WAIT signal 3 out of every 4 cycles. I'm going to call the only state where the WAIT signal is inactive G1, and the other three G2, G3 and G4.

I strongly suspect that the GA accesses memory during G3 and G4, and that it isolates the Z80 from the memory during G2, G3 and G4. At least partially, but it might be the case that complete isolation is not performed until G3.

The key point in the schematic of the CPC6128 is that the DATA bus is latched in the direction memory -> Z80, and that this latch is controlled by the same READY signal that is connected to the WAIT pin of the Z80. So if a read is performed in G2, it's going to read the value that was latched in G1; the actual value being held by the memory during that cycle might be different.

Anyway, none of the above affects timing. With my notation, the basic pattern would be the following:


GA-state  G1 G2 G3 G4
Wait          |  |  |


Let's see how this engages naturally with an opcode fetch for instance.


GA-state  G1 G2 G3 G4
T-state   T2 R1 R2 T1
Wait          |  |  |
Read       |


Note that none of the above WAIT signals has any real effect on the Z80: none is produced during T2, and any other is simply ignored. Also note that this is the natural synchronization scheme, and if we try any other we'll end up with this one. For instance, let's suppose we begin with T1 in G1:


GA-state  G1 G2 G3 G4 G1 G2 G3 G4
T-state   T1 T2 Tw Tw Tw R1 R2 T1
Wait          |  |  |     |  |  |
Read                   |


If we start with T1 in G3:



GA-state  G3 G4 G1 G2 G3 G4
T-state   T1 T2 Tw R1 R2 T1
Wait       |  |     |  |  |
Read             |


And so on...

In all cases the memory access takes place during G1. But let's see what happens when we have an opcode fetch followed by a memory read operation (a typical case):


GA-state  G4 G1 G2 G3 G4  G1 G2 ...
T-state   T1 T2 R1 R2 T1  T2 T3 ...
Wait       |     |  |  |      |
Read          |               |


Now the memory read operation is performed during G2! That's why I said before that the Z80 is "allowed" to access memory during 2 out of every 4 cycles. It can be either G1 or G2, depending on the current and previous operation.

But to be honest, and as I said before, what the Z80 is really sampling during this clock is the value latched from the bus in the previous cycle, during G1. So we can really consider that the real memory read always takes place at G1.

In the third and last part, why OUT (c),r doesn't fit in just 3 nops...

arnoldemu

As stated in SOFT968, both memory and I/O have WAIT applied, so this stretches out the operation.
I believe OUT (c) has a single T-state where WAIT can be applied for slow devices; I think this is where the delay happens.

The PCW docs say I/O doesn't have WAIT, so on the PCW it should not take as long.

My games. My Games
My website with coding examples: Unofficial Amstrad WWW Resource

opqa

#42
@arnoldemu

If you mean that the CPC hardware is inserting additional wait states (apart from the ones from the GA), I'm almost sure that this is not the case. They aren't needed at all to explain the timings. Take a moment to read my posts and maybe you'll agree with me.


So let's analyse what happens with OUT (c),r (or with IN r,(c)). This instruction consists of two opcode fetches followed by an I/O port write. As stated before, the I/O port write takes 4 t-states: T1, T2, Tw (automatically inserted), and T3.

So the timings, starting from the second opcode fetch are:


GA-state G4 G1 G2 G3 G4 G1 G2 G3 G4 G1 ...
T-state  T1 T2 R1 R2 T1 T2 Tw Tw Tw T3 ...
Wait      |     |  |  |     |  |  |



Here we have a first wait state after T2 that is introduced automatically by the Z80, not by the CPC hardware, and second and third ones that are inserted by the GA because of its fixed timing. So this is where the extra NOP comes from.

Now let's analyse why this doesn't happen with OUT (n),A. This operation consists of an opcode fetch, a parameter fetch, and an I/O port write.

Parameter fetches have the same timing as regular memory accesses, so just 3 t-states: T1, T2 and T3. The complete sequence, including all machine cycles, is:


GA-state G4 G1 G2 G3 G4 G1 G2 G3 G4 G1 G2 ...
T-state  T1 T2 R1 R2 T1 T2 T3 T1 T2 Tw T3 ...
Wait      |     |  |  |     |  |  |     |


So this instruction fits within just 3 NOPs, because the shorter timing of the parameter fetch gives the I/O operation enough space to complete within the next NOP.

And that's all. What do you think?
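
For anyone who wants to check the arithmetic, here is a small Python sketch of exactly this model (my own cross-check, not from any document: the free cycle's phase is an arbitrary choice, picked so that an aligned opcode fetch has its T2 on G1 as in the tables above, and the machine-cycle tables are the Zilog ones quoted in this thread). It steps T-state by T-state, stalls the WAIT-sampling T-state of each memory or I/O machine cycle until the free cycle, and counts NOPs the usual integer way:

FREE = 1   # cycle (mod 4) where /WAIT is high; with this phase an aligned
           # fetch has its T2 on G1, as in the G-state tables above

# machine-cycle descriptors: (t_states, t_state_that_samples_wait, or None)
FETCH = (4, 2)                        # opcode fetch: T1 T2 R1 R2
MEM   = (3, 2)                        # memory read/write: T1 T2 T3
IO    = (4, 3)                        # port read/write: T1 T2 Tw T3
def INTERNAL(n): return (n, None)     # internal processing, never samples WAIT

def nops(m_cycles):
    """NOP (1us) count for one instruction starting on an aligned boundary."""
    t = 0
    for length, check in m_cycles:
        for state in range(1, length + 1):
            if state == check:
                while t % 4 != FREE:  # wait states until /WAIT is seen high
                    t += 1
            t += 1                    # the T-state itself
    return -(-t // 4)                 # the next opcode fetch realigns to the grid

print(nops([FETCH]))                      # NOP         -> 1
print(nops([FETCH, FETCH, IO]))           # OUT (C),r   -> 4
print(nops([FETCH, MEM, IO]))             # OUT (n),A   -> 3
print(nops([FETCH, MEM, MEM, MEM, MEM]))  # LD (nn),HL  -> 5
print(nops([FETCH, MEM, INTERNAL(5)]))    # JR cc taken -> 3

The only thing that makes OUT (C),r special in this model is that its port write samples WAIT one t-state later than a fetch or memory access does, and after two full-length fetches that is enough to push it past the boundary.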

ralferoo

#43
It took me ages too before I believed that the WAIT generation was as simple as it is. I was convinced it used the M1 and IORQ signals to do something more complicated, but no, it really is that simple... Hopefully I'll show you why now... :) This is largely the same as opqa's explanation, just presented slightly differently...

For reference, this is what I mean by the Z80 user manual: http://tlienhard.com/zx81/z80um.pdf and the important pages start at page 29 (11 in the official numbering).

Instruction Fetch: T1 T2 (checks WAIT) T3 T4 (4+)
Read or Write: T1 T2 (checks WAIT) T3 (3+)
IO: T1 T2 TW (checks WAIT) T3 (4+) - later I'll refer to this as just T1 T2 T3 (checks WAIT) T4

For now, forget the labels T1, T2 etc and just consider the states that check the WAIT signal. So, we now have:
IF: -*--
RW: -*-
IO: --*-

OUT (c),r is described on page 298 (280 official numbering): 4,4,4 (IF, IF, IO)
Writing this out in terms of WAIT states, we have: -*-- -*-- --*-
Aligning this up with the CPC's wait states (I'll add a NOP also, -*--, but any instruction is the same):
                                                                           
                    vv the NOP's fetch T state is stretched by 1 T-state until the next gap
-*-- -*-- --__ _*-- _*--                                                       
1234 1234 12__ _341 _234                                                       
W.WW W.WW W.WW W.WW W.WW                                                       
            ^^ ^^ this T state is stretched by 3 T-states until the next gap   
         
As you can see, 4 extra cycles have been inserted, but it's not as simple as just adding 1us to the IO cycle (even though that is the visible effect, it's actually 2 separate stalls)...

We can also see the difference from OUT (n),A, which you might expect to take 4us, not 3us...

Page 297 (279 official), shows OUT (n),A as: 4,3,4 (IF, RW, IO) - note the shorter 2nd M cycle... ;)
Writing this out in terms of WAIT states, we have: -*-- -*- --*-

Aligning this up with the CPC's wait states (I'll add a NOP also, -*--, but any instruction is the same):
                                                                           
               vv the NOP's fetch T state is stretched by 1 T-state until the next gap
-*-- -*-- -*-- _*--                                                             
1234 1231 2341 _234                                                             
W.WW W.WW W.WW W.WW
           ^ note that this T state does align perfectly                       
     
So, even though the wait check occurs 1 T-state later in the IO M-cycle compared to the others, because here it follows a 3 T-state M-cycle it starts 1 T-state earlier, and so it aligns perfectly.

Hopefully, that makes things clearer. If I've confused you more, I can try to explain it differently... :)

Executioner

I think that's almost exactly what I was saying. It really is as simple in the CPC as EVERY memory or I/O read/write operation being aligned with cycle n (0..3) of every 4 cycles. The latest JEMU source code proves that a Z80 implementation designed using only the fetch, mem read, mem write, I/O read and I/O write operations, with timing exactly as per the Zilog user manual, can be made to have exactly the same timing as a real CPC by aligning the /WAIT to one of the 4 cycles (I'm not sure which one it's actually high on; you'd have to either test with a CRO from reset or read the GA logic. It could also be determined by the exact position of palette changes etc).

If you look at the way it actually works, it is possible for a NOP (or any other 4 T-state instruction) to actually take 7 T-states to complete, if the previous instruction took 5, 9, 13, 17 or 21 T-states.
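
A concrete case of that, as a sketch (same 3-in-4 /WAIT pattern as above, phase chosen so the first fetch is aligned; RET cc with a false condition is a 5 T-state instruction in the Zilog manual, so it is one of the lengths listed):

# RET cc (not taken) = 5 T-states, starting aligned: its T2 hits the free cycle,
# so no stall, and it ends 1 cycle past a 4-cycle boundary.
ret_end = 5
# The following NOP starts there; its T2 would fall on cycle ret_end + 1 = 6,
# which (with the free cycle at 1 mod 4) misses the free slot by 3 cycles.
FREE = 1
stall = (FREE - (ret_end + 1)) % 4        # -> 3 wait states
nop_len = 4 + stall                       # -> 7 T-states for that NOP
print(stall, nop_len, ret_end + nop_len)  # 3 7 12 : still 3us for the pair

The pair still adds up to 12 T-states, i.e. 3us, so the usual per-instruction NOP counts stay consistent.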

Bryce

Although I asked the question, I stopped understanding the posts in this thread back on page one :D

Bryce.

opqa

Quote from: Executioner on 02:37, 12 January 15
It really is as simple in the CPC as EVERY memory or I/O read/write operation is aligned with cycle n (0..3) of every 4 cycles.
Well, I disagree a little bit with this. As I explained at the end of my earlier post, from the Z80's point of view the read operations can take place on either the first or the second cycle out of every 4. Opcode fetches will always take place on the first cycle, but memory reads and I/O inputs will take place one cycle later. Anyhow, this small detail doesn't change the overall timing.

Executioner

Quote from: opqa on 10:07, 12 January 15
Well, I disagree a little bit with this. As I explained at the end of my earlier post, from the Z80's point of view the read operations can take place on either the first or the second cycle out of every 4. Opcode fetches will always take place on the first cycle, but memory reads and I/O inputs will take place one cycle later. Anyhow, this small detail doesn't change the overall timing.

Actually, op-code fetches occur in the same cycle as /WAIT goes high, whereas memory reads and IO occur on the next cycle after /WAIT goes high, so the data has to be available 1 T-state after /WAIT goes high. This does suggest that the GA doesn't do memory reads during those two cycles and the Z80 can, but the /WAIT signal still only goes high for 1 cycle. It's only an assumption, but I'd think the internal operation is something like (using your terminology):

G1: /WAIT high, address and data bus multiplexed for Z80 use
G2: /WAIT low, address and data bus still for Z80 use
G3: /WAIT low, CRTC address on address bus, memory read into GA shift register
G4: /WAIT low, CRTC address + 1 on address bus, memory read into GA shift register

Optimus

Quote from: Nich on 22:19, 08 January 15
Or alternatively, "Those who can, do; those who can't, teach."


I recently heard the same quote from a Greek friend. Too many teachers and professors here, but few who know how to code or have done something practical in their career.

Optimus

Quote from: TFM on 23:02, 09 January 15
I remember Odiesofts sources. He used like a lot of people MAXAM in ROM (I still prefer it that way) and his source was a collection of lines with dozens of Z80 instructions in every line.


Strange, I do that too. I even hate it when an assembler doesn't support this with the semicolon.
It's kind of like grouping many opcodes that do one thing in my mind.
Otherwise, you have those listings with one opcode below the other (and with TABs), but then one single routine could be five pages long, instead of being summarized in a single page.
Or when I unroll loops manually, it's one line with many opcodes separated by semicolons, so I copy this line many times. Imagine if every one of those opcodes had to go on a new line..


As for the discussion, it reminds me of that little joke: pick any two of elegant, fast and small.
