The CPC Revision Zero (Article)

andycadley · 22:18, 16 October 24

Quote from: Bread80 on 21:16, 16 October 24
As for the thread, I was expecting an interesting discussion about what's actually on the board :shrug:

As your original article notes, the board is barely functional and key control signals weren't even wired up. It seems pretty obvious that the "designers" just didn't really have much of a clue what they were doing and it's not much of a surprise that Amstrad ditched it in favour of getting someone with some actual experience to build something.

ZorrO · 23:18, 16 October 24

The only really slow function in BBC Basic is LOG. If a benchmark uses it, the CPC turns out to be a bit faster. But this is a rarely used function, and without it the BBC Micro's Basic is 2 times faster than the CPC's.

And as for assembler, I'm no expert, but from what I've read, when comparing a Z80 clocked 2 times faster than a MOS 6502, single-byte calculations or copying data byte by byte are faster on the 6502. But with 16-bit numbers and registers, which the 6502 doesn't have, so it has to handle them in installments, the Z80 is faster at counting and copying. A Z80 clocked 4 times faster does even 8-bit calculations faster. And I have no idea which of the two, 8-bit or 16-bit, appears in code more often.
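To illustrate what "in installments" means here, the following is a minimal Python sketch of a 16-bit addition done as two 8-bit additions linked by a carry, which is the pattern a 6502 follows (CLC, ADC on the low bytes, ADC on the high bytes). The function name `add16_via_8bit` is made up for illustration, and the sketch models the arithmetic only, not the timing.

```python
def add16_via_8bit(a: int, b: int) -> tuple[int, int]:
    """Add two 16-bit values using only 8-bit additions plus a carry.

    Returns (16-bit result, carry out of the high byte).
    """
    a_lo, a_hi = a & 0xFF, (a >> 8) & 0xFF
    b_lo, b_hi = b & 0xFF, (b >> 8) & 0xFF

    # Low bytes: the carry starts cleared (CLC), then add (ADC).
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 8
    lo = lo_sum & 0xFF

    # High bytes: a second ADC picks up the carry from the low bytes.
    hi_sum = a_hi + b_hi + carry
    carry_out = hi_sum >> 8
    hi = hi_sum & 0xFF

    return (hi << 8) | lo, carry_out


result, carry_out = add16_via_8bit(0x12FF, 0x0001)
print(hex(result), carry_out)  # prints: 0x1300 0
```

A Z80 can do the same addition with a single `ADD HL,DE` once both values are in register pairs, which is the difference the post is describing.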

PulkoMandy · 12:41, 17 October 24

As Bread80 says, the limiting factor in most 8 bit machines is not the CPU, but the memory bandwidth. They all deal with this in different and interesting ways.

On the CPC, the CPU is slightly slowed down. On the C64, it is slowed down only on some lines while the data is copied to some internal video chip memory. On the Alice 32 and VG5000, you have separate video RAM not accessible directly by the CPU. On Thomson TO and MO machines, the memory is 16 bit wide so that the video logic can fetch 16 bits at a time, and the CPU accesses one of the other banks (and later on they changed this to use "page mode", so the video logic can do two 8 bit accesses to related memory addresses in quick succession).

And that's just a few examples I know of.

Without taking this into account, it's difficult to compare just the CPU. Yes, maybe in theory one can run some loops and computations slightly faster, or use fewer instructions. But then you have to take into account the interaction with other logic on the motherboard and it gets a bit more complicated. So you have to run benchmarks on a complete machine, and transplanting a CPU into a machine designed for another biases your results if the goal is comparing the CPU (you are testing one in its native environment, and the other in an "alien" environment).

So, what could a 6502 CPC have looked like? We can only speculate. The motherboard shows that the people attempting to build it might not have had a clear vision of it either. By contrast, the production machines show, I think, that MEJ had a pretty good idea of what they were doing and a good understanding of the chipset they used. And that's what makes a machine fast and cheap, not any particular choice of CPU.

andycadley · 15:19, 17 October 24

Quote from: PulkoMandy on 12:41, 17 October 24
On the C64, it is slowed down only on some lines while the data is copied to some internal video chip memory. On the Alice 32 and VG5000, you have separate video RAM not accessible directly by the CPU. On Thomson TO and MO machines, the memory is 16 bit wide so that the video logic can fetch 16 bits at a time, ...

The C64 also cheats slightly in this regard, in that its "colour RAM" (which is only 4 bits wide) has a second port connected directly to the VIC so that the graphics hardware can read from it without disturbing the CPU, effectively giving it 12-bit video bandwidth. That has the downside, though, that you can't move the colour information around in RAM like you can everything else - this is why a lot of C64 games stick to a consistent three colour background, so they don't need to shift data around as much.

eto · 20:24, 17 October 24

Quote from: Bread80 on 21:16, 16 October 24
As for the thread, I was expecting an interesting discussion about what's actually on the board :shrug:

I guess there is not much to discuss as there is not very much on the board that is "surprising" as it seems to be very similar to the architecture we have in the final product.

Actually, the only surprising aspect for me is that they already had the core architecture while they were still thinking about the 6502.

It also shows how simple the architecture was and that the only reason we ended up with a machine that we love was the genius that has been put into the GateArray.

SerErris · 11:52, 18 October 24

Some cents from my end on that debate.

The Z80 has a lot more registers, which can even be used in different modes. The RAM models are vastly different and the 6502 actually has no 16bit operations at all, not even 16bit load operations. Even the stack is always an 8bit operation, with the only exception being the call and return operations, which need to fetch two bytes for the address...

The 6502 has many other limitations like pages, where effectively you need to watch out that your code does not cross a page boundary, or things will get really bad.

Also the 6502 had no IO commands, which means the IO interfaces always needed to be memory mapped. That is not a problem if you have lots of otherwise unused RAM, however in the time and age of 64 kbyte that was an issue. Maybe a minor thing anyhow.

All in all, for the skilled assembler programmer looking at pure CPU benchmarks, I would say that the formula is 6502 1mhz = z80 2mhz. However as soon as 16bit calculations come into play, that is all lost and the Z80 will gain the upper hand. Esp if you make massive use of registers you can save a lot of memory cycles here.

Just for quick comparison:

Code:

6502@2Mhz (1 clock = 1 cycle)        Z80@4Mhz (4 clocks = 1 cycle)
LDA #imm     = 2 cycles (1µs)        LD  A,#imm    = 2 cycles (2µs)
ADC B        = 2 cycles (1µs)        ADC A,B       = 1 cycle  (1µs)
ADC Zeropage = 3 cycles (1.5µs)      ADC A,Memory  = 2 cycles (2µs)
ADC $FFFF    = 4 cycles (2µs)        ADC A,($FFFF) = 6 cycles (6µs)
  ; This operation does not exist on the Z80; it would be
  ; LD B,A       (1 cycle)
  ; LD A,($FFFF) (4 cycles)
  ; ADC A,B      (1 cycle)

Looks like the 6502 is much faster, right?

However if you have the address in HL, you could do:
ADC A,(HL) (2 Cycles) 2µs
; this will speed up the process dramatically, esp in loops.
; and processor commands like DJNZ are a really dramatic improvement.

Adding two 16 bit numbers from memory and then storing them back into a 3rd position is really painful:
6502
LDA N1LO (4 cycles)
CLC (2 cycles)
ADC N2LO (4 cycles)
STA RSLTLO (4 cycles)
LDA N1HI (4 cycles)
ADC N2HI (4 cycles)
STA RSLTHI (4 cycles)

Total (26 cycles = 13 µs)

Now how does z80 do?
LD HL,(N1) (5 NOPs)
LD DE,(N2) (6 NOPs)
ADD HL,DE (3 NOPs)
LD (RST),HL (5 NOPs)

Total (19 NOPs = 19 µs)

So even there - the 6502 has a slight gain.

If we now compare C64 with CPC that will result in
26µs C64
19µs CPC

Consider that in a loop and it will add up very fast.
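As a quick check on the conversions above, here is a small Python sketch that totals the per-instruction counts from the two listings and turns them into microseconds. The cycle and NOP counts are copied from the post; the helper names (`total`, `us_6502`, `us_cpc`) are made up for illustration, and the only timing assumptions are that a 6502 cycle is one clock and that one CPC NOP takes 1 µs.

```python
# (instruction, cycles) for the 6502 listing above
SEQ_6502 = [
    ("LDA N1LO", 4), ("CLC", 2), ("ADC N2LO", 4), ("STA RSLTLO", 4),
    ("LDA N1HI", 4), ("ADC N2HI", 4), ("STA RSLTHI", 4),
]

# (instruction, NOPs) for the Z80 listing above, as timed on a CPC
SEQ_Z80_CPC = [
    ("LD HL,(N1)", 5), ("LD DE,(N2)", 6), ("ADD HL,DE", 3), ("LD (RST),HL", 5),
]


def total(sequence):
    """Sum the cycle or NOP counts of an instruction sequence."""
    return sum(count for _, count in sequence)


def us_6502(cycles, clock_mhz):
    """One 6502 cycle is one clock, so time in µs = cycles / MHz."""
    return cycles / clock_mhz


def us_cpc(nops):
    """Assumes one NOP on the CPC takes 1 µs."""
    return nops * 1.0


cycles = total(SEQ_6502)   # 26
nops = total(SEQ_Z80_CPC)  # 19
for clock_mhz in (1.0, 2.0):
    print(f"6502 @ {clock_mhz} MHz: {us_6502(cycles, clock_mhz)} µs")
print(f"Z80 on CPC: {us_cpc(nops)} µs")
```

Keeping the clock as an explicit parameter makes it obvious which 6502 speed a given figure refers to: 26 cycles is 13 µs at 2 MHz but 26 µs at 1 MHz.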

But again all of that does not tell us anything about the speed of the system. You would program the 6502 taking the Zero page into account for pretty much everything, and then it is fast. If it is used in a similar way to a Z80, it will be very slow. The same can be said the other way around. If you just ignored all 16 bit registers and only used A and B, that would be dramatically slow and inefficient, as memory cycles are slow on the Z80.

All in all - lots of discussions with no outcome ...

Let's put it in perspective:
Both are dead slow compared to anything we have today, even the tiniest Atmel microcontroller. Hell, multiplication/division anyone? Or floating point?

andycadley · 13:17, 18 October 24

Quote from: SerErris on 11:52, 18 October 24
Just for quick comparison:

Code:
6502@2Mhz (1 clock = 1 cycle) Z80@4Mhz (4 clocks = 1 cycle)

I mostly agree with your points above, but I'll point out the maths error here. Either you meant to write 6502@1Mhz or you need to double the number of cycles for Z80.

SerErris · 13:37, 18 October 24

No, that is exactly how it is - the cycles are measured differently.

A cycle on 6502 is a single clock.
A cycle (M cycle) on Z80 is 4 clocks (let's ignore the potentially optimized commands for now).

So on a 1Mhz 6502 a single cycle would take 1µs. On a 2Mhz 6502 it would take 0.5µs.

Interestingly (I found this out just now), NOP also takes 2 cycles on the 6502.

So a NOP will be 2µs@1Mhz and 1µs@2Mhz.

The Z80 is also measured in NOPs - which is easier to calculate - so 1 NOP = 4 clock cycles = 1 M cycle = 1µs@4Mhz.

So that is the proof. The CPC can process NOPs at twice the speed of a C64.

That makes the CPC twice as fast as the C64 at doing nothing.

SerErris · 13:43, 18 October 24

One last note:

I would consider the Z80 speed to be 1/4 of the clock rate. So it is actually a 1Mhz CPU - that is also the minimum time spent on any opcode, and any opcode actually takes a multiple of that. This is esp. true on a CPC, which synchronizes the CPU to 4 clock cycles per operation.

If we look at the 6502 by contrast, it can use multiples of the clock directly for different opcode timings (e.g. 2,3,4 ...).
So the speed of the 6502 is really 1Mhz, whereas the Z80 is not really at 4Mhz, but very similar to a 1Mhz clock with 4 subdivisions .. if you see my point.

And if you then compare both - the Z80 is actually quite a bit faster than the 6502. But again that is just my opinion on how to compare stuff.

andycadley · 13:46, 18 October 24

Yes, but that's exactly my point. If 1 cycle on the 6502@2Mhz takes 0.5µs and 1 cycle on the Z80 takes 1µs, then you can't directly compare cycle counts. The Z80 is taking twice as long to execute the same number of cycles (how long a NOP is doesn't really matter).

It works fine like that if you're talking about a 6502@1Mhz, such as in the C64.

GUNHED · 16:37, 18 October 24

You just never admit if somebody else is right, do you?

andycadley · 17:26, 18 October 24

Quote from: GUNHED on 16:37, 18 October 24
You just never admit if somebody else is right, do you?

Dude, there are some things which are opinions and everyone can have their own. This one is just basic maths.

1000 cycles at 1us per cycle takes 1 second. 1000 cycles at 0.5us per cycle takes 0.5 seconds. Surely you can see that? It's not like you have to say the 6502 is better or anything?

GUNHED · 13:33, 19 October 24

Quote from: andycadley on 17:26, 18 October 24
Quote from: GUNHED on 16:37, 18 October 24
You just never admit if somebody else is right, do you?
Dude, there are some things which are opinions and everyone can have their own. This one is just basic maths.

1000 cycles at 1us per cycle takes 1 second. 1000 cycles at 0.5us per cycle takes 0.5 seconds. Surely you can see that? It's not like you have to say the 6502 is better or anything?

Lad, pure math is not everything. With CPUs there's lots more to that. And a portion of humor would be desirable in this case too.

And btw: 1000 cycles at 1us per cycle takes NOT 1 second.
It's 1000000 cycles at 1us per cycle that takes 1 second.
That much about math!

SerErris · 16:53, 19 October 24

Quote from: andycadley on 13:46, 18 October 24
Yes, but that's exactly my point. If 1 cycle on the 6502@2Mhz takes 0.5µs and 1 cycle on the Z80 takes 1µs, then you can't directly compare cycle counts. The Z80 is taking twice as long to execute the same number of cycles (how long a NOP is doesn't really matter).

It works fine like that if you're talking about a 6502@1Mhz, such as in the C64.

But isn't that actually why I put in all the µs and not just the cycles?

Who cares about cycles?

And yes a 2Mhz 6502 is pretty much the same speed as a 4Mhz Z80 in those operations - this is actually what I said.

So a NOP on a 2Mhz 6502 is 1µs and likewise on the Z80@4Mhz.

Anyhow I think the point is clear.

What makes this whole point moot is that you will program them totally differently anyhow, because of the advantages of each architecture.

andycadley · 17:24, 19 October 24

Quote from: GUNHED on 13:33, 19 October 24
And btw: 1000 cycles at 1us per cycle takes NOT 1 second.
It's 1000000 cycles at 1us per cycle that takes 1 second.
That much about math!

Quite right, my bad for rushing a post before heading out.

But the point still stands: if you're going to invent a unit like "cycle" rather than comparing clocks, it doesn't make sense to make them incomparable.

Quote from: SerErris on 16:53, 19 October 24
But isn't that actually why I put in all the µs and not just the cycles?

Who cares about cycles?

And yes a 2Mhz 6502 is pretty much the same speed as a 4Mhz Z80 in those operations - this is actually what I said.

It's really unclear what you're trying to compare in your original post because you keep switching between 1Mhz and 2Mhz 6502 timings. And I'm not sure you can say they're the same speed when the 6502 is doing a 16-bit add in 13us vs the Z80 taking 19us, even though that's arguably where the Z80 is supposed to be stronger.

But yes, this doesn't alter the fact that the CPUs require different approaches to achieve optimal results (hence it being difficult to just do a 1:1 routine timing) nor that changes to the overall architecture can make even more substantial differences (not that it makes much difference if we're talking about a 6502 based CPC as per the OP).

GUNHED · 22:03, 20 October 24

Well, is there a game / program / application on a 2 MHz 6502 which is pretty good - so that we can see, if this can be done in a quicker way on the 4 MHz Z80 on CPC?

Of course it shouldn't be something which mainly uses screen access, instead it shall use CPU in a heavy way (and I'm not talking about data-transfer into screen).

My best examples are still freescape and vector games. And in 8 Bit world, the CPC seems to me to be leading.

BTW: In Forth a 16 Bit addition is done this way:
POP HL
POP DE
ADD HL,DE ; This takes 3 * 3us = 9 us only

andycadley · 22:40, 20 October 24

As I said, it's a dumb way of comparing CPU performance to just point at individual games but if you really must then Elite on the BBC B is probably the best version.

And the Atari 8-bit port of Total Eclipse is pretty impressive (even if not quite a 2Mhz CPU)

GUNHED · 12:59, 21 October 24

CPC beats them all

SerErris · 15:24, 21 October 24

Quote from: GUNHED on 22:03, 20 October 24
Well, is there a game / program / application on a 2 MHz 6502 which is pretty good - so that we can see, if this can be done in a quicker way on the 4 MHz Z80 on CPC?
Of course it shouldn't be something which mainly uses screen access, instead it shall use CPU in a heavy way (and I'm not talking about data-transfer into screen).
My best examples are still freescape and vector games. And in 8 Bit world, the CPC seems to me to be leading.
BTW: In Forth a 16 Bit addition is done this way:
POP HL
POP DE
ADD HL,DE ; This takes 3 * 3us = 9 us only

Yeah if both values are on the stack? So this is typical fortran/C style - call a function, put arguments on the stack.

That is all good, but you would need to still read the variable into HL/DE and push them, then call the function, and then return.

And then that does not make any sense whatsoever - because still you would need to have painful stack handling to even be able to return to the calling function ... It is a fast add, but it does not load the variables first, which the 19 NOPs version does.

So the question is:

How do we get the values from variable space onto the stack?
How do we call the add routine then?
How do we handle the stack correctly inside the add routine, so that it calculates correctly and returns safely?

That is all possible and standard technology, but will cost more NOPs to do it.

A typical implementation would more likely be a call to an add routine with the variables handed over in registers, and even then you would probably only run ADD HL,DE inside of the routine, which adds a CALL and a RET to the equation.

All in all as ADD is a single Opcode (even for 16 bit), it does not make any sense to call it, or to push the variables to add to the stack and then call it. Everything will take longer than that.

If you would want a universal add routine, that would probably use the two index registers:

Code:

main:
    LD IX,varin1         ; IX = address of the first 16-bit variable
    LD IY,varin2         ; IY = address of the second 16-bit variable
    CALL add16
    ...
    RET

add16:                   ; named add16 because "add" clashes with the ADD mnemonic
    ; adds two 16-bit numbers from the variables pointed to by IX and IY
    ; returns the result in HL
    LD  L,(IX+0)         ; Load the lower byte of the first 16-bit number
    LD  H,(IX+1)         ; Load the upper byte of the first 16-bit number

    LD  E,(IY+0)         ; Load the lower byte of the second 16-bit number
    LD  D,(IY+1)         ; Load the upper byte of the second 16-bit number

    ADD HL,DE            ; Add DE to HL (HL = HL + DE)
    RET

varOut:
    DW 0
varin1:
    DW 0
varin2:
    DW 0

And to be really universal you would probably push the addresses of the variables instead of the values - so all in all it will require much more time for the handling around it than calculating it directly.
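To put a rough number on that overhead, here is a small Python sketch that adds up NOP counts for the index-register routine and for the inline loads plus add. The counts for `LD HL,(nn)`, `LD DE,(nn)` and `ADD HL,DE` come from the earlier post; the others are commonly quoted CPC timings and should be treated as assumptions. It also assumes IX and IY are loaded with immediate addresses (`LD IX,nn`), and neither sequence stores the result back to memory.

```python
# NOPs per instruction on a CPC (1 NOP = 1 µs). Values not taken from the
# thread are assumptions based on commonly quoted CPC timings.
NOPS = {
    "LD IX,nn": 4,      # assumption
    "LD IY,nn": 4,      # assumption
    "CALL nn": 5,       # assumption
    "LD r,(IX+d)": 5,   # assumption
    "LD r,(IY+d)": 5,   # assumption
    "RET": 3,           # assumption
    "LD HL,(nn)": 5,    # from the thread
    "LD DE,(nn)": 6,    # from the thread
    "ADD HL,DE": 3,     # from the thread
}

# Caller setup plus the body of the index-register add routine
VIA_ROUTINE = [
    "LD IX,nn", "LD IY,nn", "CALL nn",
    "LD r,(IX+d)", "LD r,(IX+d)",
    "LD r,(IY+d)", "LD r,(IY+d)",
    "ADD HL,DE", "RET",
]

# Loading both values directly and adding them in place
INLINE = ["LD HL,(nn)", "LD DE,(nn)", "ADD HL,DE"]


def total_nops(sequence):
    """Sum the NOP counts for a list of instruction names."""
    return sum(NOPS[op] for op in sequence)


print("via routine:", total_nops(VIA_ROUTINE), "NOPs")  # 39
print("inline:     ", total_nops(INLINE), "NOPs")       # 14
```

Under these assumptions the generic routine costs well over twice as much as the inline version, which is consistent with the point that wrapping a single ADD in a call rarely pays off.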

Prodatron · 15:51, 21 October 24

Quote from: SerErris on 11:52, 18 October 24
Adding two 16 bit numbers from memory and then storing them back into a 3rd position is really painful:
6502
LDA N1LO (4 cycles)
CLC (2 cycles)
ADC N2LO (4 cycles)
STA RSLTLO (4 cycles)
LDA N1HI (4 cycles)
ADC N2HI (4 cycles)
STA RSLTHI (4 cycles)

Total (26 cycles = 13 µs)

Now how does z80 do?
LD HL,(N1) (5 NOPs)
LD DE,(N2) (6 NOPs)
ADD HL,DE (3 NOPs)
LD (RST),HL (5 NOPs)

Total (19 NOPs = 19 µs)

So even there - the 6502 has a slight gain.

I never coded in 6502, but I can imagine that in many cases the Z80 will still win in exactly this situation (adding 16bits):
- for a 16bit addition the 6502 HAS to load 16bit values from memory and HAS to save them to memory again after adding them; you only have three 8bit registers in total
- the Z80 will load values and then keep them in the registers while working with them. You will hardly see exactly the example above in any code. Only at the very beginning of a function and at the very end you may have these (slow) LD HL,(nn)/(nn),HL instructions or something similar

A pure 16bit addition takes 3 microseconds for the Z80. If I am not completely wrong, on the 6502 when you use XY for storing one parameter and the result you are not able to do it faster than in 6.5 or 5.5 microseconds. So in this case the Z80 has double the speed

At least these are my thoughts...

andycadley · 16:37, 21 October 24

Quote from: Prodatron on 15:51, 21 October 24
A pure 16bit addition takes 3 microseconds for the Z80. If I am not completely wrong, on the 6502 when you use XY for storing one parameter and the result you are not able to do it faster than in 6.5 or 5.5 microseconds. So in this case the Z80 has double the speed

At least these are my thoughts...

IIRC a store to zero page is 3 cycles, a store to an arbitrary memory location is 4. So the quickest you could write a two byte value to RAM is 3us (6 cycles @2Mhz). I'd agree that a Z80 routine that can do multiple 16-bit calculations using only register values should be quicker.

Again, micro-benchmarking isn't much better than comparing entire games because there's not enough context to know what a real world scenario would be. Longer, more thought out benchmarking over the many, many times this has come up over the decades has generally settled on a 1Mhz 6502 being roughly on par with a 2.5Mhz Z80 (which tallies with the Spectrum and CPC generally outperforming the C64 on CPU heavy loads while machines like the A8 or BBC B have a slight edge).
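As a rough illustration of that rule of thumb, the Python sketch below scales each 6502 machine's clock by 2.5 to get a "Z80-equivalent" figure. The clock speeds are nominal values added here as assumptions (they are not from the thread) and ignore video contention and wait states, so this is a sanity check on the ratio rather than a benchmark.

```python
Z80_MHZ_PER_6502_MHZ = 2.5  # rule of thumb quoted above

# Nominal CPU clocks in MHz; assumptions for illustration only
MACHINES_6502 = {"C64": 1.0, "Atari 8-bit": 1.79, "BBC B": 2.0}
MACHINES_Z80 = {"Spectrum": 3.5, "CPC": 4.0}


def z80_equivalent_mhz(mhz_6502):
    """Scale a 6502 clock to the Z80 clock the rule of thumb equates it to."""
    return mhz_6502 * Z80_MHZ_PER_6502_MHZ


for name, mhz in MACHINES_6502.items():
    print(f"{name}: {mhz} MHz 6502 ~ {z80_equivalent_mhz(mhz):.2f} MHz Z80")
for name, mhz in MACHINES_Z80.items():
    print(f"{name}: {mhz} MHz Z80")
```

With these nominal figures the C64 lands below both Z80 machines while the Atari and BBC B land slightly above them, which matches the ordering described in the post.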

In all cases the supporting hardware is far more likely to swing any one task, other than just raw number crunching, in favour of or against a particular machine depending on how good a fit it is and/or how well the code was optimised for it.
