CRTC help

Started by ervin, 07:00, 27 February 13


ervin

Hi everyone.

I'm totally stumped. :(
I've gone through the CRTC posts on cpcwiki looking for clues, and I certainly have learned a bit, but I'm still baffled by it all.

I'm still working on my Chunky Pixel Collision project, but I want to try and make the display faster.
And that means using the CRTC to repeat raster lines, instead of doing them all in software.

Now, the last time this topic came up, it was agreed that it couldn't really be done in my case, as I was only updating parts of the screen that had changed from the previous frame to the current frame. This meant that the time to display a raster line was unpredictable.

So, I've rewritten the display routine to refresh an entire raster line instead of only the parts that have changed.
So I'm now refreshing 50 raster lines fully - each one of those 4 lines apart.
(Have a look at the attached screenshot to see what I mean).

As a result, the amount of time taken to display each raster line is exactly the same every single time.
Does this make it feasible to use the CRTC to repeat each raster line 3 times?

Can anyone offer any advice as to how I would do this?

Incidentally, using this technique (but without the overheads of maintaining the CRTC stuff), I've got full-screen Savage moving around at approx. 14 frames per second, which is excellent for the type of game I want to create.

Thanks for any help.


ralferoo

Quote from: ervin on 07:00, 27 February 13
As a result, the amount of time taken to display each raster line is exactly the same every single time.
Does this make it feasible to use the CRTC to repeat each raster line 3 times?
Yes, but it requires a lot of CPU hand-holding. You can have a look at the source for my Sugarlumps demo if you like: ralferoo/sugarlumps · GitHub, which does exactly this, except I repeat every line 4 times rather than 3.

As you know, to get a line to repeat forever, you need to set R1 > R0 so that the latch that triggers the end of display line circuitry isn't active. And to get back to normal, you restore R1 to some value less than R0.

The trick is very simply to set R1>R0, wait 2 lines, set R1<R0, wait 1 line, repeat.

Timing doesn't need to be massively accurate; you just need to make sure that you hit the registers when the horizontal position is between 0 and the lower R1 point. However, you will find it's a real pain-in-the-ass, as you need to maintain this for the full visible region of the screen. I solved it by using code generation, so I knew how long each instruction took and added it to my count. As soon as I hit the threshold, I'd emit EXX:OUT (C),C:EXX or EXX:OUT (C),B:EXX (knowing that B was #BD and so greater than R0). You then subtract another 6us for the outs and add the delay to the next event (in your case 128us or 64us).
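To make that concrete, here's a minimal sketch of the toggle (my own illustration rather than code lifted from Sugarlumps; it assumes R0=63 and a normal R1 of 40, and it ignores the interrupt handling discussed below):

ld bc,#bc01     ; CRTC register-select port: pick R1 once, up front
out (c),c
ld bc,#bd28     ; B=#BD (CRTC data port), C=40 (the normal R1, < R0)
line_loop:
out (c),b       ; write #BD (189) to R1: now R1 > R0, the line repeats
; ... exactly 2 scanlines' worth of generated code (128us) ...
out (c),c       ; write 40 to R1: R1 < R0 again, the display advances
; ... exactly 1 scanline's worth of generated code (64us) ...
jp line_loop    ; in practice this is fully unrolled, not a real loop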

You can do conditional branches with this technique, but you need to carefully ensure that both branches take the same time. I've done that a bit in my code for the triangle edge drawer, but for the bits of code where I need a lot of conditionals, I save that work until after the "visible" area of the screen. In my case, I have a scrolling message that's deliberately quite big, which reduces the area I need to redraw on the main screen; since that part is effectively no longer the visible area of the screen, I don't need to worry about precise timing there either.

You also have interrupts to consider. You could just disable them to make counting easier, or you can enable them and count the time required in your calculations. Note, you'll also have to deal with jitter, as the interrupt will only happen at the end of an instruction. I explicitly manage, to the cycle, exactly when an interrupt happens: I keep interrupts disabled until just after I've changed R1, enable them briefly, and then disable them again immediately afterwards. If you keep interrupts disabled for longer than 32 lines then you'll have problems with interrupts occurring in the wrong places, so if you want to use the interrupt that happens during vsync to resync (and I'd advise doing that) then you need to consider this.
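Roughly like this (a sketch of the idea only, not my actual demo code; it assumes BC is still set up for the CRTC data port, and the exact placement depends on your timing budget):

di              ; interrupts stay off through the cycle-counted region
; ... timed CRTC/render code ...
out (c),c       ; restore R1
ei              ; open a one-instruction window
nop             ; a pending interrupt is taken exactly after this
di              ; window closed: jitter is back under our control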

arnoldemu

There is no way to do this with "0%" cpu time. You need to set crtc values every line. If you did this you'd be left with just the upper/lower border time to do your calculations.
I don't know if that is going to be enough?

ralferoo

Quote from: arnoldemu on 10:11, 27 February 13
There is no way to do this with "0%" cpu time. You need to set crtc values every line. If you did this you'd be left with just the upper/lower border time to do your calculations.
I don't know if that is going to be enough?
The bottom/top border area on a normal full-size screen is 36% of the CPU time of the frame, so that should be plenty to do your variable code-path calculations.
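(Quick sanity check on that figure, assuming a standard 312-scanline frame with 200 visible pixel lines: 312 - 200 = 112 border lines, and 112/312 is roughly 36%.)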

I'd have thought the majority of the time in most games would be drawing the sprites, which tends not to have any conditionals if the code is unrolled, and that can be interleaved with the CRTC-fiddling code. If you split it into logical blocks, like "drawing a line of this sprite takes 60us (e.g. 12 LDI)", and you know you have 12us every 3 lines to fiddle with the CRTC, you can interleave the code for exactly 3 render lines into one chunky line of code. But really, it all depends on exactly what's being rendered in this particular case.

But a good trick is to do all the calculations in one place and just make a list, so that the renderer has all the information it needs and you're only interleaving render code with CRTC code, not render code and calculations. For example, my line renderer does all the initial calculations upfront and then saves them, such that each block of interleaved code (21us) is exactly one pixel line step.

ervin

That's absolutely brilliant information ralferoo - very very helpful indeed!
Might take a while for me to understand it all, but that's ok.  :)

Indeed, I realise that there's no way to do this with no cpu time.
I'm just hoping that I manage to make the program run faster than the pure software version of repeating lines.

If not, ah well. At least I will learn something new!

arnoldemu

Games are more variable, in that there can be a different number of sprites on the screen, and the amount of time it takes to perform an AI update can vary as well.


ervin

That's not a problem, fortunately, as all my sprite updates are written to a 4K buffer (and yes this part is definitely variable), but then the contents of that 4K buffer are LDI'd to the screen in a very predictable fashion.

The LDI'ing takes the same amount of time, every time.
So theoretically, using the CRTC to repeat raster lines should be very doable.

I just need to figure out how!
:)


db6128

How many NOPs are consumed by your current method of rendering?

ervin

For each raster line I do the following:

ld hl,nnnn
ld de,nnnn
call LDI80

The LDI80 subroutine does this:

ldi (80 times)
ret

So that gives me a total of 1327 t-states, which is around 332 NOPs.
All of that happens 50 times per frame, so we're looking at 16,600 NOPs.
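For reference, that count breaks down as follows (using the documented Z80 T-states, before any CPC rounding):

ld hl,nnnn     10 T
ld de,nnnn     10 T
call LDI80     17 T
ldi x 80     1280 T  (80 x 16)
ret            10 T
total        1327 T  (~332 NOPs at 4 T per NOP)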

ralferoo

That code is an ideal candidate for interleaving with your CRTC code as it has a very simple repeating pattern.

I'd probably consider reworking the code so it looks something like this though:


ld hl,nnnn
push hl
ld de,nnnn
push de
call transfer

transfer:
pop ix      ; return address
pop de      ; destination
pop hl      ; source
ldi x 80    ; 80 unrolled LDIs
jp (ix)     ; return


It looks like it's doing more work, but actually the benefit is that it's simple to shift all the calculations up. So, for example, with 2 blocks:

ld hl,nnnn
push hl
ld de,nnnn
push de

ld hl,nnnn
push hl
ld de,nnnn
push de

call transfer
call transfer


Then you might think, "that's a lot of calls" and do something like:


ld hl,0   ; end marker
push hl

ld hl,nnnn
push hl
ld de,nnnn
push de

ld hl,nnnn
push hl
ld de,nnnn
push de

call transfer_all

transfer_all:
pop ix              ; return address

transfer_loop:
pop de              ; destination, or the 0 end marker
ld a,e
or d
jp z,transfer_done  ; hit the marker: all blocks done
pop hl              ; source
ldi x 80
jp transfer_loop

transfer_done:
jp (ix)


This looks superficially like it's got worse. However, you can unroll the main transfer loop to get rid of the jp transfer_loop. Unroll the loop as many times as you need to get very close to an integer number of chunky pixel lines (so in your case, close to a multiple of 3*64-2*6=180us). Don't forget the final jump!

If you're doing this call exactly 50 times, then no problem - this routine will always take a constant time, so maybe you could use b and djnz instead on the bigger blocks. But more generally, you won't know exactly how much data you have, so you'll need several transfer_done targets; each one will continue doing the CRTC work, but with NOPs instead of LDIs. See the sketch of one unrolled chunk below.
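As a rough shape (my own sketch, not code from the demo; B'=#BD and C'=the normal R1 value as before, and the pop/LDI counts are just one way of filling the 122us and 58us budgets):

chunk:                   ; one chunky line = 3 scanlines = 192us
exx : out (c),b : exx    ; 6us: R1 := #BD (> R0), this scanline repeats
pop de : pop hl          ; 6us: next destination/source off the stack
ldi x 23 : nop           ; 116us: transfer work, padded to the budget
exx : out (c),c : exx    ; 6us: R1 := normal (< R0), display advances
ldi x 11 : pop de        ; 58us: more transfer work (hl carries on)
; ... fall straight through into the next chunk ...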

Anyway, the key point to take away is that all the calculations are done in a different place to the rendering.

There are many improvements you can make to this. You can have multiple stacks, so maybe have one just for a list of render data. Replacing call:pop ix:jp (ix) with a jump to a known location, not using the ld a,e:or d:jp z if you know exactly how much data you have, etc...

ervin

Thanks ralferoo.
Very interesting... though it'll take some time to digest!

Do you mean that I should be putting crtc code in between these rendering blocks somehow?

Axelay

Quote from: ervin on 02:51, 28 February 13
For each raster line I do the following:

ld hl,nnnn
ld de,nnnn
call LDI80

The LDI80 subroutine does this:

ldi (80 times)
ret

So that gives me a total of 1327 t-states, which is around 332 NOPs.
All of that happens 50 times per frame, so we're looking at 16,600 NOPs.


T-states aren't an accurate reflection of CPU time on the CPC. The way the CPU and display share access to the memory means the T-states of individual instructions get rounded up to the nearest multiple of 4 (I believe), and in addition a few instructions take one NOP longer to execute than you might expect from the T-states. LDI is one of those: it takes 5 NOPs, so 80 LDIs will take 400 NOPs rather than 320, which will make a significant difference to how much of the buffer you can copy between the CRTC changes ralferoo is describing.
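In CPC NOP units, your loop works out more like this (my arithmetic, using the usual rounded-up CPC timings):

ld hl,nnnn     3 NOPs
ld de,nnnn     3 NOPs
call LDI80     5 NOPs
ldi x 80     400 NOPs  (80 x 5)
ret            3 NOPs
total        414 NOPs per line, so ~20,700 NOPs for 50 lines per frame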

ervin

I've often wondered if something like that was going on.
I've made some really insane nitpicky optimisations (in tight loops) in the past involving 1, 2 or 3 t-states, and it made no difference.

This would explain why!


ralferoo

This is the page I used to refer to http://www.cpc-power.com/cpcarchives/index.php?page=articles&num=65

Nowadays, I just remember them or work them out from the Z80 manual. It's usually just 1us per memory cycle and sometimes an extra one if 16 bit maths is used. There are exceptions though (e.g. IO instructions), so it's worth learning properly.

db6128

Reference, measured by arnoldemu and Executioner, and reformatted by cpcitor:
Craving for speed ? A visual cheat sheet to help optimizing your code to death.

Executioner

Quote from: ralferoo on 13:28, 01 March 13
This is the page I used to refer to http://www.cpc-power.com/cpcarchives/index.php?page=articles&num=65

Nowadays, I just remember them or work them out from the Z80 manual. It's usually just 1us per memory cycle and sometimes an extra one if 16 bit maths is used. There are exceptions though (e.g. IO instructions), so it's worth learning properly.

I thought I'd clarify the way the CPC actually does its timing. The /WAIT signal to the processor is held low for 3 out of every 4 cycles of the 4MHz clock. These cycles are used for reading display data, leaving a single cycle out of every 4 where /WAIT is not held low and the Z80 is able to read or write to/from memory or I/O devices (I'll call it the active cycle). This means every instruction fetch (and memory read/write) is aligned to this active cycle.

When an interrupt occurs the Z80 inserts 2 T-states before the interrupt processing begins. If the previous instruction ended with at least 2 cycles before the active cycle (e.g. 6 T-state instructions) then the first memory read/fetch of the interrupt will proceed immediately; otherwise it will have to wait for the next active cycle. This is why the interrupt timing appears to differ only for a few instructions.

It's difficult to calculate the CPC instruction timing from the official Z80 timings, since you need to know exactly at which point in the instruction a read/write MREQ or IORQ occurs. This is partially documented in the timing section of the Z80 CPU User Manual, and I think every instruction uses those standard read/write/fetch/input/output cycles in a predictable way.

The latest version of the JEMU Z80 core is a complete rewrite and does not use any timing tables. The same Z80 core is shared with no modifications for ZX80/1, ZX Spectrum, VZ/Laser and CPC with the CPC version holding the WAIT signal.

A few years ago, I had a discussion with someone (cngsoft?) on CPC Zone regarding some code involving interrupts occurring at the wrong time during an LDIR under WinAPE. Unfortunately, this discussion was lost when CPC Zone closed.

ralferoo

Quote from: Executioner on 01:14, 17 October 13
I thought I'd clarify the way the CPC actually does its timing. The /WAIT signal to the processor is held low for 3 out of every 4 cycles of the 4MHz clock. These cycles are used for reading display data, leaving a single cycle out of every 4 where /WAIT is not held low and the Z80 is able to read or write to/from memory or I/O devices (I'll call it the active cycle). This means every instruction fetch (and memory read/write) is aligned to this active cycle.
At some point, I'll have to actually measure what a real gate array does, but I'm almost certain that it's not as simple as holding WAIT low for 3 out of 4 cycles and bringing it high every 4th cycle.

I don't have the details to hand right now, but from what I remember at the very least, WAIT needs to be high one clock earlier for an IO request than it does for a memory request or you'll end up stretching IO instructions differently to a real CPC. I also have some recollection of needing to consider M1 too as the read cycle occurs in a different cycle to a regular memory read, but it's been about a year since I looked at that code and I'm at work now without access to my source...

I remember my implementation was pretty simple; I think I only needed 2 bits of state plus 2 bits for the cycle counter.

I do remember the T80 core was buggy in this area as it samples WAIT at the wrong edge of the clock, which complicated things a little.

arnoldemu

I would really like to see exactly what happens for each instruction and get it written down.

Executioner

Quote from: ralferoo on 12:51, 17 October 13
At some point, I'll have to actually measure what a real gate array does, but I'm almost certain that it's not as simple as holding WAIT low for 3 out of 4 cycles and bringing it high every 4th cycle.

Interesting, but the timing in the latest JEMU code is based on the spec from the Z80 User Manual, and it passes all the instruction and interrupt timing checks in PlusTest with the following main cycle code:


  public void cycle() {
    switch(clock++ & 0x03) {
      case 0:
      case 2: z80.setWait(true); break;   // /WAIT held low: Z80 stalled

      case 1: z80.setWait(false); break;  // the one active cycle in four

      default: {  // case 3: /WAIT low again; clock the rest of the machine
        z80.setWait(true);
        gateArray.cycle();
        fdc.cycle();
        psg.cycle();
        if ((audioCount += audioAdd) >= AUDIO_TEST) {
          psg.writeAudio();
          audioCount -= AUDIO_TEST;
        }
      }
    }
  }


Bear in mind the GA implementation hasn't been updated to use the 4MHz (or eventually 16MHz) clock cycles, and I'm actually planning on allowing higher accuracy clock emulation using at least 4 methods (up, high, down, low) rather than just cycle.

gerald

Quote from: ralferoo on 12:51, 17 October 13
At some point, I'll have to actually measure what a real gate array does, but I'm almost certain that it's not as simple as holding WAIT low for 3 out of 4 cycles and bringing it high every 4th cycle.
Well, from traces I've archived, the GA does exactly that: holding WAIT low for 3 out of 4 cycles, whatever the instruction is, IO or RAM. :D

Quote from: ralferoo on 12:51, 17 October 13
I don't have the details to hand right now, but from what I remember at the very least, WAIT needs to be high one clock earlier for an IO request than it does for a memory request or you'll end up stretching IO instructions differently to a real CPC. I also have some recollection of needing to consider M1 too as the read cycle occurs in a different cycle to a regular memory read, but it's been about a year since I looked at that code and I'm at work now without access to my source...
You should think of it the other way around. The interaction between the GA and the Z80 is only related to the DRAM access multiplexing.

Basically, each microsecond is cut into three memory accesses:
   One single access for the Z80. It takes 6 cycles of the 16MHz clock, of which the first two have WAITn low.
   Two paged accesses for the GA. They take 10 cycles of the 16MHz clock, all with WAITn low.

In all these accesses, the CAS signal to the DRAM is toggled on either the rising or falling edge of the 16MHz clock.
Also, the Z80 RAM access is always initiated (RAS cycle), whatever the Z80 is doing: IO request or memory access, internal or external. It is only issued (CAS cycle) when a true access is done (i.e. base/internal expansion RAM).

The Z80 just gets synchronized by the WAITn signal. The GA is the master!
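One way to picture that, purely as a summary of the numbers above:

1us = 16 cycles of the 16MHz clock:
| Z80 access: 6 cycles (first 2 with WAITn low) | GA access: 5 cycles | GA access: 5 cycles |
                                                  (WAITn low for all 10 GA cycles)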

ralferoo

All this talk of memory, if only I could remember my conclusions... :D

I've just refreshed my memory (enough of these puns!) by looking at pages 12-16 of the Z80 UM at http://www.zilog.com/force_download.php?filepath=YUhSMGNEb3ZMM2QzZHk1NmFXeHZaeTVqYjIwdlpHOWpjeTk2T0RBdmRXMHdNRGd3TG5Ca1pnPT0=

So, if you examine this, every instruction starts on T1.

For instruction fetch (p13), WAIT must be asserted low in T2 (T1+1) to initiate lengthening the cycle beyond the normal 4. We don't want this to happen.

For memory fetch (p14), WAIT must be asserted low in T3 (T1+2) to initiate lengthening the cycle beyond the normal 3. We do want this to happen.

For IO request (p16), WAIT must be asserted low in TW (T1+2) to initiate lengthening the cycle beyond the normal 4. We don't want this to happen.

So, it's quite clear that we can't just assert WAIT in T1+2, because then both IO and memory requests will have extra wait states inserted. This would end up causing an additional 4 cycles stall on IO beyond what is actually seen on the Z80.

For example, consider IN A,(nn), which has 3 M-cycles comprising T-states of 4, 3 and 4, i.e. instruction fetch (4), port (3), IO (4). If we asserted WAIT low in T3 of the port fetch, then the IO would start on T1 as we'd expect. Because it too is lengthened, we'd expect to see effective T-states of 4,4,8 for a total of 4us. This doesn't happen on the CPC; it's still only 3us.

Executioner

Quote from: ralferoo on 20:34, 17 October 13
For example, consider IN A,(nn), which has 3 M-cycles comprising T-states of 4, 3 and 4, i.e. instruction fetch (4), port (3), IO (4). If we asserted WAIT low in T3 of the port fetch, then the IO would start on T1 as we'd expect. Because it too is lengthened, we'd expect to see effective T-states of 4,4,8 for a total of 4us. This doesn't happen on the CPC; it's still only 3us.

The IN A,(n) instruction is executed as follows on JEMU (you can download the source at jemu.sourceforge.net):

1. Fetch the instruction. This does T1, T2, then inserts wait states until /WAIT is high, then reads the byte from memory, then performs T3 and T4.

2. Executes another fetch for the port number. This is one T-State shorter than the op-code fetch, so it does T5, T6, wait (already aligned from previous fetch, so there is no wait here), followed by T7.

3. Reads the data from the port. This cycles for 3 T-states before checking /WAIT, so T8, T9 and T10. There have been four T-states (7-10), so once again there are no wait states, and there is one more T-state where the port is read, T11.

From this, you can see that if the clock is initially 2 cycles before /WAIT goes high, the whole IN A,(n) instruction takes exactly 11 T-States to execute. Because /WAIT is only low on the 3rd T-State of the 4, there will be one wait state inserted in the fetch of the next instruction, effectively making IN A,(n) 12 T-States on the CPC, or 3us.
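To summarise that walkthrough as a timeline (starting 2 cycles before the active cycle, as above):

opcode fetch : T1  T2  T3  T4    (aligned, no wait)
operand fetch: T5  T6  T7        (still aligned, no wait)
port read    : T8  T9  T10 T11   (no wait)
next fetch   : +1 wait state     -> effectively 12 T-states = 3us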
