Ideas for optimizing a hidden buffer copy ...

SyX · 15:39, 12 March 24

Hi!

I am taking a look at one of the games from this week's Cepeceros podcast, specifically Comando Tracer. This game uses double buffering and runs at 16.7 frames per second. In other words, it takes 3 frames to fill a screen buffer and make it visible. I have been wondering whether it would be possible to optimize the game enough to run at 25 frames per second.
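A quick sanity check of those frame rates, as a small Python sketch (Python is used only as a calculator here). It assumes the usual CPC display timing of 50 frames per second with 312 scanlines of 64 µs each; those figures are an assumption and are not stated in the post.

```python
# Assumed standard CPC display timing: 312 scanlines of 64 us each, 50 Hz.
US_PER_FRAME = 312 * 64          # us (NOPs) per display frame
DISPLAY_HZ = 50

def game_fps(frames_per_update: int) -> float:
    """Frame rate of a game that shows a new screen every n display frames."""
    return DISPLAY_HZ / frames_per_update

def time_budget_us(frames_per_update: int) -> int:
    """CPU time (in us) available to build one screen."""
    return frames_per_update * US_PER_FRAME

for n in (3, 2):
    print(f"{n} frames per update: {game_fps(n):.1f} fps, "
          f"budget {time_budget_us(n)} us")
```

Under that assumption, reaching 25 fps means the whole update (background copy, sprites and game logic) has to fit in two display frames instead of three.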

The process of building a screen in either of the two buffers ($8000 and $C000) is:
1.- The background is copied to the hidden screen buffer.

This is the background for the first level (112x80 pixels in mode 0)

2.- All the sprites are drawn in the hidden screen buffer.
3.- Change the active screen buffer and restart.

The routine for copying the background is this:


    ORG  $624B
    LD   HL,$3335 ; Background buffer ( 56 bytes * 80 scanlines) 
    LD   DE,$0204 ; Offset to the screen
    LD   A,($764A); Variable that signals whether we are drawing in $8000 or $C000
    XOR  D
    AND  $C0
    XOR  D
    LD   D,A
    LD   BC,$1180
    PUSH DE
    REPT 56      ; a full scanline unrolled 
        LDI
    ENDR 
    EX   AF,AF'
    POP  DE
    EX   DE,HL
    CALL $73DC   ; Go to the next scanline in the screen memory
    EX   DE,HL
    EX   AF,AF' ;'
    JP   PE,$625C
    RET
    
    ORG  $73DC
go_down_one_scanline_in_screen_memory
    LD   A,H
    ADD  8
    LD   H,A
    AND  $38
    RET  NZ
    PUSH DE
    LD   DE,$3FC0
    SBC  HL,DE
    POP  DE
    RET

That takes around 26,542 microseconds.
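To make the address arithmetic in the listing easier to follow, here is a small Python model of its two tricks: the XOR / AND $C0 / XOR sequence that picks the buffer, and the step to the next scanline. It is only a sketch: the flag values $C0 and $80 are illustrative (the listing only uses the top two bits of the variable at $764A), and the function names are made up for this example.

```python
def select_buffer(offset: int, flag: int) -> int:
    """Model of XOR D / AND $C0 / XOR D: the top two bits of the high byte
    come from `flag`, every other bit comes from `offset`."""
    d = offset >> 8
    a = flag
    a ^= d
    a &= 0xC0
    a ^= d
    return (a << 8) | (offset & 0xFF)

def next_scanline(hl: int) -> int:
    """Model of the routine at $73DC (character rows are $40 bytes apart)."""
    h = ((hl >> 8) + 8) & 0xFF
    hl = (h << 8) | (hl & 0xFF)
    if h & 0x38:                       # still inside the same character row
        return hl
    return (hl - 0x3FC0) & 0xFFFF      # wrapped: first line of the next row

start = select_buffer(0x0204, 0xC0)    # illustrative flag value
addr = start
for _ in range(8):                     # walk down one full character row
    addr = next_scanline(addr)
print(hex(start), hex(addr))
```

After eight steps the address has moved $40 bytes forward inside the same 16 KB buffer, which is exactly the case the `SBC HL,DE` with $3FC0 handles.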

So my first idea was to replace it with the typical stack-blasting routine:


LEN_COPY        EQU 14
HIDDEN_BUFFER   EQU $3335

    ORG   $7B56
dump_hidden_buffer
    ; Destination
    LD   HL,$0204 + LEN_COPY
    LD   A,($764A)
    XOR  H
    AND  $C0
    XOR  H
    LD   H,A
    EXX
    LD   HL,HIDDEN_BUFFER   ; Source

    LD   IYH,80     ; Number of scanlines
    DI
    LD   (.sm_old_stack + 1),SP    
.loop_transfer_scanline
    REPT 3
        ; Get source pointer
        LD   SP,HL

        ; Update source for the next transfer
        LD   BC,LEN_COPY
        ADD  HL,BC
    
        ; Get 14 bytes
        POP  AF
        POP  DE
        POP  BC
        EX   AF,AF' ;'
        EXX
        POP  AF
        POP  DE
        POP  BC
        POP  IX
;        POP  IY

        ; Get destination pointer
        LD   SP,HL

        ; Transfer first part
;        PUSH IY
        PUSH IX
        PUSH BC
        PUSH DE
        PUSH AF

        ; Update destination for the next transfer
        LD   BC,LEN_COPY
        ADD  HL,BC

        ; Transfer second part
        EXX      
        EX   AF,AF' ;'
        PUSH BC
        PUSH DE
        PUSH AF
    ENDR

    ; Get source pointer
    LD   SP,HL

    ; Update source for the next transfer
    LD   BC,LEN_COPY
    ADD  HL,BC
    
    ; Get 14 bytes
    POP  AF
    POP  DE
    POP  BC
    EX   AF,AF' ;'
    EXX
    POP  AF
    POP  DE
    POP  BC
    POP  IX
;    POP  IY

    ; Get destination pointer
    LD   SP,HL
    
    ; Transfer first part
;    PUSH IY
    PUSH IX
    PUSH BC
    PUSH DE
    PUSH AF
    
    ; Update destination for the next transfer
    LD   BC,$0800 - (LEN_COPY * 3)
    ADD  HL,BC
    LD   A,H
    AND  $38
    JR   NZ,.transfer_last_bytes
    LD   BC,$C040
    ADD  HL,BC

    ; Transfer second part
.transfer_last_bytes    
    EXX      
    EX   AF,AF' ;'
    PUSH BC
    PUSH DE
    PUSH AF
    
    DEC  IYH
    JP   NZ,.loop_transfer_scanline
.sm_old_stack
     LD SP,$0000     ; Restore SP
     EI
     RET

As you can see, everything is very straightforward (56 / 4 = 14), but the speed increase is minimal: only about 3,000 microseconds less, which is not enough to drop one frame. Because there is a zone at $7B56 with 1,194 free bytes in the game, I even unrolled this stack blasting 8 times more, to transfer in steps of 8 scanlines, but that only saves about 800 microseconds more, while the code size goes from 162 bytes to 997 (which still fits in the 1,194 free bytes).

Does anybody have a nice idea for accelerating this background copy even more?

If you want to test this patch, you only need to change the CALL at $4838 from CALL $624B to CALL $7B56 and, of course, put the new routine at $7B56.

abalore · 16:00, 12 March 24

A good optimisation would be to restore the background only where the sprites were, instead of all of it.
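A minimal Python sketch of that idea, for illustration only. It works on plain linear buffers rather than the CPC screen layout, the sprite rectangles are hypothetical, and `draw_sprite` is just a stand-in for whatever the game's sprite routine does.

```python
WIDTH, HEIGHT = 56, 80            # bytes per scanline x scanlines, as in the thread

def rect_spans(x, y, w, h):
    """Yield (start, end) byte ranges of a rectangle, clipped to the buffer."""
    for row in range(y, min(y + h, HEIGHT)):
        start = row * WIDTH + x
        yield start, start + min(w, WIDTH - x)

def draw_sprite(screen, rect, value=0xFF):
    """Stand-in for the game's sprite routine: fill the rectangle."""
    for start, end in rect_spans(*rect):
        screen[start:end] = bytes([value]) * (end - start)

def restore_dirty(screen, background, dirty_rects):
    """Copy back only what the sprites covered; returns the bytes copied."""
    copied = 0
    for rect in dirty_rects:
        for start, end in rect_spans(*rect):
            screen[start:end] = background[start:end]
            copied += end - start
    return copied

background = bytearray((i * 7) & 0xFF for i in range(WIDTH * HEIGHT))
screen = bytearray(background)
sprites = [(10, 20, 4, 16), (30, 50, 4, 16), (52, 70, 4, 16)]  # hypothetical (x, y, w, h)
for rect in sprites:
    draw_sprite(screen, rect)
copied = restore_dirty(screen, background, sprites)
print(copied, "bytes restored instead of", WIDTH * HEIGHT)
```

The saving depends entirely on how much of the screen the sprites cover, and the engine has to remember the rectangles from the previous frame of each buffer.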

McArti0 · 16:02, 12 March 24

This is not the maximum-speed copy, because stack instructions are slow on the CPC's Z80.

It has already been stated on the forum:
POP BC
LD (HL),C
INC L
LD (HL),B
INC L

m_dr_m · 22:24, 12 March 24

As @abalore said, only restore where you need to!

For the "stack blast":
* Use IY as the source pointer, IX as the destination pointer, and HL and HL' for the transfer.
* Do not copy in screen order, but 8 times the 10 lines (no "BC26" test).
You should be able to go under @McArti0's 4.5 microseconds per byte.
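The "8 times the 10 lines" order can be sketched in Python as follows. This is only a model of the addresses involved, assuming the $0204 offset and the $40-byte character rows seen in the listings earlier in the thread; the function names are made up for the example.

```python
LINE_STEP = 0x0800   # next scanline inside a character row
ROW_STEP = 0x0040    # next character row (as the $C040 constant implies)
WIDTH = 56           # bytes copied per scanline

def screen_order(base):
    """(screen address, source offset) pairs in display order."""
    return [(base + row * ROW_STEP + line * LINE_STEP, (row * 8 + line) * WIDTH)
            for row in range(10) for line in range(8)]

def grouped_order(base):
    """The same pairs as 8 passes of 10 lines each."""
    return [(base + line * LINE_STEP + row * ROW_STEP, (row * 8 + line) * WIDTH)
            for line in range(8) for row in range(10)]

base = 0xC204        # $0204 offset inside the $C000 buffer
grouped = grouped_order(base)
first_pass = [addr for addr, _ in grouped[:10]]
print([hex(a) for a in first_pass])
```

Inside each pass the destination always advances by a constant $40, so the per-line overflow test disappears. The price is that the source lines are then 8 x 56 bytes apart, so either the source pointer has to jump accordingly or the background has to be stored in that interleaved order.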

SyX · 03:17, 13 March 24

Quote from: McArti0 on 16:02, 12 March 24
This is not the maximum-speed copy, because stack instructions are slow on the CPC's Z80.

It has already been stated on the forum:
POP BC
LD (HL),C
INC L
LD (HL),B
INC L

Thanks McArti0!


LEN_COPY        EQU 14
HIDDEN_BUFFER   EQU $3335

    ORG   $7B56
dump_hidden_buffer
    ; Destination
    LD   HL,$0204
    LD   A,($764A)
    XOR  H
    AND  $C0
    XOR  H
    LD   H,A

    LD   IXH,80     ; Number of scanlines
    DI
    LD   (.sm_old_stack + 1),SP    
    LD   SP,HIDDEN_BUFFER   ; Source
.loop_transfer_scanline
    REPT 56 / 2
        POP  DE
        LD   (HL),E
        INC  L
        LD   (HL),D
        INC  L
    ENDR
    
    ; Update destination for the next transfer
    LD   BC,$0800 - 56
    ADD  HL,BC
    LD   A,H
    AND  $38
    JR   NZ,.not_overflow
    LD   BC,$C040
    ADD  HL,BC
.not_overflow
    DEC  IXH
    JP   NZ,.loop_transfer_scanline
.sm_old_stack
     LD SP,$0000     ; Restore SP
     EI
     RET

The routine is faster and more compact than the stack-blasting one: 21,612 microseconds vs 23,681 with the stack and 22,861 with the stack and unrolling... but sadly it is not enough to increase the frame rate of this game.

Quote from: m_dr_m on 22:24, 12 March 24
As @abalore said, only restore where you need to!

Yes, I know that could be an option, although this game has a lot of sprites and bullets on screen simultaneously, so restoring may not be a winning strategy in this case.

But aside from that, this is a patch for an old game and there are no sources available. So, as always, I am trying to make a small patch that can improve the game overall. In this case, that means speeding up this background copy to improve the frame rate.

Quote from: m_dr_m on 22:24, 12 March 24
For the "stack blast":
* Use IY as the source pointer, IX as the destination pointer, and HL and HL' for the transfer.
* Do not copy in screen order, but 8 times the 10 lines (no "BC26" test).
You should be able to go under @McArti0's 4.5 microseconds per byte.

My first test is not more efficient. The problem is still the need to update both stack pointers between every sequence of POPs and PUSHes, and updating IX and IY is a little slower than updating HL. It is really late here, but tomorrow I will give it another try. Thanks!

McArti0 · 09:51, 13 March 24

Can you REPEAT 112 for 224 bytes? 4 lines?

GUNHED · 13:31, 13 March 24

In Space Chicken / Cyber Chicken the screen is never redrawn completely. The game engine remembers which parts of the screen were used, and those parts get restored. This is especially effective with non-background GFX, but it is also a decent speed-up with background GFX.

SyX · 13:47, 13 March 24

Quote from: McArti0 on 09:51, 13 March 24
Can you REPEAT 112 for 224 bytes? 4 lines?

Thanks @McArti0

I have also used @m_dr_m's idea of copying 8 blocks of 10 lines, and I have unrolled the copy loop to transfer 560 bytes (56 bytes per scanline * 10 rows). Even though the code no longer fits in the free space, I was able to hack the game enough for it not to crash (in the final patch, I will change the background compressor to zx0, and that should be enough to free RAM for this hungrier routine).

Now the full copy takes 20,400 microseconds. I imagine that we cannot go faster than this, so I should take a look at the sprite drawing routines in case I can gain enough speed there.

Quote from: GUNHED on 13:31, 13 March 24
In Space Chicken / Cyber Chicken the screen is never redrawn completely. The game engine remembers which parts of the screen were used, and those parts get restored. This is especially effective with non-background GFX, but it is also a decent speed-up with background GFX.

Thanks Stefan, but in this case we don't have the source, it is only a patch for having some fun

McArti0 · 14:29, 13 March 24

56 x 80 x 4.5 us = 20,160 us (20.16 ms) - this is the limit.

roudoudou · 15:11, 13 March 24

Each R4 increment is 512 nops

SyX · 15:09, 14 March 24

Quote from: roudoudou on 15:11, 13 March 24
Each R4 increment is 512 nops

xDDDDD xDDDDD xDDDD

But if we are going to use dirty tricks, then I can do a really dirty one. I have built an extra upper ROM and put the stack-blasting code there, totally unrolled for both cases (transferring to $8xxx and to $Cxxx). It takes 14,746 bytes (ROM header included), and now the copy takes 18,560 microseconds, which means around 4.14 microseconds per byte xDDDD xDDD xDDD

lightforce6128 · 21:22, 16 March 24

Quote from: McArti0 on 16:02, 12 March 24
This is not the maximum-speed copy, because stack instructions are slow on the CPC's Z80.

It has already been stated on the forum:
POP BC
LD (HL),C
INC L
LD (HL),B
INC L

Where in the forum was speeding up copying discussed before? I do not understand the presented outcome and would like to read the discussion about this.

The given command sequence takes 5.5 NOPs per copied byte. Using the stack for reading and writing only needs 4.17 NOPs per byte.

In the example discussed here this makes a difference of 3700 NOPs. But this will still not be enough to increase the frame rate.

lightforce6128 · 21:28, 16 March 24

Maybe another dirty trick: in average gameplay, are all parts of the screen used by sprites? Or are there gaps where no sprite ever appears? This would help to speed up restoring the background without complex bookkeeping, by only caring for the busy parts of the screen.

Even if there is a candidate for such a gap and sometimes, really infrequently, some sprite strays inside this gap, maybe it is enough to restore the background only from time to time (e.g. 1 out of 8 scan lines on each frame). This will cause artifacts around the sprite, but it will not happen often.
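One way to picture this, as a rough Python sketch. The gap band used here is purely hypothetical (the thread does not say where sprites appear), and the helper name is made up for the example.

```python
HEIGHT = 80                      # scanlines in the background, as in the thread

def lines_to_restore(frame, gap):
    """Scanlines to restore on this frame: every busy line, plus one gap
    line out of 8 (a different eighth on each frame)."""
    return [y for y in range(HEIGHT)
            if y not in gap or y % 8 == frame % 8]

gap = range(24, 56)              # hypothetical band where sprites rarely appear
per_frame = [lines_to_restore(f, gap) for f in range(8)]
print([len(lines) for lines in per_frame])
```

Every gap line still gets refreshed once every 8 frames, so a stray sprite leaves a trail for at most that long.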

McArti0 · 21:48, 16 March 24

On the CPC, Z80 instruction timings are stretched to multiples of 4 clocks.

POP DE - 3 M cycles, 10T (4,3,3), but on the CPC 4,4,4

PUSH DE - 3 M cycles, 11T (5,3,3), but on the CPC 4,4,4,4 !!!!!! 11T to 16T !!!!

LDIR - 5 M cycles, 21T (4,4,3,5,5), but on the CPC 4,4,4,4,4,4 !!!!!! 21T to 24T !!!!

LDI - 4 M cycles, 16T (4,4,3,5), but on the CPC 4,4,4,4,4 !!! 16T to 20T.

POP BC 4,4,4
LD (HL),C 4,4
INC L 4
LD (HL),B 4,4
INC L 4

9 x 4 clocks = 9 us for 2 bytes, so 4.5 us per byte
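The same tally as a small Python sketch, using the CPC timings given in this thread (1 NOP = 4 clocks = 1 µs). The dictionary keys are just labels for this example.

```python
# CPC timings in NOPs (1 NOP = 1 us), as quoted in the thread.
NOPS = {"POP rr": 3, "PUSH rr": 4, "LD (HL),r": 2, "INC r": 1, "LDI": 5}

# POP BC / LD (HL),C / INC L / LD (HL),B / INC L copies 2 bytes.
pop_store = ["POP rr", "LD (HL),r", "INC r", "LD (HL),r", "INC r"]
us_per_byte = sum(NOPS[i] for i in pop_store) / 2

total_bytes = 56 * 80            # background size used in the thread
print("POP + LD:", us_per_byte, "us/byte,", us_per_byte * total_bytes, "us in total")
print("LDI:     ", NOPS["LDI"], "us/byte,", NOPS["LDI"] * total_bytes, "us in total")
```

These totals only count the copy instructions themselves; the per-scanline pointer updates come on top.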

lightforce6128 · 22:00, 16 March 24

I considered the special timing on the CPC, where the cycle counts are, more or less, rounded up to the next number divisible by 4.

But I read "INC HL" instead of "INC L"; the latter saves 8 cycles, or 2 NOPs, in this case.

With this, the given command sequence is compact and fast, almost as fast as using the stack for reading and writing:


LD SP, source
POP AF : POP BC : POP DE : POP HL
EXX
POP BC : POP DE : POP HL
LD SP, target
PUSH HL : PUSH DE : PUSH BC
EXX
PUSH HL : PUSH DE : PUSH BC : PUSH AF

This takes 57 NOPs and transfers 14 bytes, reaching 4.07 µs per byte. But it needs more than four times as much space (22 bytes instead of 5 bytes) and contains hardcoded addresses, which means that it cannot easily be used in a loop.
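Those figures can be checked with a short Python tally of the sequence above. The NOP counts are the CPC timings given earlier in the thread (POP 3, PUSH 4), plus LD SP,nn at 3 NOPs and EXX at 1 NOP; the sizes are the standard Z80 opcode lengths.

```python
NOPS = {"LD SP,nn": 3, "POP": 3, "PUSH": 4, "EXX": 1}   # CPC timings in NOPs
SIZE = {"LD SP,nn": 3, "POP": 1, "PUSH": 1, "EXX": 1}   # opcode sizes in bytes

sequence = (["LD SP,nn"] + ["POP"] * 4 + ["EXX"] + ["POP"] * 3 +
            ["LD SP,nn"] + ["PUSH"] * 3 + ["EXX"] + ["PUSH"] * 4)

nops = sum(NOPS[i] for i in sequence)
code_bytes = sum(SIZE[i] for i in sequence)
bytes_copied = 2 * sequence.count("POP")

print(nops, "NOPs,", code_bytes, "bytes of code,", bytes_copied, "bytes copied")
print(round(nops / bytes_copied, 2), "us per byte")
```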
