Hi! :)
I am taking a look at one of the games from this week's Cepeceros podcast, namely Comando Tracer. This game uses double buffering and runs at 16.7 frames per second. In other words, it takes 3 frames to fill a screen buffer and make it visible. I have been wondering whether it would be possible to optimize the game enough to run at 25 frames per second.
The process of building a screen in either of the two buffers ($8000 and $C000) is:
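The frame-rate arithmetic behind that goal can be sketched as follows. This assumes the nominal 50 Hz display and a CPC frame of 312 scanlines of 64 µs each; the function names are only for illustration.

```python
# Assumption: nominal 50 Hz display, one frame = 312 scanlines x 64 us.
FRAME_US = 312 * 64  # 19,968 us per display frame


def fps(frames_per_update, refresh_hz=50):
    """Game frame rate when one screen update spans several display frames."""
    return refresh_hz / frames_per_update


def budget_us(frames_per_update):
    """Time available for one complete update (background copy + sprites)."""
    return frames_per_update * FRAME_US


for n in (3, 2):
    print(f"{n} frames per update: {fps(n):.1f} fps, budget {budget_us(n)} us")
```

The update currently takes somewhere between two and three frame times, so the background copy and the sprite code together would have to fit in under 39,936 µs to reach 25 fps.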
1.- The background is copied to the hidden screen buffer.
(https://www.cpcwiki.eu/forum/index.php?action=dlattach;attach=41661;type=preview)
This is the background for the first level (112x80 pixels in mode 0)
2.- All the sprites are drawn in the hidden screen buffer.
3.- Change the active screen buffer and restart.
The routine for copying the background is this:
ORG $624B
LD HL,$3335 ; Background buffer ( 56 bytes * 80 scanlines)
LD DE,$0204 ; Offset to the screen
LD A,($764A); Variable that signal if we are drawing in $8000 or $c000
XOR D
AND $C0
XOR D
LD D,A
LD BC,$1180
PUSH DE
REPT 56 ; a full scanline unrolled
LDI
ENDR
EX AF,AF'
POP DE
EX DE,HL
CALL $73DC ; Go to the next scanline in the screen memory
EX DE,HL
EX AF,AF' ;'
JP PE,$625C
RET
ORG $73DC
go_down_one_scanline_in_screen_memory
LD A,H
ADD 8
LD H,A
AND $38
RET NZ
PUSH DE
LD DE,$3FC0
SBC HL,DE
POP DE
RET
That takes around 26,542 microseconds (about 26.5 ms).
My first idea was to replace it with a typical stack-blasting routine:
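As a rough cross-check, the address step done by the routine at $73DC can be modelled in a few lines. This is only a sketch of the listing above, not the game's code; the 5 µs per LDI figure is the usual CPC timing, and the function name is illustrative.

```python
def next_scanline(addr):
    """Model of the routine at $73DC: address one scanline further down."""
    h = ((addr >> 8) + 8) & 0xFF        # LD A,H / ADD 8 / LD H,A
    addr = (h << 8) | (addr & 0xFF)
    if h & 0x38:                        # AND $38 / RET NZ: same character row
        return addr
    # Wrapped past the 8th scanline: SBC HL,DE with DE = $3FC0 (carry is clear)
    return (addr - 0x3FC0) & 0xFFFF


addr = 0x8204
for _ in range(9):
    print(hex(addr))
    addr = next_scanline(addr)

# Lower bound for the copy: the LDIs alone, at 5 us each on the CPC.
LDI_ONLY_US = 56 * 80 * 5
print(LDI_ONLY_US)
```

The LDIs alone account for 22,400 µs, so most of the measured time is the copy itself rather than the per-line overhead.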
LEN_COPY EQU 14
HIDDEN_BUFFER EQU $3335
ORG $7B56
dump_hidden_buffer
; Destination
LD HL,$0204 + LEN_COPY
LD A,($764A)
XOR H
AND $C0
XOR H
LD H,A
EXX
LD HL,HIDDEN_BUFFER ; Source
LD IYH,80 ; Number of scanlines
DI
LD (.sm_old_stack + 1),SP
.loop_transfer_scanline
REPT 3
; Get source pointer
LD SP,HL
; Update source for the next transfer
LD BC,LEN_COPY
ADD HL,BC
; Get 14 bytes
POP AF
POP DE
POP BC
EX AF,AF' ;'
EXX
POP AF
POP DE
POP BC
POP IX
; POP IY
; Get destination pointer
LD SP,HL
; Transfer first part
; PUSH IY
PUSH IX
PUSH BC
PUSH DE
PUSH AF
; Update destination for the next transfer
LD BC,LEN_COPY
ADD HL,BC
; Transfer second part
EXX
EX AF,AF' ;'
PUSH BC
PUSH DE
PUSH AF
ENDR
; Get source pointer
LD SP,HL
; Update source for the next transfer
LD BC,LEN_COPY
ADD HL,BC
; Get 14 bytes
POP AF
POP DE
POP BC
EX AF,AF' ;'
EXX
POP AF
POP DE
POP BC
POP IX
; POP IY
; Get destination pointer
LD SP,HL
; Transfer first part
; PUSH IY
PUSH IX
PUSH BC
PUSH DE
PUSH AF
; Update destination for the next transfer
LD BC,$0800 - (LEN_COPY * 3)
ADD HL,BC
LD A,H
AND $38
JR NZ,.transfer_last_bytes
LD BC,$C040
ADD HL,BC
; Transfer second part
.transfer_last_bytes
EXX
EX AF,AF' ;'
PUSH BC
PUSH DE
PUSH AF
DEC IYH
JP NZ,.loop_transfer_scanline
.sm_old_stack
LD SP,$0000 ; Restore SP
EI
RET
As you can see, everything is very straightforward (56 / 4 = 14 bytes per transfer), but the speed increase is minimal: only about 3,000 microseconds less, which is not enough to drop one frame. Because there is a free area of 1,194 bytes at $7B56 in the game, I even unrolled this stack blasting 8 times, to transfer in steps of 8 scanlines, but that only saves about 800 microseconds more, and the code size grows from 162 bytes to 997 (which still fits in the 1,194 free bytes).
Does anybody have a good idea for speeding up this background copy even more?
If you want to test this patch, you only need to change the CALL at $4838 from CALL $624B to CALL $7B56 and, of course, put the new routine at $7B56.
A good optimisation would be to restore the background only where the sprites were, instead of all of it.
This is not the maximum copy speed, because the stack instructions are slow on the CPC's Z80.
This has already been posted on the forum:
POP BC
LD (HL),C
INC L
LD (HL),B
INC L
As @abalore said, only restore where you need to!
For the "stack blast":
* Use IY as source pointer, IX as destination pointer, HL and HL' for the transfer.
* Do not copy in screen order, but 8 times the 10 lines (no "BC26" test).
You should be able to go under @McArti0's 4.5 microseconds per byte.
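The "8 times the 10 lines" order can be sketched like this. It assumes the $8000 buffer, the $0204 offset from the original routine, and 64 bytes per character row (which is what the $3FC0 correction in the original routine implies); the names are illustrative only.

```python
BASE = 0x8204        # first background byte in the $8000 buffer
LINE_STEP = 0x0800   # next scanline inside a character row
ROW_STEP = 0x0040    # next character row (assumed 64 bytes per row)


def block_order(base=BASE):
    """Yield (source_line, screen_address) for 8 blocks of 10 lines each."""
    for k in range(8):           # scanline within the character row
        for r in range(10):      # character row
            yield r * 8 + k, base + k * LINE_STEP + r * ROW_STEP


order = list(block_order())
for line, addr in order[:12]:
    print(line, hex(addr))
```

Inside a block the destination only ever moves by a constant $40, so no overflow test is needed per line. The price is that the source lines are visited as 0, 8, 16, ..., so the background buffer would have to be stored in that order or read with a stride.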
Quote from: McArti0 on 16:02, 12 March 24
This is not the maximum copy speed, because the stack instructions are slow on the CPC's Z80.
This has already been posted on the forum:
POP BC
LD (HL),C
INC L
LD (HL),B
INC L
Thanks McArti0!
LEN_COPY EQU 14
HIDDEN_BUFFER EQU $3335
ORG $7B56
dump_hidden_buffer
; Destination
LD HL,$0204
LD A,($764A)
XOR H
AND $C0
XOR H
LD H,A
LD IXH,80 ; Number of scanlines
DI
LD (.sm_old_stack + 1),SP
LD SP,HIDDEN_BUFFER ; Source
.loop_transfer_scanline
REPT 56 / 2
POP DE
LD (HL),E
INC L
LD (HL),D
INC L
ENDR
; Update destination for the next transfer
LD BC,$0800 - 56
ADD HL,BC
LD A,H
AND $38
JR NZ,.not_overflow
LD BC,$C040
ADD HL,BC
.not_overflow
DEC IXH
JP NZ,.loop_transfer_scanline
.sm_old_stack
LD SP,$0000 ; Restore SP
EI
RET
The routine is faster and more compact than the stack-blasting one: 21,612 microseconds vs 23,681 with the stack and 22,861 with the stack and unrolling... but sadly it is not enough to increase the frame rate in this game.
Quote from: m_dr_m on 22:24, 12 March 24
As @abalore said, only restore where you need to!
Yes, I know that it could be an option, but this game has a lot of sprites and bullets on screen simultaneously, so restoring might not be a winning strategy in this case.
Apart from that, this is a patch for an old game and there are no sources available. So, as always, I am trying to make a small patch that can improve the game overall; in this case, by speeding up this background copy to improve the frame rate.
Quote from: m_dr_m on 22:24, 12 March 24
For the "stack blast":
* Use IY as source pointer, IX as destination pointer, HL and HL' for the transfer.
* Do not copy in screen order, but 8 times the 10 lines (no "BC26" test).
You should be able to go under @McArti0 's 4.5 microseconds per byte.
My first test is not more efficient. The problem is still the need to update both stack pointers between every sequence of POPs and PUSHes, and updating IX and IY is a little slower than updating HL. It is really late here, but tomorrow I will give it another try. Thanks! :)
Can you REPEAT 112 times for 224 bytes, i.e. 4 lines?
In Space Chicken / Cyber Chicken the screen is never completely redrawn. The game engine remembers which parts of the screen were used, and only those parts get restored. This is especially effective with non-background GFX, but it is also a decent speed-up with background GFX.
Quote from: McArti0 on 09:51, 13 March 24
Can you REPEAT 112 times for 224 bytes, i.e. 4 lines?
Thanks @McArti0 :)
I have also used @m_dr_m's idea of copying 8 blocks of 10 lines, and I have unrolled the copy loop to transfer 560 bytes (56 bytes per scanline * 10 rows). Even though the code no longer fits in the free space, I was able to hack the game enough to keep it from crashing (in the final patch, I will change the background compressor to zx0, and that should free enough RAM for this hungrier routine).
Now the full copy takes 20,400 microseconds. I imagine that we cannot go faster than this, so I should take a look at the sprite drawing routines in case I can gain enough speed there.
Quote from: GUNHED on 13:31, 13 March 24
In Space Chicken / Cyber Chicken the screen is never completely redrawn. The game engine remembers which parts of the screen were used, and only those parts get restored. This is especially effective with non-background GFX, but it is also a decent speed-up with background GFX.
Thanks Stefan, but in this case we don't have the source, it is only a patch for having some fun ;)
56 x 80 x 4.5 µs = 20.160 ms; this is the limit.
Each R4 increment is 512 nops :P
Quote from: roudoudou on 15:11, 13 March 24
Each R4 increment is 512 nops :P
xDDDDD xDDDDD xDDDD
But if we are going to use dirty tricks, then I can do a really dirty one. I have built an extra upper ROM and put the stack-blasting code there, totally unrolled for both cases: transferring to $8xxx and to $Cxxx. It takes 14,746 bytes (ROM header included), and now the copy takes 18,560 microseconds, which is around 4.14 microseconds per byte xDDDD xDDD xDDD
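For comparison, the timings reported so far in this thread can be converted to a cost per byte. The figures are the measurements quoted above; only the labels are mine.

```python
BYTES = 56 * 80  # 4,480 background bytes per copy

# Measured copy times in microseconds, as reported in the posts above.
measured_us = {
    "original LDI loop": 26542,
    "stack blast": 23681,
    "stack blast, 8 lines unrolled": 22861,
    "POP + LD (HL)": 21612,
    "POP + LD (HL), 10-line blocks unrolled": 20400,
    "stack blast fully unrolled in ROM": 18560,
}

for name, us in measured_us.items():
    print(f"{name:40s} {us:6d} us  {us / BYTES:.2f} us/byte")
```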
Quote from: McArti0 on 16:02, 12 March 24
This is not the maximum copy speed, because the stack instructions are slow on the CPC's Z80.
This has already been posted on the forum:
POP BC
LD (HL),C
INC L
LD (HL),B
INC L
Where in the forum was speeding up copying discussed before? I do not understand the presented outcome and would like to read the discussion about this.
The given command sequence takes 5.5 NOPs per copied byte. Using the stack for reading and writing only needs 4.17 NOPs per byte.
In the example discussed here this makes a difference of 3700 NOPs. But this will still not be enough to increase the frame rate.
Maybe another dirty trick: In an average game play, are all parts of the screen used by sprites? Or are there gaps where no sprite ever appears? This would help to speed up restoring the background without complex book keeping by only caring for the busy parts of the screen.
Even if there is a candidate for such a gap and sometimes, really infrequently, some sprite strays inside this gap, maybe it is enough to restore the background only from time to time (e.g. 1 out of 8 scan lines on each frame). This will cause artifacts around the sprite, but it will not happen often.
On the CPC, every Z80 machine cycle is stretched to a multiple of 4 clocks.
POP DE - 3 M cycles, 10T (4,3,3), but on the CPC 4,4,4 = 12T
PUSH DE - 3 M cycles, 11T (5,3,3), but on the CPC 4,4,4,4 = 16T !!!!!! 11T to 16T !!!!
LDIR - 5 M cycles, 21T (4,4,3,5,5), but on the CPC 4,4,4,4,4,4 = 24T !!!!!! 21T to 24T !!!!
LDI - 4 M cycles, 16T (4,4,3,5), but on the CPC 4,4,4,4,4 = 20T !!! 16T to 20T.
POP BC 4,4,4
LD (HL),C 4,4
INC L 4
LD (HL),B 4,4
INC L 4
9 NOPs / 2 bytes = 4.5 µs per byte
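The same count can be written as a small table-driven sketch. The NOP values are the CPC timings listed in this post (one NOP = 4 clocks = 1 µs); the helper name is illustrative.

```python
# CPC timings in NOPs (1 NOP = 4 T-states = 1 us), as listed above.
NOPS = {"POP rr": 3, "PUSH rr": 4, "LD (HL),r": 2, "INC r": 1, "LDI": 5}


def us_per_byte(sequence, bytes_copied):
    """Total NOPs of an instruction sequence divided by the bytes it copies."""
    return sum(NOPS[ins] for ins in sequence) / bytes_copied


pop_ld = ["POP rr", "LD (HL),r", "INC r", "LD (HL),r", "INC r"]

print(us_per_byte(pop_ld, 2))              # POP + LD (HL) sequence
print(us_per_byte(["LDI"], 1))             # plain LDI, for comparison
print(us_per_byte(pop_ld, 2) * 56 * 80)    # whole 56 x 80 background
```

This reproduces the 4.5 µs per byte figure and the 20,160 µs floor for this sequence over the whole background, before any per-line overhead.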
I did take into account the special timing on the CPC, where the cycle counts are, more or less, rounded up to the next multiple of 4.
But I had read "INC HL" where the code has "INC L"; using INC L saves 8 cycles, or 2 NOPs, in this case.
With this, the given instruction sequence is compact and fast, almost as fast as using the stack for both reading and writing:
LD SP, source
POP AF : POP BC : POP DE : POP HL
EXX
POP BC : POP DE : POP HL
LD SP, target
PUSH HL : PUSH DE : PUSH BC
EXX
PUSH HL : PUSH DE : PUSH BC : PUSH AF
This takes 57 NOPs and transfers 14 bytes, reaching 4.07 µs per byte. But it needs more than four times as much space (22 bytes instead of 5 bytes) and contains hardcoded addresses, which means that it cannot easily be used in a loop.
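Those totals can be checked with a short tally of the sequence above. The per-instruction NOP counts and sizes are the standard CPC/Z80 values; the table layout is just for illustration.

```python
# (NOPs on the CPC, size in bytes) for each instruction used above.
COST = {
    "LD SP,nn": (3, 3),
    "POP rr": (3, 1),
    "PUSH rr": (4, 1),
    "EXX": (1, 1),
}

sequence = (
    ["LD SP,nn"] + ["POP rr"] * 4 + ["EXX"] + ["POP rr"] * 3
    + ["LD SP,nn"] + ["PUSH rr"] * 3 + ["EXX"] + ["PUSH rr"] * 4
)

nops = sum(COST[ins][0] for ins in sequence)
size = sum(COST[ins][1] for ins in sequence)
copied = 2 * sequence.count("POP rr")   # each POP reads two bytes

print(nops, size, copied, round(nops / copied, 2))
```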