CPCWiki forum

General Category => Programming => Topic started by: redbox on 10:17, 04 June 14

Title: Optimisation Corner
Post by: redbox on 10:17, 04 June 14
Thought I'd start a new thread to discuss ways to optimise routines.

Handy that, as I have one to start it off...  :D

Here is my current tile drawing routine:


;; FnDrawTile

;; Description: Draw a tile
;; Conditions:  Screen address must be mutiple of 8 (character height)
;;              Screen must be 32x32 character screen (INCs are 8-bit)
;;              Tile data must be aligned to boundary (INCs are 8-bit)
;;              Tile size is set to 16x16 (4 bytes by 16 lines)
;; Notes:       401 NOPs

;; Entry:       HL - screen address, DE - tile data
;; Exit:        None
;; Corrupt:     A, BC, DE, HL
       
    ld a,(de) : ld (hl),a : inc e : inc l   ; line 1
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    set     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 2
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e

    set    4,h

    ld a,(de) : ld (hl),a : inc e : inc l    ; line 4
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    res     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 3
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e
       
    set    5,h

    ld a,(de) : ld (hl),a : inc e : inc l    ; line 7
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    set     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 8
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e

    res    4,h

    ld a,(de) : ld (hl),a : inc e : inc l    ; line 6
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    res     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 5
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e

    ld     bc,&e040                ; HL here is e.g. &E000, we want to...
    add    hl,bc                   ; ...get to &C040, so we add &E040

    ld a,(de) : ld (hl),a : inc e : inc l    ; line 9
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    set     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 10
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e

    set    4,h

    ld a,(de) : ld (hl),a : inc e : inc l    ; line 12
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    res     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 11
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e

    set    5,h

    ld a,(de) : ld (hl),a : inc e : inc l    ; line 15
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    set     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 16
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e

    res    4,h

    ld a,(de) : ld (hl),a : inc e : inc l    ; line 12
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    res     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 13
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a
                               
    ret


Currently I have to draw up to 4 of these per sprite (20x16 size) per frame using the dirty tilemap method, so any way of making it faster..?

I was thinking of using the stack and SP, but am unsure of how the tile data order would differ.



Title: Re: Optimisation Corner
Post by: ralferoo on 12:38, 04 June 14
The ones with "inc l" and "inc e" you might as well use ldir for - it's 5us instead of 6us for the separate instructions, obviously corrupts C and maybe B.

Using the stack is fine, but you have to consider certain things.... You'd probably want to use the stack as the source, so then you can just pop and carry on doing manipulations on HL (because setting the SP requires going through HL or an explicit load). Using HL as the destination also allows you to do POP BC ; LD (HL),C ; INC L ; LD (HL),B ; INC L

The order is best experimented with. But essentially, BC when popped, C is the lowest address, B is the next address. If you're popping, the order increases upwards, so is more "normal".

Pushing to the screen is a good technique if you need to keep interrupts enabled most of the time, as long as you know that your interrupt handler won't write too much onto the stack. So, e.g. once per line, you could update SP to the next bit of data to write to, enable interrupts briefly whilst you load the next data, then disable them. That way your interrupt routine will only overwrite screen data that's about to be redrawn anyway. But I suspect changing SP too frequently for the next destination address will cancel out any gains you get. This technique is best for things like clearing or copying large areas.


Title: Re: Optimisation Corner
Post by: redbox on 13:29, 04 June 14
Quote from: ralferoo on 12:38, 04 June 14
Using HL as the destination also allows you to do POP BC ; LD (HL),C ; INC L ; LD (HL),B ; INC L
The order is best experimented with. But essentially, BC when popped, C is the lowest address, B is the next address. If you're popping, the order increases upwards, so is more "normal".

That's quite a speed saving.

What's the best way to store the SP and then restore it again at the end of the routine?  I assume I use a DI at the start and then EI at the end too...?
Title: Re: Optimisation Corner
Post by: Urusergi on 14:52, 04 June 14
Quote from: redbox on 13:29, 04 June 14
What's the best way to store the SP and then restore it again at the end of the routine?

Maybe...

           LD HL,0
           ADD HL,SP

          ............

          LD SP,HL

Title: Re: Optimisation Corner
Post by: SyX on 14:54, 04 June 14
Took directly from pacman code:

; DE : Screen address
; HL : Tile address
draw_tile
    DI
    LD   (.sm_old_stack + 1),SP
    LD   SP,HL
    EX   DE,HL

    ; 1º Scanline
    POP  DE
    LD   (HL),E
    INC  L
    LD   (HL),D
    SET  3,H
    ; 2º Scanline
    POP  DE
    LD   (HL),D
    DEC  L
    LD   (HL),E
    SET  4,H
    ; 4º Scanline
    POP  DE
    LD   (HL),E
    INC  L
    LD   (HL),D
    RES  3,H
    ; 3º Scanline
    POP  DE
    LD   (HL),D
    DEC  L
    LD   (HL),E
    SET  5,H
    ; 7º Scanline
    POP  DE
    LD   (HL),E
    INC  L
    LD   (HL),D
    RES  4,H
    ; 5º Scanline
    POP  DE
    LD   (HL),D
    DEC  L
    LD   (HL),E
    SET  3,H
    ; 6º Scanline
    POP  DE
    LD   (HL),E
    INC  L
    LD   (HL),D
    SET  4,H
    ; 8º Scanline
    POP  DE
    LD   (HL),D
    DEC  L
    LD   (HL),E

.sm_old_stack
    LD   SP,$0000
    EI
Title: Re: Optimisation Corner
Post by: redbox on 19:43, 04 June 14
Quote from: SyX on 14:54, 04 June 14
Took directly from pacman code:

Very nice, always like a bit of self modifying code as well  :)



Title: Re: Optimisation Corner
Post by: redbox on 20:00, 04 June 14
Okay, have put this together and here is our first optimised routine:


;; FnDrawTileStack

;; Description: Draw a tile using the stack
;; Conditions:  Screen address must be mutiple of 8 (character height)
;;              Screen must be 32x32 character screen (INCs are 8-bit)
;;              Tile size is set to 16x16 (4 bytes by 16 lines)
;; Notes:       320 NOPs

;; Entry:       DE - screen address, HL - tile data
;; Exit:        None
;; Corrupt:     BC, DE, HL
       
        di
       
        ld      (FnDrawTileStack_OldStack + 1),sp
        ld      sp,hl
        ex      hl,de
       
        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 1
        pop de : ld (hl),e : inc l : ld (hl),d
       
        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 2
        pop de : ld (hl),e : dec l : ld (hl),d

        set    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 4
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 3
        pop de : ld (hl),e : dec l : ld (hl),d
       
        set    5,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 7
        pop de : ld (hl),e : inc l : ld (hl),d

        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 8
        pop de : ld (hl),e : dec l : ld (hl),d

        res    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 6
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 5
        pop de : ld (hl),e : dec l : ld (hl),d

        ld     bc,&e040                ; HL here is e.g. &E000, we want to...
        add    hl,bc                   ; ...get to &C040, so we add &E040

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 9
        pop de : ld (hl),e : inc l : ld (hl),d

        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 10
        pop de : ld (hl),e : dec l : ld (hl),d

        set    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 12
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 11
        pop de : ld (hl),e : dec l : ld (hl),d

        set    5,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 15
        pop de : ld (hl),e : inc l : ld (hl),d

        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 16
        pop de : ld (hl),e : dec l : ld (hl),d

        res    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 14
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 13
        pop de : ld (hl),e : dec l : ld (hl),d

FnDrawTileStack_OldStack:
        ld      sp,&0000
        ei
                                   
        ret


By my calculation, this routine takes only 80% of the time of the original.

"That, as they say, is speedy"  :D

Title: Re: Optimisation Corner
Post by: redbox on 15:36, 05 June 14
I have another problem which could be somewhat optimised...

I am double buffering using &8000 and &c000 for the two screens (so I can keep &4000 to &7fff free to bank in via C4, C5 etc).

I therefore have two variables holding the high bytes (&80 and &c0) for my hidden and visible screens to use in my drawing routines.  Swapping them over is easy as I just XOR &40 both of them.  Converting them to CRTC base is also easy to just doing SRL A twice (&80 -> &20 and &c0 -> &30).

Problem I have is marking the tiles in my dirty tilemap and relating them to which screen needs to be drawn.

If I have &80 in the tilemap this equates to %10XXXXXX.  And &c0 equates to %11XXXXXX.  So when I scan the tilemap, find &c0, draw the tile, then reset the &c0 I could also erase the &80 marker too.

Is there any smart bitwise way to do this?  Or am I stuck having to use two different buffers for the two screens (which is, of course, massively not optimised)...?
Title: Re: Optimisation Corner
Post by: Axelay on 00:26, 06 June 14
Are you able to sub &40 from the byte at the start of the process so the two chek bytes become &40 and &80 and only use a bit each?
Title: Re: Optimisation Corner
Post by: redbox on 07:50, 06 June 14
Quote from: Axelay on 00:26, 06 June 14
Are you able to sub &40 from the byte at the start of the process so the two chek bytes become &40 and &80 and only use a bit each?

Yes had thought about this so I can store the &80 and &40 as markers in the tilemap (%10XXXXXX and %01XXXXXX respectively).

So now I will have the ScrHidden variable which is either &c0 or &80.  I subtract &40 so it becomes &80 or &40.

I then have TilemapAttrib variable which can be &80 = %10000000 (draw to &c0), &40 = %01000000 (draw to &80) or &c0 = %11000000 (draw to both).

How do I work out which one to draw to, remove the bit switch and leave the other intact (for the next frame) if it's still there...?
Title: Re: Optimisation Corner
Post by: Axelay on 09:05, 06 June 14
It's not as simple as just XORing off the &40 or &80?


Or, could you do a single check check for whether it's &80 or &40 at the beginning of the process, and from that point just use BIT 6/7 checks and RES 6/7 to clear them, maybe even the same loop with some self modifying code to alter the BIT & RES instructions?
Title: Re: Optimisation Corner
Post by: arnoldemu on 09:16, 06 June 14
If you're storing a bit per screen anyway, why not have two of these buffers?

keeps the code simpler, same storage space (assuming you're also packing these into a byte).
Title: Re: Optimisation Corner
Post by: redbox on 11:23, 06 June 14
Quote from: Axelay on 09:05, 06 June 14
It's not as simple as just XORing off the &40 or &80?

Yes, that would remove it from the TilemapAttrib and leave the other one intact if it is there.

But how do I detect it (so I can print a tile)...?  Using XOR just sets the Z flag, which obviously won't be true if both bits are set.

I can't use a CP &80 or CP &40 because both bits might be set which means it's &c0.

At the moment I have something like this, which just works if &80 is found:


    ld    h,DtBufferTilemapAttrib/256       
    ld    l,&00               
    ld    a,(hl)
    bit    7,a
    call    nz,FnDrawTilemap_Draw


But what I need is something like this:


    ld    h,DtBufferTilemapAttrib/256       
    ld    l,&00               
    ld    a,(hl)
    ... compare with (ScrHidden-&40) ...
    call   z,FnDrawTilemap_Draw          ; if true, draw the tile
    xor    A with (ScrHidden-&40)        ; remove just this switch
    ld     (hl),a                        ; store it back


Quote from: arnoldemu on 09:16, 06 June 14
If you're storing a bit per screen anyway, why not have two of these buffers?
keeps the code simpler, same storage space (assuming you're also packing these into a byte).

Because I want to optimise it  ;)
Title: Re: Optimisation Corner
Post by: ralferoo on 12:37, 06 June 14
Quote from: redbox on 11:23, 06 June 14
Quote from: arnoldemu on 09:16, 06 June 14
If you're storing a bit per screen anyway, why not have two of these buffers?
Because I want to optimise it  ;)
I'm with arnoldemu on this. If you're double buffering, I think it's best to maintain two lists, for what you've drawn on each screen and how to erase it and consider each screen as if it was single buffered, just with double the timestep between each... i.e. don't even consider the other screen.

Otherwise, you'll find yourself getting very confused about what you're doing, you'll have to trawl through a big list and check these bits, etc.

In my experience, optimising for old platforms is about simplifying the main task, not adding more complexity to it! :)
Title: Re: Optimisation Corner
Post by: arnoldemu on 12:56, 06 June 14
Quote from: redbox on 11:23, 06 June 14
Because I want to optimise it  ;)

if you have two buffers it becomes more simple.

you can detect 8 horizontal tiles are dirty or not with a simple


ld a,(hl)
or a


then you can shift the bits to find which are actually dirty.

the only problem you may have is when hardware scrolling. here you want to dirty the tiles you just drawn so that they are also redrawn on the new back screen and vice versa.

Title: Re: Optimisation Corner
Post by: redbox on 13:34, 06 June 14
Quote from: arnoldemu on 12:56, 06 June 14
if you have two buffers it becomes more simple.

Okay, I'll use two buffers then (although I was tempted to use Axelay's self-modifying version!).

At the moment, I'm accessing buffer one like this:


    ld    h,MyBuffer_Scr1/256       
    ld    l,element
    ld    a,(hl)


Obviously if I have MyBuffer_Scr1 and MyBuffer_Scr2, I'll need a pointer (MyBuffer_Pointer) like this:


MyBuffer_Pointer:
    defw MyBuffer_Scr1


What's the fastest way to swap the pointer between MyBuffer_Scr1 and MyBuffer_Scr2 each frame? 

And is the best way to access the pointer and load it into H this: LD A,(MyBuffer_Pointer/256) : LD H,A...?
Title: Re: Optimisation Corner
Post by: Axelay on 13:37, 06 June 14
Quote from: redbox on 11:23, 06 June 14
Yes, that would remove it from the TilemapAttrib and leave the other one intact if it is there.

But how do I detect it (so I can print a tile)...?  Using XOR just sets the Z flag, which obviously won't be true if both bits are set.

I can't use a CP &80 or CP &40 because both bits might be set which means it's &c0.

At the moment I have something like this, which just works if &80 is found:




Hmm, well the obvious one is to use AND instead, but that destroys the rest of A so I'm assuming you want to preserve the rest of A or you'd be using that?  Tried to think of something that uses carry instead, but nothing particularly good has leapt out at me.  If it's the speed of the comparison, then you could use two steps on your base value of &c0 and &80.  SUB &40 and then RLCA, so you are using bit 0 and 7 rather than 7 and 6, then your loop needs either RLCA or RRCA and call c,FnDrawTilemap_Draw after which will be a fast check, but if the point of preserving the other bits was that they are holding information you want then they'll be in different positions for each buffer so that's only really moving the problem elsewhere.


Um.... what about.... 
ld    h,DtBufferTilemapAttrib/256
ld    l,&00               
bit    7,(hl) ; or 6, could use self mod code to change
call   nz,FnDrawTilemap_Draw          ; if true, draw the tile

So if the byte in (hl) is storing other info you need, you move the loading of it into A into the  FnDrawTilemap_Draw routine, along with the writing back of (hl) if it isnt already. 

Title: Re: Optimisation Corner
Post by: redbox on 14:08, 06 June 14
Quote from: Axelay on 13:37, 06 June 14
Hmm, well the obvious one is to use AND instead, but that destroys the rest of A so I'm assuming you want to preserve the rest of A or you'd be using that?

Ideally yes, because then I have the option to put the switches into the tilemap, e.g. use %00XXXXXX and then I can still store 64 tiles and the switches.

Quote from: Axelay on 13:37, 06 June 14
Um.... what about.... 
bit    7,(hl) ; or 6, could use self mod code to change

Or I guess I could just use the self-modifying code and then I wouldn't have to worry about turning &c0/&80 into anything - I could just use the switches above (i.e. %00XXXXXX) and then something like this:


        ld      a,(DtGlobalVar_BaseHidden)      ; get screen base we're writing too
        cp      &c0                             ; is it &c0? then we want to BIT/RES bit 7
        jr      z,screen_is_c0
                                                ; else it's &80 so we want to BIT/RES bit 6
        ld      hl,&77cb                        ; bit 6,a
        ld      (FnDrawTilemap_Loop_Bit),hl
        ld      hl,&b7cb                        ; res 6,a
        ld      (FnDrawTilemap_Loop_Res),hl
        jr      start_it                   

screen_is_c0:       
        ld      hl,&7fcb                        ; bit 7,a
        ld      (FnDrawTilemap_Loop_Bit),hl
        ld      hl,&bfcd                        ; res 7,a
        ld      (FnDrawTilemap_Loop_Res),hl               

start_it:
       .....


What do you think?
Title: Re: Optimisation Corner
Post by: Axelay on 14:56, 06 June 14
Yes, that.  :)  Although, the CB in the BIT/RES instructions never changes & the difference in the part of the instruction that does change is identical, so for a tiny optimisation I think something like this should be doing the same thing:




        ld      a,(DtGlobalVar_BaseHidden)      ; get screen base we're writing too
        cp      &c0                             ; is it &c0? then we want to BIT/RES bit 7
        jr      z,screen_is_c0
                                                ; else it's &80 so we want to BIT/RES bit 6
        ld      a,&77                        ; bit 6,a
        jr      start_it                   

screen_is_c0:       
        ld      a,&7f                        ; bit 7,a

start_it:
         ld      (FnDrawTilemap_Loop_Bit+1),a
         xor a,&c0 ; change &7x to &bx
         ld      (FnDrawTilemap_Loop_Res+1),a
       .....



Not sure if it's any faster, but it saves a couple of bytes at least.
Title: Re: Optimisation Corner
Post by: redbox on 18:22, 06 June 14
Quote from: Axelay on 14:56, 06 June 14
Yes, that.  :)  Although, the CB in the BIT/RES instructions never changes & the difference in the part of the instruction that does change is identical

Sounds good to me and is the option I've chosen (I'm a sucker for some self modifying code  ;) ).

I can always fall back on using the two buffers should things get complicated later on...


Title: Re: Optimisation Corner
Post by: redbox on 20:04, 06 June 14
...which leads me onto:

Quote from: arnoldemu on 12:56, 06 June 14
you can detect 8 horizontal tiles are dirty or not with a simple


ld a,(hl)
or a


then you can shift the bits to find which are actually dirty.

That's interesting as OR A is very quick. Interesting that a CP NN is faster than a BIT N,r too..

What do you mean by "shift the bits"? Do you mean compact the tilemap bitwise somehow?

Title: Re: Optimisation Corner
Post by: Rhino on 10:55, 07 June 14
Quote from: redbox on 20:00, 04 June 14
Okay, have put this together and here is our first optimised routine:


;; FnDrawTileStack

;; Description: Draw a tile using the stack
;; Conditions:  Screen address must be mutiple of 8 (character height)
;;              Screen must be 32x32 character screen (INCs are 8-bit)
;;              Tile size is set to 16x16 (4 bytes by 16 lines)
;; Notes:       320 NOPs

;; Entry:       DE - screen address, HL - tile data
;; Exit:        None
;; Corrupt:     BC, DE, HL
       
        di
       
        ld      (FnDrawTileStack_OldStack + 1),sp
        ld      sp,hl
        ex      hl,de
       
        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 1
        pop de : ld (hl),e : inc l : ld (hl),d
       
        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 2
        pop de : ld (hl),e : dec l : ld (hl),d

        set    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 4
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 3
        pop de : ld (hl),e : dec l : ld (hl),d
       
        set    5,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 7
        pop de : ld (hl),e : inc l : ld (hl),d

        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 8
        pop de : ld (hl),e : dec l : ld (hl),d

        res    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 6
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 5
        pop de : ld (hl),e : dec l : ld (hl),d

        ld     bc,&e040                ; HL here is e.g. &E000, we want to...
        add    hl,bc                   ; ...get to &C040, so we add &E040

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 9
        pop de : ld (hl),e : inc l : ld (hl),d

        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 10
        pop de : ld (hl),e : dec l : ld (hl),d

        set    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 12
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 11
        pop de : ld (hl),e : dec l : ld (hl),d

        set    5,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 15
        pop de : ld (hl),e : inc l : ld (hl),d

        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 16
        pop de : ld (hl),e : dec l : ld (hl),d

        res    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 14
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 13
        pop de : ld (hl),e : dec l : ld (hl),d

FnDrawTileStack_OldStack:
        ld      sp,&0000
        ei
                                   
        ret


By my calculation, this routine takes only 80% of the time of the original.

"That, as they say, is speedy"  :D


Hi,

Two quick ideas to accelerate it a bit more:

The first one is about the character skip:

ld     bc,&e040
add   hl,bc

If I'm right, just before this code, you have the bit 5 = 1 in H, and bits 4 and 3 = 0. Well, this is not a problem if you start drawing the bottom row of characters in that order (starting with bit 5 set). So you should only change L here, and, because tiles are 16x16, a set 6,L should work instead of add &40.
But, if for some reason, you need to have both character rows in the same order, you could do:

res 5,H
set 6,L

This only saves 4 cycles in the first case and 2 in the second, but the more interesting part is that BC is not corrupted now (and maybe you can do some additional optimization in the routine that calls to draw tiles).

The second idea is about this line at the beginning of your routine:

ld (FnDrawTileStack_OldStack + 1), sp

In most cases, SP will have the same value when the routine is called, and would be a useless consumption be storing the same value for each tile.
In that case, you can store SP only once at the beginning of the/each routine that calls to draw tiles. For example, if the routine that calls to draw tiles has this structure:

1. preparation
2. loop body...
3. call(s) drawTile
4. continue loop body...
5. conditional jump to 2.
6. end and ret

In 1. preparation, you can do something like this:

push hl
ld (FnDrawTileStack_OldStack + 1), sp
pop hl

or this, if you can modify HL here:

ld hl,-2
add hl,sp
ld (FnDrawTileStack_OldStack + 1), hl

And save 6 additional cycles per tile. Usually, much time is lost in these processes of subroutine calls, preparations and returns. And possibly more optimizations can be made depending of the code that calls these critical routines.

Regards
Title: Re: Optimisation Corner
Post by: redbox on 21:30, 07 June 14
Quote from: Rhino on 10:55, 07 June 14
So you should only change L here, and, because tiles are 16x16, a set 6,L should work instead of add &40.

I like that, nice find.  Will add it to the routine and re-import my tiles with the altered line order for the second char line.

Quote from: Rhino on 10:55, 07 June 14
In most cases, SP will have the same value when the routine is called, and would be a useless consumption be storing the same value for each tile.
Usually, much time is lost in these processes of subroutine calls, preparations and returns. And possibly more optimizations can be made depending of the code that calls these critical routines.

I agree totally, but when I am writing something like a game (which I am) I always find the modular approach works during development, much like coding with Functions (hence the prefix Fn for my subroutines).

When everything is finished and working, then I appraise the code again and see where optimisations can be made in the setup and loading of registers during the flow of all subroutines.  If I do it now and I change something or when something breaks, it's a nightmare to fix  ;)

But I will bear in mind your SP recommendation when I get to this point, thanks  :)
Title: Re: Optimisation Corner
Post by: Rhino on 14:03, 08 June 14
Quote from: redbox on 21:30, 07 June 14
I like that, nice find.  Will add it to the routine and re-import my tiles with the altered line order for the second char line.

I agree totally, but when I am writing something like a game (which I am) I always find the modular approach works during development, much like coding with Functions (hence the prefix Fn for my subroutines).

When everything is finished and working, then I appraise the code again and see where optimisations can be made in the setup and loading of registers during the flow of all subroutines.  If I do it now and I change something or when something breaks, it's a nightmare to fix  ;)

But I will bear in mind your SP recommendation when I get to this point, thanks  :)

You're welcome. :)
I agree, from certain levels, an optimized code can be hard to understand. However, there are mechanisms that assist with this problem, such as using macros to encapsulate the complex parts in something readable.
Ultimately, it is a question of what we seek for our developments. Personally, and for the case of CPC (a machine not very well squeezed by the commercial software developers in terms of speed, IMO), I think speed should be a priority.

About leave optimizations to the end, I believe it is interesting for certain levels of optimization. But for deeper optimization involving collateral cost on things like memory, may not be possible to implement in the final step if you have not had the global perspective of these costs from the initial analysis.

Analyzing a little more the problem of quick tiles on CPC, I think saving a 15-20% additional of speed regarding your optimized routine is possible without big collateral costs. And, if you are interested, I can try to explain it here.
Title: Re: Optimisation Corner
Post by: SyX on 14:53, 08 June 14
Quote from: Rhino on 14:03, 08 June 14
Analyzing a little more the problem of quick tiles on CPC, I think saving a 15-20% additional of speed regarding your optimized routine is possible without big collateral costs. And, if you are interested, I can try to explain it here.
No te cortes un pelo y dadnos otra lección maestra... ehem, don't be shy and explain that magic trick  ;D
Title: Re: Optimisation Corner
Post by: Rhino on 16:39, 09 June 14
Ok, here we go, (calculations are done quickly and may not be exact):

3 things that can be exploited to draw tiles:

1) Usually there tiles that appear more frequently than others.
2) Usually there tiles with a single color, or simple meshes patterns.
3) In the gfxs of tiles, there are byte values ​​that are repeated frequently.

For example:

(http://www.joycogames.com/download/addams_family.png)

* "Addams family": Brick tile is repeated frequently (about 40-50%). Full black and green bytes very common in tiles.

(http://www.joycogames.com/download/gamber.png)

* "La Guerra de Gamber": Single color tiles repeated frequently (black, blue, about 20-30%), analyzing a complex tile, it is found that there are many repetitions of the same bytes.


So the idea is to use a generic routine only to draw the tiles less common, and specific fast routines for common tiles.

For example, imagine we have 256 different tiles, of which 16 of them are repeated more frequently and, in practice, they are about the 30% of the tiles that are drawn on the screen. These very common 16 tiles can be drawn with its own routine, allowing further optimization. For example, this code used for the generic routine:

pop de : ld (hl),e : inc l : ld (hl),d : inc l
pop de : ld (hl),e : inc l : ld (hl),d

Could be replaced by:

ld (hl),xx: inc l: ld (hl),xx: inc l
ld (hl),xx: inc l: ld (hl),xx

Saving 2 cycles per line of tile = 32 + 13 cycles by not having to manage the stack.

In addition, a few extra cycles can be saved due to the repeated bytes in the gfxs of the tiles.
For example, because we do not use DE now, if a byte is repeated more than 2 times in the gfx tile, we can do LD D,XX and replace the corresponding LD (HL),XX to LD (HL),D. And if you have a couple of bytes that are repeated more than 2 times, do LD DE,XXXX. In the example of the brick tile of "La Guerra de Gamber", we could do LD DE,xxxx with the values for the bytes repeated 11 and 8 times.

Besides this, if we can use BC, the two most common byte values in general can be stored in BC in advance. So that is not even necessary to load those values in DE ​​for each draw routine. A candidate for this is the 0 value (background color that in "La Guerra de Gamber" was repeated 8 times in the brick tile and a very common color in general).

By my calculations, to draw one of these tiles without any repeated byte we need 272 cycles. If we consider the common bytes in DE and BC, I think about 260 cycles per tile may be a conservative estimate.

But for some exceptional tiles that consist of a single color or simple meshes, the routine can be much faster, for example something like this:

Tile of a single color stored in BC:

ex hl,de

ld (hl),b : inc l : ld (hl),b : inc l
ld (hl),b : inc l : ld (hl),b : inc l

set 3,h

di

ld sp,hl
push bc
push bc

set 4,h

ld sp,hl
push bc
push bc
...

This draws a tile in 200 cycles aprox.

The downside to this speed increase is the additional memory required by the code of the specific routines that draw the 16 most frequent tiles, which by my calculations are about 2kb.

Now, we just have to call the appropriate routine according to the tile should be drawn with a generic or specific routine. The idea is to replace the pointer to the tile gfx by a pointer to the routine that draws the tile, for which we need a pointer for each routine that draws a tile.

The implementation of this is closely related to the way in which the tile index are stored in the tile map. If a tile is stored as 2 bytes, may be the exact pointer to the draw routine, which would be faster. And if a tile is stored as 1 byte, we will need to do a little calculation to convert the value from 0-255 to the pointer to the routine (in the same way that was done to get the tile gfx pointer).

Assuming that a tile is stored as 1 byte, and for there to be a pointer for each tile, it takes a jump table for the 256 tiles. In this table, we can put first the less common 240 tiles that are drawn with the generic routine, and then the 16 most common tiles that are drawn with specific routines, leaving for last the most common tile of all. This table would be something like this:

; Draw the tile number 1 with the generic routine

align 8

draw_tile1:
  di
  ld sp,xx  ; xx is the pointer to the gfx of the tile number 1
  jp (ix)     ; ix is the pointer to the generic routine

; Draw the tile number 2 with the generic routine

align 8

draw_tile2:
  di
  ld sp,xx
  jp (ix)

...

; Draw the tile number 240 with the specific routine

align 8

draw_tile240:
  jp draw_tile240_implementation

...

; Draw the tile number 255 with the specific routine

align 8

draw_tile255:
; jp not required here

...

For 256 tiles, this table requires 2kb.

Then, this is the routine that draws the tiles:

FnDrawTilemap_Draw:
  jp (hl)

So that when you call FnDrawTilemap_Draw, it jump to (hl), which is the pointer to the table described above. These jumps need 4 additional cycles for both generic and specific routines (except for the most common tile that we put at the end of the table, where only 1 additional cycle is required).

Now, to evaluate the result, let's summarize the costs and estimate speed savings rate on several assumptions.

* Cost for 16 optimized tiles:

- Additional Memory = 4kb.
- BC and IX usage.

* Estimation of average speed for each routine:

- Generic routine = 314 cycles (assuming it's your same routine with the optimizations mentioned in the previous post and adding the 4 additional cycles required by the jump table).
- Specific routines = 264 cycles.
- Specific routines for one color or simple meshes = 201 cycles.


* Speed estimation for 70% of less common tiles, 30% of optimized tiles (of which 5% are tiles of a single color or mesh):

70% -> 314 cycles
25% -> 264 cycles
5% -> 201 cycles
---------------
Average -> 296 cycles

* Speed estimation for 50% of less common tiles, 50% of optimized tiles (of which 10% are tiles of a single color or mesh):

50% -> 314 cycles
40% -> 264 cycles
10% -> 201 cycles
---------------
Average -> 283 cycles

* Speed estimation for 33% of less common tiles, 66% of optimized tiles (of which 33% are tiles of a single color or mesh):

33% -> 314 cycles
33% -> 264 cycles
33% -> 201 cycles
---------------
Average -> 260 cycles

If the main priority is the speed, we can even determine the design of the tiles to force an extreme optimization case.

Regards and sorry for the brick.

Rhino
Title: Re: Optimisation Corner
Post by: arnoldemu on 17:49, 09 June 14
Operation Wolf uses methods like these.

The lower half of the scrolling area has a fixed pattern I think.
So custom tile drawing method here.

The upper half also generally repeats. Probably just does a wrap scroll.

All these are excellent if you have enough ram.

The beat em up I am working on is already using loads of ram because of the sheer amount of animation moves.

Approx 40K is already in use for animation frames. Another 10K is for map.
I need 2 *16KB double buffer, that's another 32k. I need some space for music - another 6k/8k.

I also need space for good ai.

So it's going to be a tough one to optimise ;)





Title: Re: Optimisation Corner
Post by: Rhino on 23:58, 09 June 14
Quote from: arnoldemu on 17:49, 09 June 14
So it's going to be a tough one to optimise ;)

Well, those are the challenges we like :)

BTW, note that the 2kb required for the jump table has many unused bytes that can be used to store variables or other data.
So the really additional memory required could be less than 4kb. As it is also possible to use the amount of optimized tiles that best suits your needs.

However, implementing this would require a nice work in tool, in order to calculate the most frequently repeated tiles, and the repeated bytes in the tiles too, to export the tiles as optimized code.
Title: Re: Optimisation Corner
Post by: CraigsBar on 00:04, 10 June 14
Quote from: Rhino on 23:58, 09 June 14
Well, those are the challenges we like :)

Or rather, once those challenges are overcome, those are the results we like.

Craig.

Sent from my HTC One_M8 using Tapatalk

Title: Re: Optimisation Corner
Post by: redbox on 10:03, 10 June 14
Quote from: Rhino on 16:39, 09 June 14
So the idea is to use a generic routine only to draw the tiles less common, and specific fast routines for common tiles.

So by using a combination of compiled tiles and a fast stack based generic tile routine we reduce our average considerably.

I was wondering how to jump to the relevant routines but you've explained it nicely with the jumptable idea and how to structure the jumps after that.

I will build a complete tile routine as an example and when finished post it here.
Powered by SMFPacks Menu Editor Mod