Author Topic: Optimisation Corner  (Read 4659 times)

0 Members and 1 Guest are viewing this topic.

Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
  • Likes Given: 267
Optimisation Corner
« on: 12:17, 04 June 14 »
Thought I'd start a new thread to discuss ways to optimise routines.

Handy that, as I have one to start it off...  :D

Here is my current tile drawing routine:

Code: [Select]
;; FnDrawTile

;; Description: Draw a tile
;; Conditions:  Screen address must be mutiple of 8 (character height)
;;              Screen must be 32x32 character screen (INCs are 8-bit)
;;              Tile data must be aligned to boundary (INCs are 8-bit)
;;              Tile size is set to 16x16 (4 bytes by 16 lines)
;; Notes:       401 NOPs

;; Entry:       HL - screen address, DE - tile data
;; Exit:        None
;; Corrupt:     A, BC, DE, HL
       
    ld a,(de) : ld (hl),a : inc e : inc l   ; line 1
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    set     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 2
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e

    set    4,h

    ld a,(de) : ld (hl),a : inc e : inc l    ; line 4
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    res     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 3
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e
       
    set    5,h

    ld a,(de) : ld (hl),a : inc e : inc l    ; line 7
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    set     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 8
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e

    res    4,h

    ld a,(de) : ld (hl),a : inc e : inc l    ; line 6
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    res     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 5
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e

    ld     bc,&e040                ; HL here is e.g. &E000, we want to...
    add    hl,bc                   ; ...get to &C040, so we add &E040

    ld a,(de) : ld (hl),a : inc e : inc l    ; line 9
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    set     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 10
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e

    set    4,h

    ld a,(de) : ld (hl),a : inc e : inc l    ; line 12
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    res     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 11
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e

    set    5,h

    ld a,(de) : ld (hl),a : inc e : inc l    ; line 15
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    set     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 16
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e

    res    4,h

    ld a,(de) : ld (hl),a : inc e : inc l    ; line 12
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e : inc l   
    ld a,(de) : ld (hl),a : inc e

    res     3,h

    ld a,(de) : ld (hl),a : inc e : dec l    ; line 13
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a : inc e : dec l
    ld a,(de) : ld (hl),a
                               
    ret

Currently I have to draw up to 4 of these per sprite (20x16 size) per frame using the dirty tilemap method, so any way of making it faster..?

I was thinking of using the stack and SP, but am unsure of how the tile data order would differ.



« Last Edit: 13:13, 04 June 14 by redbox »

Offline ralferoo

  • Supporter
  • 6128 Plus
  • *
  • Posts: 969
  • Country: gb
  • Liked: 583
  • Likes Given: 222
Re: Optimisation Corner
« Reply #1 on: 14:38, 04 June 14 »
The ones with "inc l" and "inc e" you might as well use ldir for - it's 5us instead of 6us for the separate instructions, obviously corrupts C and maybe B.

Using the stack is fine, but you have to consider certain things.... You'd probably want to use the stack as the source, so then you can just pop and carry on doing manipulations on HL (because setting the SP requires going through HL or an explicit load). Using HL as the destination also allows you to do POP BC ; LD (HL),C ; INC L ; LD (HL),B ; INC L

The order is best experimented with. But essentially, BC when popped, C is the lowest address, B is the next address. If you're popping, the order increases upwards, so is more "normal".

Pushing to the screen is a good technique if you need to keep interrupts enabled most of the time, as long as you know that your interrupt handler won't write too much onto the stack. So, e.g. once per line, you could update SP to the next bit of data to write to, enable interrupts briefly whilst you load the next data, then disable them. That way your interrupt routine will only overwrite screen data that's about to be redrawn anyway. But I suspect changing SP too frequently for the next destination address will cancel out any gains you get. This technique is best for things like clearing or copying large areas.



Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
  • Likes Given: 267
Re: Optimisation Corner
« Reply #2 on: 15:29, 04 June 14 »
Using HL as the destination also allows you to do POP BC ; LD (HL),C ; INC L ; LD (HL),B ; INC L
The order is best experimented with. But essentially, BC when popped, C is the lowest address, B is the next address. If you're popping, the order increases upwards, so is more "normal".

That's quite a speed saving.

What's the best way to store the SP and then restore it again at the end of the routine?  I assume I use a DI at the start and then EI at the end too...?

Offline Urusergi

  • CPC6128
  • ****
  • Posts: 222
  • Country: es
  • Liked: 433
  • Likes Given: 1469
Re: Optimisation Corner
« Reply #3 on: 16:52, 04 June 14 »
What's the best way to store the SP and then restore it again at the end of the routine?

Maybe...

Code: [Select]
           LD HL,0
           ADD HL,SP

          ............

          LD SP,HL

Offline SyX

  • 6128 Plus
  • ******
  • Posts: 1.129
  • Country: br
  • Liked: 1121
  • Likes Given: 1871
Re: Optimisation Corner
« Reply #4 on: 16:54, 04 June 14 »
Took directly from pacman code:
Code: [Select]
; DE : Screen address
; HL : Tile address
draw_tile
    DI
    LD   (.sm_old_stack + 1),SP
    LD   SP,HL
    EX   DE,HL

    ; 1º Scanline
    POP  DE
    LD   (HL),E
    INC  L
    LD   (HL),D
    SET  3,H
    ; 2º Scanline
    POP  DE
    LD   (HL),D
    DEC  L
    LD   (HL),E
    SET  4,H
    ; 4º Scanline
    POP  DE
    LD   (HL),E
    INC  L
    LD   (HL),D
    RES  3,H
    ; 3º Scanline
    POP  DE
    LD   (HL),D
    DEC  L
    LD   (HL),E
    SET  5,H
    ; 7º Scanline
    POP  DE
    LD   (HL),E
    INC  L
    LD   (HL),D
    RES  4,H
    ; 5º Scanline
    POP  DE
    LD   (HL),D
    DEC  L
    LD   (HL),E
    SET  3,H
    ; 6º Scanline
    POP  DE
    LD   (HL),E
    INC  L
    LD   (HL),D
    SET  4,H
    ; 8º Scanline
    POP  DE
    LD   (HL),D
    DEC  L
    LD   (HL),E

.sm_old_stack
    LD   SP,$0000
    EI

Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
  • Likes Given: 267
Re: Optimisation Corner
« Reply #5 on: 21:43, 04 June 14 »
Took directly from pacman code:

Very nice, always like a bit of self modifying code as well  :)



« Last Edit: 21:54, 04 June 14 by redbox »

Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
  • Likes Given: 267
Re: Optimisation Corner
« Reply #6 on: 22:00, 04 June 14 »
Okay, have put this together and here is our first optimised routine:

Code: [Select]
;; FnDrawTileStack

;; Description: Draw a tile using the stack
;; Conditions:  Screen address must be mutiple of 8 (character height)
;;              Screen must be 32x32 character screen (INCs are 8-bit)
;;              Tile size is set to 16x16 (4 bytes by 16 lines)
;; Notes:       320 NOPs

;; Entry:       DE - screen address, HL - tile data
;; Exit:        None
;; Corrupt:     BC, DE, HL
       
        di
       
        ld      (FnDrawTileStack_OldStack + 1),sp
        ld      sp,hl
        ex      hl,de
       
        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 1
        pop de : ld (hl),e : inc l : ld (hl),d
       
        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 2
        pop de : ld (hl),e : dec l : ld (hl),d

        set    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 4
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 3
        pop de : ld (hl),e : dec l : ld (hl),d
       
        set    5,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 7
        pop de : ld (hl),e : inc l : ld (hl),d

        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 8
        pop de : ld (hl),e : dec l : ld (hl),d

        res    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 6
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 5
        pop de : ld (hl),e : dec l : ld (hl),d

        ld     bc,&e040                ; HL here is e.g. &E000, we want to...
        add    hl,bc                   ; ...get to &C040, so we add &E040

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 9
        pop de : ld (hl),e : inc l : ld (hl),d

        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 10
        pop de : ld (hl),e : dec l : ld (hl),d

        set    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 12
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 11
        pop de : ld (hl),e : dec l : ld (hl),d

        set    5,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 15
        pop de : ld (hl),e : inc l : ld (hl),d

        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 16
        pop de : ld (hl),e : dec l : ld (hl),d

        res    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 14
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 13
        pop de : ld (hl),e : dec l : ld (hl),d

FnDrawTileStack_OldStack:
        ld      sp,&0000
        ei
                                   
        ret

By my calculation, this routine takes only 80% of the time of the original.

"That, as they say, is speedy"  :D


Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
  • Likes Given: 267
Re: Optimisation Corner
« Reply #7 on: 17:36, 05 June 14 »
I have another problem which could be somewhat optimised...

I am double buffering using &8000 and &c000 for the two screens (so I can keep &4000 to &7fff free to bank in via C4, C5 etc).

I therefore have two variables holding the high bytes (&80 and &c0) for my hidden and visible screens to use in my drawing routines.  Swapping them over is easy as I just XOR &40 both of them.  Converting them to CRTC base is also easy to just doing SRL A twice (&80 -> &20 and &c0 -> &30).

Problem I have is marking the tiles in my dirty tilemap and relating them to which screen needs to be drawn.

If I have &80 in the tilemap this equates to %10XXXXXX.  And &c0 equates to %11XXXXXX.  So when I scan the tilemap, find &c0, draw the tile, then reset the &c0 I could also erase the &80 marker too.

Is there any smart bitwise way to do this?  Or am I stuck having to use two different buffers for the two screens (which is, of course, massively not optimised)...?

Offline Axelay

  • 6128 Plus
  • ******
  • Posts: 584
  • Country: au
  • Liked: 383
  • Likes Given: 87
Re: Optimisation Corner
« Reply #8 on: 02:26, 06 June 14 »
Are you able to sub &40 from the byte at the start of the process so the two chek bytes become &40 and &80 and only use a bit each?

Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
  • Likes Given: 267
Re: Optimisation Corner
« Reply #9 on: 09:50, 06 June 14 »
Are you able to sub &40 from the byte at the start of the process so the two chek bytes become &40 and &80 and only use a bit each?

Yes had thought about this so I can store the &80 and &40 as markers in the tilemap (%10XXXXXX and %01XXXXXX respectively).

So now I will have the ScrHidden variable which is either &c0 or &80.  I subtract &40 so it becomes &80 or &40.

I then have TilemapAttrib variable which can be &80 = %10000000 (draw to &c0), &40 = %01000000 (draw to &80) or &c0 = %11000000 (draw to both).

How do I work out which one to draw to, remove the bit switch and leave the other intact (for the next frame) if it's still there...?

Offline Axelay

  • 6128 Plus
  • ******
  • Posts: 584
  • Country: au
  • Liked: 383
  • Likes Given: 87
Re: Optimisation Corner
« Reply #10 on: 11:05, 06 June 14 »
It's not as simple as just XORing off the &40 or &80?


Or, could you do a single check check for whether it's &80 or &40 at the beginning of the process, and from that point just use BIT 6/7 checks and RES 6/7 to clear them, maybe even the same loop with some self modifying code to alter the BIT & RES instructions?

Offline arnoldemu

  • Supporter
  • 6128 Plus
  • *
  • Posts: 5.336
  • Country: gb
    • Unofficial Amstrad WWW Resource
  • Liked: 2274
  • Likes Given: 3478
Re: Optimisation Corner
« Reply #11 on: 11:16, 06 June 14 »
If you're storing a bit per screen anyway, why not have two of these buffers?

keeps the code simpler, same storage space (assuming you're also packing these into a byte).
My games. My Games
My website with coding examples: Unofficial Amstrad WWW Resource

Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
  • Likes Given: 267
Re: Optimisation Corner
« Reply #12 on: 13:23, 06 June 14 »
It's not as simple as just XORing off the &40 or &80?

Yes, that would remove it from the TilemapAttrib and leave the other one intact if it is there.

But how do I detect it (so I can print a tile)...?  Using XOR just sets the Z flag, which obviously won't be true if both bits are set.

I can't use a CP &80 or CP &40 because both bits might be set which means it's &c0.

At the moment I have something like this, which just works if &80 is found:

Code: [Select]
    ld    h,DtBufferTilemapAttrib/256       
    ld    l,&00               
    ld    a,(hl)
    bit    7,a
    call    nz,FnDrawTilemap_Draw

But what I need is something like this:

Code: [Select]
    ld    h,DtBufferTilemapAttrib/256       
    ld    l,&00               
    ld    a,(hl)
    ... compare with (ScrHidden-&40) ...
    call   z,FnDrawTilemap_Draw          ; if true, draw the tile
    xor    A with (ScrHidden-&40)        ; remove just this switch
    ld     (hl),a                        ; store it back

If you're storing a bit per screen anyway, why not have two of these buffers?
keeps the code simpler, same storage space (assuming you're also packing these into a byte).

Because I want to optimise it  ;)

Offline ralferoo

  • Supporter
  • 6128 Plus
  • *
  • Posts: 969
  • Country: gb
  • Liked: 583
  • Likes Given: 222
Re: Optimisation Corner
« Reply #13 on: 14:37, 06 June 14 »
If you're storing a bit per screen anyway, why not have two of these buffers?
Because I want to optimise it  ;)
I'm with arnoldemu on this. If you're double buffering, I think it's best to maintain two lists, for what you've drawn on each screen and how to erase it and consider each screen as if it was single buffered, just with double the timestep between each... i.e. don't even consider the other screen.

Otherwise, you'll find yourself getting very confused about what you're doing, you'll have to trawl through a big list and check these bits, etc.

In my experience, optimising for old platforms is about simplifying the main task, not adding more complexity to it! :)

Offline arnoldemu

  • Supporter
  • 6128 Plus
  • *
  • Posts: 5.336
  • Country: gb
    • Unofficial Amstrad WWW Resource
  • Liked: 2274
  • Likes Given: 3478
Re: Optimisation Corner
« Reply #14 on: 14:56, 06 June 14 »
Because I want to optimise it  ;)

if you have two buffers it becomes more simple.

you can detect 8 horizontal tiles are dirty or not with a simple

Code: [Select]
ld a,(hl)
or a

then you can shift the bits to find which are actually dirty.

the only problem you may have is when hardware scrolling. here you want to dirty the tiles you just drawn so that they are also redrawn on the new back screen and vice versa.

My games. My Games
My website with coding examples: Unofficial Amstrad WWW Resource

Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
  • Likes Given: 267
Re: Optimisation Corner
« Reply #15 on: 15:34, 06 June 14 »
if you have two buffers it becomes more simple.

Okay, I'll use two buffers then (although I was tempted to use Axelay's self-modifying version!).

At the moment, I'm accessing buffer one like this:

Code: [Select]
    ld    h,MyBuffer_Scr1/256       
    ld    l,element
    ld    a,(hl)

Obviously if I have MyBuffer_Scr1 and MyBuffer_Scr2, I'll need a pointer (MyBuffer_Pointer) like this:

Code: [Select]
MyBuffer_Pointer:
    defw MyBuffer_Scr1

What's the fastest way to swap the pointer between MyBuffer_Scr1 and MyBuffer_Scr2 each frame? 

And is the best way to access the pointer and load it into H this: LD A,(MyBuffer_Pointer/256) : LD H,A...?

Offline Axelay

  • 6128 Plus
  • ******
  • Posts: 584
  • Country: au
  • Liked: 383
  • Likes Given: 87
Re: Optimisation Corner
« Reply #16 on: 15:37, 06 June 14 »
Yes, that would remove it from the TilemapAttrib and leave the other one intact if it is there.

But how do I detect it (so I can print a tile)...?  Using XOR just sets the Z flag, which obviously won't be true if both bits are set.

I can't use a CP &80 or CP &40 because both bits might be set which means it's &c0.

At the moment I have something like this, which just works if &80 is found:




Hmm, well the obvious one is to use AND instead, but that destroys the rest of A so I'm assuming you want to preserve the rest of A or you'd be using that?  Tried to think of something that uses carry instead, but nothing particularly good has leapt out at me.  If it's the speed of the comparison, then you could use two steps on your base value of &c0 and &80.  SUB &40 and then RLCA, so you are using bit 0 and 7 rather than 7 and 6, then your loop needs either RLCA or RRCA and call c,FnDrawTilemap_Draw after which will be a fast check, but if the point of preserving the other bits was that they are holding information you want then they'll be in different positions for each buffer so that's only really moving the problem elsewhere.


Um.... what about.... 
Code: [Select]
ld    h,DtBufferTilemapAttrib/256
ld    l,&00               
bit    7,(hl) ; or 6, could use self mod code to change
call   nz,FnDrawTilemap_Draw          ; if true, draw the tile
So if the byte in (hl) is storing other info you need, you move the loading of it into A into the  FnDrawTilemap_Draw routine, along with the writing back of (hl) if it isnt already. 


Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
  • Likes Given: 267
Re: Optimisation Corner
« Reply #17 on: 16:08, 06 June 14 »
Hmm, well the obvious one is to use AND instead, but that destroys the rest of A so I'm assuming you want to preserve the rest of A or you'd be using that?

Ideally yes, because then I have the option to put the switches into the tilemap, e.g. use %00XXXXXX and then I can still store 64 tiles and the switches.

Um.... what about.... 
bit    7,(hl) ; or 6, could use self mod code to change

Or I guess I could just use the self-modifying code and then I wouldn't have to worry about turning &c0/&80 into anything - I could just use the switches above (i.e. %00XXXXXX) and then something like this:

Code: [Select]
        ld      a,(DtGlobalVar_BaseHidden)      ; get screen base we're writing too
        cp      &c0                             ; is it &c0? then we want to BIT/RES bit 7
        jr      z,screen_is_c0
                                                ; else it's &80 so we want to BIT/RES bit 6
        ld      hl,&77cb                        ; bit 6,a
        ld      (FnDrawTilemap_Loop_Bit),hl
        ld      hl,&b7cb                        ; res 6,a
        ld      (FnDrawTilemap_Loop_Res),hl
        jr      start_it                   

screen_is_c0:       
        ld      hl,&7fcb                        ; bit 7,a
        ld      (FnDrawTilemap_Loop_Bit),hl
        ld      hl,&bfcd                        ; res 7,a
        ld      (FnDrawTilemap_Loop_Res),hl               

start_it:
       .....

What do you think?

Offline Axelay

  • 6128 Plus
  • ******
  • Posts: 584
  • Country: au
  • Liked: 383
  • Likes Given: 87
Re: Optimisation Corner
« Reply #18 on: 16:56, 06 June 14 »
Yes, that.  :)  Although, the CB in the BIT/RES instructions never changes & the difference in the part of the instruction that does change is identical, so for a tiny optimisation I think something like this should be doing the same thing:


Code: [Select]

        ld      a,(DtGlobalVar_BaseHidden)      ; get screen base we're writing too
        cp      &c0                             ; is it &c0? then we want to BIT/RES bit 7
        jr      z,screen_is_c0
                                                ; else it's &80 so we want to BIT/RES bit 6
        ld      a,&77                        ; bit 6,a
        jr      start_it                   

screen_is_c0:       
        ld      a,&7f                        ; bit 7,a

start_it:
         ld      (FnDrawTilemap_Loop_Bit+1),a
         xor a,&c0 ; change &7x to &bx
         ld      (FnDrawTilemap_Loop_Res+1),a
       .....


Not sure if it's any faster, but it saves a couple of bytes at least.

Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
  • Likes Given: 267
Re: Optimisation Corner
« Reply #19 on: 20:22, 06 June 14 »
Yes, that.  :)  Although, the CB in the BIT/RES instructions never changes & the difference in the part of the instruction that does change is identical

Sounds good to me and is the option I've chosen (I'm a sucker for some self modifying code  ;) ).

I can always fall back on using the two buffers should things get complicated later on...


 
« Last Edit: 21:58, 06 June 14 by redbox »

Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
  • Likes Given: 267
Re: Optimisation Corner
« Reply #20 on: 22:04, 06 June 14 »
...which leads me onto:

you can detect 8 horizontal tiles are dirty or not with a simple

Code: [Select]
ld a,(hl)
or a

then you can shift the bits to find which are actually dirty.

That's interesting as OR A is very quick. Interesting that a CP NN is faster than a BIT N,r too..

What do you mean by "shift the bits"? Do you mean compact the tilemap bitwise somehow?

« Last Edit: 23:25, 07 June 14 by redbox »

Offline Rhino

  • CPC6128
  • ****
  • Posts: 291
  • Liked: 870
  • Likes Given: 366
Re: Optimisation Corner
« Reply #21 on: 12:55, 07 June 14 »
Okay, have put this together and here is our first optimised routine:

Code: [Select]
;; FnDrawTileStack

;; Description: Draw a tile using the stack
;; Conditions:  Screen address must be mutiple of 8 (character height)
;;              Screen must be 32x32 character screen (INCs are 8-bit)
;;              Tile size is set to 16x16 (4 bytes by 16 lines)
;; Notes:       320 NOPs

;; Entry:       DE - screen address, HL - tile data
;; Exit:        None
;; Corrupt:     BC, DE, HL
       
        di
       
        ld      (FnDrawTileStack_OldStack + 1),sp
        ld      sp,hl
        ex      hl,de
       
        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 1
        pop de : ld (hl),e : inc l : ld (hl),d
       
        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 2
        pop de : ld (hl),e : dec l : ld (hl),d

        set    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 4
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 3
        pop de : ld (hl),e : dec l : ld (hl),d
       
        set    5,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 7
        pop de : ld (hl),e : inc l : ld (hl),d

        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 8
        pop de : ld (hl),e : dec l : ld (hl),d

        res    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 6
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 5
        pop de : ld (hl),e : dec l : ld (hl),d

        ld     bc,&e040                ; HL here is e.g. &E000, we want to...
        add    hl,bc                   ; ...get to &C040, so we add &E040

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 9
        pop de : ld (hl),e : inc l : ld (hl),d

        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 10
        pop de : ld (hl),e : dec l : ld (hl),d

        set    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 12
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 11
        pop de : ld (hl),e : dec l : ld (hl),d

        set    5,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 15
        pop de : ld (hl),e : inc l : ld (hl),d

        set     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 16
        pop de : ld (hl),e : dec l : ld (hl),d

        res    4,h

        pop de : ld (hl),e : inc l : ld (hl),d : inc l  ; line 14
        pop de : ld (hl),e : inc l : ld (hl),d

        res     3,h

        pop de : ld (hl),e : dec l : ld (hl),d : dec l  ; line 13
        pop de : ld (hl),e : dec l : ld (hl),d

FnDrawTileStack_OldStack:
        ld      sp,&0000
        ei
                                   
        ret

By my calculation, this routine takes only 80% of the time of the original.

"That, as they say, is speedy"  :D


Hi,

Two quick ideas to accelerate it a bit more:

The first one is about the character skip:

ld     bc,&e040
add   hl,bc

If I'm right, just before this code, you have the bit 5 = 1 in H, and bits 4 and 3 = 0. Well, this is not a problem if you start drawing the bottom row of characters in that order (starting with bit 5 set). So you should only change L here, and, because tiles are 16x16, a set 6,L should work instead of add &40.
But, if for some reason, you need to have both character rows in the same order, you could do:

res 5,H
set 6,L

This only saves 4 cycles in the first case and 2 in the second, but the more interesting part is that BC is not corrupted now (and maybe you can do some additional optimization in the routine that calls to draw tiles).

The second idea is about this line at the beginning of your routine:

ld (FnDrawTileStack_OldStack + 1), sp

In most cases, SP will have the same value when the routine is called, and would be a useless consumption be storing the same value for each tile.
In that case, you can store SP only once at the beginning of the/each routine that calls to draw tiles. For example, if the routine that calls to draw tiles has this structure:

1. preparation
2. loop body...
3. call(s) drawTile
4. continue loop body...
5. conditional jump to 2.
6. end and ret

In 1. preparation, you can do something like this:

push hl
ld (FnDrawTileStack_OldStack + 1), sp
pop hl

or this, if you can modify HL here:

ld hl,-2
add hl,sp
ld (FnDrawTileStack_OldStack + 1), hl

And save 6 additional cycles per tile. Usually, much time is lost in these processes of subroutine calls, preparations and returns. And possibly more optimizations can be made depending of the code that calls these critical routines.

Regards
« Last Edit: 12:58, 07 June 14 by Rhino »

Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
  • Likes Given: 267
Re: Optimisation Corner
« Reply #22 on: 23:30, 07 June 14 »
So you should only change L here, and, because tiles are 16x16, a set 6,L should work instead of add &40.

I like that, nice find.  Will add it to the routine and re-import my tiles with the altered line order for the second char line.

In most cases, SP will have the same value when the routine is called, and would be a useless consumption be storing the same value for each tile.
Usually, much time is lost in these processes of subroutine calls, preparations and returns. And possibly more optimizations can be made depending of the code that calls these critical routines.

I agree totally, but when I am writing something like a game (which I am) I always find the modular approach works during development, much like coding with Functions (hence the prefix Fn for my subroutines).

When everything is finished and working, then I appraise the code again and see where optimisations can be made in the setup and loading of registers during the flow of all subroutines.  If I do it now and I change something or when something breaks, it's a nightmare to fix  ;)

But I will bear in mind your SP recommendation when I get to this point, thanks  :)

Offline Rhino

  • CPC6128
  • ****
  • Posts: 291
  • Liked: 870
  • Likes Given: 366
Re: Optimisation Corner
« Reply #23 on: 16:03, 08 June 14 »
I like that, nice find.  Will add it to the routine and re-import my tiles with the altered line order for the second char line.

I agree totally, but when I am writing something like a game (which I am) I always find the modular approach works during development, much like coding with Functions (hence the prefix Fn for my subroutines).

When everything is finished and working, then I appraise the code again and see where optimisations can be made in the setup and loading of registers during the flow of all subroutines.  If I do it now and I change something or when something breaks, it's a nightmare to fix  ;)

But I will bear in mind your SP recommendation when I get to this point, thanks  :)

You're welcome. :)
I agree, from certain levels, an optimized code can be hard to understand. However, there are mechanisms that assist with this problem, such as using macros to encapsulate the complex parts in something readable.
Ultimately, it is a question of what we seek for our developments. Personally, and for the case of CPC (a machine not very well squeezed by the commercial software developers in terms of speed, IMO), I think speed should be a priority.

About leave optimizations to the end, I believe it is interesting for certain levels of optimization. But for deeper optimization involving collateral cost on things like memory, may not be possible to implement in the final step if you have not had the global perspective of these costs from the initial analysis.

Analyzing a little more the problem of quick tiles on CPC, I think saving a 15-20% additional of speed regarding your optimized routine is possible without big collateral costs. And, if you are interested, I can try to explain it here.

Offline SyX

  • 6128 Plus
  • ******
  • Posts: 1.129
  • Country: br
  • Liked: 1121
  • Likes Given: 1871
Re: Optimisation Corner
« Reply #24 on: 16:53, 08 June 14 »
Analyzing a little more the problem of quick tiles on CPC, I think saving a 15-20% additional of speed regarding your optimized routine is possible without big collateral costs. And, if you are interested, I can try to explain it here.
No te cortes un pelo y dadnos otra lección maestra... ehem, don't be shy and explain that magic trick  ;D