Optimisation Corner

Rhino · 16:39, 09 June 14

Ok, here we go, (calculations are done quickly and may not be exact):

3 things that can be exploited to draw tiles:

1) Usually there tiles that appear more frequently than others.
2) Usually there tiles with a single color, or simple meshes patterns.
3) In the gfxs of tiles, there are byte values that are repeated frequently.

For example:

* "Addams family": Brick tile is repeated frequently (about 40-50%). Full black and green bytes very common in tiles.

* "La Guerra de Gamber": Single color tiles repeated frequently (black, blue, about 20-30%), analyzing a complex tile, it is found that there are many repetitions of the same bytes.

So the idea is to use a generic routine only to draw the tiles less common, and specific fast routines for common tiles.

For example, imagine we have 256 different tiles, of which 16 of them are repeated more frequently and, in practice, they are about the 30% of the tiles that are drawn on the screen. These very common 16 tiles can be drawn with its own routine, allowing further optimization. For example, this code used for the generic routine:

pop de : ld (hl),e : inc l : ld (hl),d : inc l
pop de : ld (hl),e : inc l : ld (hl),d

Could be replaced by:

ld (hl),xx: inc l: ld (hl),xx: inc l
ld (hl),xx: inc l: ld (hl),xx

Saving 2 cycles per line of tile = 32 + 13 cycles by not having to manage the stack.

In addition, a few extra cycles can be saved due to the repeated bytes in the gfxs of the tiles.
For example, because we do not use DE now, if a byte is repeated more than 2 times in the gfx tile, we can do LD D,XX and replace the corresponding LD (HL),XX to LD (HL),D. And if you have a couple of bytes that are repeated more than 2 times, do LD DE,XXXX. In the example of the brick tile of "La Guerra de Gamber", we could do LD DE,xxxx with the values for the bytes repeated 11 and 8 times.

Besides this, if we can use BC, the two most common byte values in general can be stored in BC in advance. So that is not even necessary to load those values in DE for each draw routine. A candidate for this is the 0 value (background color that in "La Guerra de Gamber" was repeated 8 times in the brick tile and a very common color in general).

By my calculations, to draw one of these tiles without any repeated byte we need 272 cycles. If we consider the common bytes in DE and BC, I think about 260 cycles per tile may be a conservative estimate.

But for some exceptional tiles that consist of a single color or simple meshes, the routine can be much faster, for example something like this:

Tile of a single color stored in BC:

ex hl,de

ld (hl),b : inc l : ld (hl),b : inc l
ld (hl),b : inc l : ld (hl),b : inc l

set 3,h

di

ld sp,hl
push bc
push bc

set 4,h

ld sp,hl
push bc
push bc
...

This draws a tile in 200 cycles aprox.

The downside to this speed increase is the additional memory required by the code of the specific routines that draw the 16 most frequent tiles, which by my calculations are about 2kb.

Now, we just have to call the appropriate routine according to the tile should be drawn with a generic or specific routine. The idea is to replace the pointer to the tile gfx by a pointer to the routine that draws the tile, for which we need a pointer for each routine that draws a tile.

The implementation of this is closely related to the way in which the tile index are stored in the tile map. If a tile is stored as 2 bytes, may be the exact pointer to the draw routine, which would be faster. And if a tile is stored as 1 byte, we will need to do a little calculation to convert the value from 0-255 to the pointer to the routine (in the same way that was done to get the tile gfx pointer).

Assuming that a tile is stored as 1 byte, and for there to be a pointer for each tile, it takes a jump table for the 256 tiles. In this table, we can put first the less common 240 tiles that are drawn with the generic routine, and then the 16 most common tiles that are drawn with specific routines, leaving for last the most common tile of all. This table would be something like this:

; Draw the tile number 1 with the generic routine

align 8

draw_tile1:
di
ld sp,xx ; xx is the pointer to the gfx of the tile number 1
jp (ix) ; ix is the pointer to the generic routine

; Draw the tile number 2 with the generic routine

align 8

draw_tile2:
di
ld sp,xx
jp (ix)

...

; Draw the tile number 240 with the specific routine

align 8

draw_tile240:
jp draw_tile240_implementation

...

; Draw the tile number 255 with the specific routine

align 8

draw_tile255:
; jp not required here

...

For 256 tiles, this table requires 2kb.

Then, this is the routine that draws the tiles:

FnDrawTilemap_Draw:
jp (hl)

So that when you call FnDrawTilemap_Draw, it jump to (hl), which is the pointer to the table described above. These jumps need 4 additional cycles for both generic and specific routines (except for the most common tile that we put at the end of the table, where only 1 additional cycle is required).

Now, to evaluate the result, let's summarize the costs and estimate speed savings rate on several assumptions.

* Cost for 16 optimized tiles:

- Additional Memory = 4kb.
- BC and IX usage.

* Estimation of average speed for each routine:

- Generic routine = 314 cycles (assuming it's your same routine with the optimizations mentioned in the previous post and adding the 4 additional cycles required by the jump table).
- Specific routines = 264 cycles.
- Specific routines for one color or simple meshes = 201 cycles.

* Speed estimation for 70% of less common tiles, 30% of optimized tiles (of which 5% are tiles of a single color or mesh):

70% -> 314 cycles
25% -> 264 cycles
5% -> 201 cycles
---------------
Average -> 296 cycles

* Speed estimation for 50% of less common tiles, 50% of optimized tiles (of which 10% are tiles of a single color or mesh):

50% -> 314 cycles
40% -> 264 cycles
10% -> 201 cycles
---------------
Average -> 283 cycles

* Speed estimation for 33% of less common tiles, 66% of optimized tiles (of which 33% are tiles of a single color or mesh):

33% -> 314 cycles
33% -> 264 cycles
33% -> 201 cycles
---------------
Average -> 260 cycles

If the main priority is the speed, we can even determine the design of the tiles to force an extreme optimization case.

Regards and sorry for the brick.

Rhino

arnoldemu · 17:49, 09 June 14

Operation Wolf uses methods like these.

The lower half of the scrolling area has a fixed pattern I think.
So custom tile drawing method here.

The upper half also generally repeats. Probably just does a wrap scroll.

All these are excellent if you have enough ram.

The beat em up I am working on is already using loads of ram because of the sheer amount of animation moves.

Approx 40K is already in use for animation frames. Another 10K is for map.
I need 2 *16KB double buffer, that's another 32k. I need some space for music - another 6k/8k.

I also need space for good ai.

So it's going to be a tough one to optimise

Rhino · 23:58, 09 June 14

Quote from: arnoldemu on 17:49, 09 June 14
So it's going to be a tough one to optimise

Well, those are the challenges we like

BTW, note that the 2kb required for the jump table has many unused bytes that can be used to store variables or other data.
So the really additional memory required could be less than 4kb. As it is also possible to use the amount of optimized tiles that best suits your needs.

However, implementing this would require a nice work in tool, in order to calculate the most frequently repeated tiles, and the repeated bytes in the tiles too, to export the tiles as optimized code.

CraigsBar · 00:04, 10 June 14

Quote from: Rhino on 23:58, 09 June 14
Well, those are the challenges we like

Or rather, once those challenges are overcome, those are the results we like.

Craig.

Sent from my HTC One_M8 using Tapatalk

redbox · 10:03, 10 June 14

Quote from: Rhino on 16:39, 09 June 14
So the idea is to use a generic routine only to draw the tiles less common, and specific fast routines for common tiles.

So by using a combination of compiled tiles and a fast stack based generic tile routine we reduce our average considerably.

I was wondering how to jump to the relevant routines but you've explained it nicely with the jumptable idea and how to structure the jumps after that.

I will build a complete tile routine as an example and when finished post it here.

News:

Optimisation Corner

Rhino

arnoldemu

Rhino

CraigsBar

redbox