Ok, here we go, (calculations are done quickly and may not be exact):

3 things that can be exploited to draw tiles:

1) Usually there tiles that appear more frequently than others.

2) Usually there tiles with a single color, or simple meshes patterns.

3) In the gfxs of tiles, there are byte values that are repeated frequently.

For example:

* "Addams family": Brick tile is repeated frequently (about 40-50%). Full black and green bytes very common in tiles.

* "La Guerra de Gamber": Single color tiles repeated frequently (black, blue, about 20-30%), analyzing a complex tile, it is found that there are many repetitions of the same bytes.

So the idea is to use a generic routine only to draw the tiles less common, and specific fast routines for common tiles.

For example, imagine we have 256 different tiles, of which 16 of them are repeated more frequently and, in practice, they are about the 30% of the tiles that are drawn on the screen. These very common 16 tiles can be drawn with its own routine, allowing further optimization. For example, this code used for the generic routine:

pop de : ld (hl),e : inc l : ld (hl),d : inc l

pop de : ld (hl),e : inc l : ld (hl),d

Could be replaced by:

ld (hl),xx: inc l: ld (hl),xx: inc l

ld (hl),xx: inc l: ld (hl),xx

Saving 2 cycles per line of tile = 32 + 13 cycles by not having to manage the stack.

In addition, a few extra cycles can be saved due to the repeated bytes in the gfxs of the tiles.

For example, because we do not use DE now, if a byte is repeated more than 2 times in the gfx tile, we can do LD D,XX and replace the corresponding LD (HL),XX to LD (HL),D. And if you have a couple of bytes that are repeated more than 2 times, do LD DE,XXXX. In the example of the brick tile of "La Guerra de Gamber", we could do LD DE,xxxx with the values for the bytes repeated 11 and 8 times.

Besides this, if we can use BC, the two most common byte values in general can be stored in BC in advance. So that is not even necessary to load those values in DE for each draw routine. A candidate for this is the 0 value (background color that in "La Guerra de Gamber" was repeated 8 times in the brick tile and a very common color in general).

By my calculations, to draw one of these tiles without any repeated byte we need 272 cycles. If we consider the common bytes in DE and BC, I think about 260 cycles per tile may be a conservative estimate.

But for some exceptional tiles that consist of a single color or simple meshes, the routine can be much faster, for example something like this:

Tile of a single color stored in BC:

ex hl,de

ld (hl),b : inc l : ld (hl),b : inc l

ld (hl),b : inc l : ld (hl),b : inc l

set 3,h

di

ld sp,hl

push bc

push bc

set 4,h

ld sp,hl

push bc

push bc

...

This draws a tile in 200 cycles aprox.

The downside to this speed increase is the additional memory required by the code of the specific routines that draw the 16 most frequent tiles, which by my calculations are about 2kb.

Now, we just have to call the appropriate routine according to the tile should be drawn with a generic or specific routine. The idea is to replace the pointer to the tile gfx by a pointer to the routine that draws the tile, for which we need a pointer for each routine that draws a tile.

The implementation of this is closely related to the way in which the tile index are stored in the tile map. If a tile is stored as 2 bytes, may be the exact pointer to the draw routine, which would be faster. And if a tile is stored as 1 byte, we will need to do a little calculation to convert the value from 0-255 to the pointer to the routine (in the same way that was done to get the tile gfx pointer).

Assuming that a tile is stored as 1 byte, and for there to be a pointer for each tile, it takes a jump table for the 256 tiles. In this table, we can put first the less common 240 tiles that are drawn with the generic routine, and then the 16 most common tiles that are drawn with specific routines, leaving for last the most common tile of all. This table would be something like this:

; Draw the tile number 1 with the generic routine

align 8

draw_tile1:

di

ld sp,xx ; xx is the pointer to the gfx of the tile number 1

jp (ix) ; ix is the pointer to the generic routine

; Draw the tile number 2 with the generic routine

align 8

draw_tile2:

di

ld sp,xx

jp (ix)

...

; Draw the tile number 240 with the specific routine

align 8

draw_tile240:

jp draw_tile240_implementation

...

; Draw the tile number 255 with the specific routine

align 8

draw_tile255:

; jp not required here

...

For 256 tiles, this table requires 2kb.

Then, this is the routine that draws the tiles:

FnDrawTilemap_Draw:

jp (hl)

So that when you call FnDrawTilemap_Draw, it jump to (hl), which is the pointer to the table described above. These jumps need 4 additional cycles for both generic and specific routines (except for the most common tile that we put at the end of the table, where only 1 additional cycle is required).

Now, to evaluate the result, let's summarize the costs and estimate speed savings rate on several assumptions.

* Cost for 16 optimized tiles:

- Additional Memory = 4kb.

- BC and IX usage.

* Estimation of average speed for each routine:

- Generic routine = 314 cycles (assuming it's your same routine with the optimizations mentioned in the previous post and adding the 4 additional cycles required by the jump table).

- Specific routines = 264 cycles.

- Specific routines for one color or simple meshes = 201 cycles.

* Speed estimation for 70% of less common tiles, 30% of optimized tiles (of which 5% are tiles of a single color or mesh):

70% -> 314 cycles

25% -> 264 cycles

5% -> 201 cycles

---------------

Average -> 296 cycles

* Speed estimation for 50% of less common tiles, 50% of optimized tiles (of which 10% are tiles of a single color or mesh):

50% -> 314 cycles

40% -> 264 cycles

10% -> 201 cycles

---------------

Average -> 283 cycles

* Speed estimation for 33% of less common tiles, 66% of optimized tiles (of which 33% are tiles of a single color or mesh):

33% -> 314 cycles

33% -> 264 cycles

33% -> 201 cycles

---------------

Average -> 260 cycles

If the main priority is the speed, we can even determine the design of the tiles to force an extreme optimization case.

Regards and sorry for the brick.

Rhino