[CPCTelera] Trying to make a Gauntlet like engine

Started by Arnaud, 11:29, 25 February 18


Arnaud

Hello,
I'm working with CPCtelera on a game engine similar to "Gauntlet", with the same constraints:

  • Tiles are square, 8x16
  • The background is a solid color
  • No transparency for most sprites
My problem is that my engine is much slower than the original, and I'm trying to find the main reason. What is remarkable is that Gauntlet is always fast (with or without scrolling), even when the screen is full of monsters.
It's in these cases that I see how well Gauntlet was programmed (even more so when compared to my own code  ;) )

I read this post carefully: http://www.cpcwiki.eu/forum/games/gauntlet-how-come-cpc-version-manages-to-run-without-sluggish-sprites/ and tried a few of its tricks, but with no real success compared to Gauntlet.

Here is the context:

  • One level is 32x32 tiles
  • The view has 20x10 tiles to draw (Gauntlet has 16x10)
  • Monsters and projectiles are drawn separately and have the same size as tiles
  • The entire view (not the whole screen) is cleared to the background color every frame
From a technical point of view:

  • Double buffering is used
  • The cpct_drawSprite function is used for tiles, and cpct_drawTileAligned4x8 / cpct_drawTileAligned4x4_f for specific cases
  • To clear the view I use CPCtelera's fastest method, cpct_memset_f64, which uses the stack but disables interrupts, which I would need for music/sound, for example.
This is the setup that gives me the best performance.

What I've tried:

  • Drawing empty tiles as blocks of background color -> works, but incredibly slow
  • Using a buffer that records which tiles are visible on screen, to know which ones to erase. Not great at all: I had a 20x10 buffer to update and read every frame, and I also had to erase the trails of the other elements (monsters, projectiles) -> I could not get it working correctly and gave up, because it was slow and took a lot of memory.
Here is a simplified version of the map drawing code:

void DrawMap()
{
    const u8* tileMap = GetStartingMap();   // First map tile of the view

    for (u8 y = 0; y <= NB_TILES_CY; y++)
    {
        u8 cy = ComputeSizeYOfTile();              // Height to draw for this row of tiles
        u8* vmem = GetMemPointerOfCurrentLine();   // Video memory pointer for this row

        if (cy != 0)
        {
            for (u8 x = 1; x < NB_TILES_CX; x++)
            {
                u8 tile = *tileMap;
                if (tile != DGN_GROUND)            // Ground tiles are not drawn
                {
                    u8* tileSprite = GetTileset(tile);
                    u8* video = vmem + x * TILE_CX;

                    switch (cy)
                    {
                        case 8:  cpct_drawTileAligned4x8(tileSprite, video);   break;
                        case 4:  cpct_drawTileAligned4x4_f(tileSprite, video); break;
                        default: cpct_drawSprite(tileSprite, video, TILE_CX, cy);
                    }
                }
                tileMap++;
            }
            tileMap++;
        }
        else
            tileMap += NB_TILES_CX;   // Nothing to draw on this row: skip its tiles

        tileMap += (MAP_CX - NB_TILES_CX);   // Skip the rest of the map row, outside the view
    }
}



What I was thinking of:

  • Writing a specific asm function for 8x16 sprites (I don't know whether drawing can be optimized for a fixed sprite size)
  • Using a better implementation of the dirty buffer
  • Handling monsters and projectiles as regular map tiles
All ideas are welcome  :D

Thanks,
Arnaud.

Targhan

- Did you try disassembling the original engine to see how it works?
- Don't worry about the stack being used by your code: it is possible to work around that. I don't remember the details, but a flag is raised when an interrupt SHOULD have been triggered. So after your "DI / code using SP", you can do an EI, wait one NOP, and your interrupt code should run if it was meant to. If you do this often enough, your interrupt code will run just fine, with maybe a small delay depending on how long the code ran with interrupts disabled.


Arnaud

Quote from: Targhan on 15:34, 25 February 18
- Did you try disassembling the original engine to see how it works?

I tried, but I'm not skilled enough in assembly to understand a whole game routine.

Quote from: Targhan on 15:34, 25 February 18
- Don't worry about the stack being used by your code: it is possible to work around that. I don't remember the details, but a flag is raised when an interrupt SHOULD have been triggered. So after your "DI / code using SP", you can do an EI, wait one NOP, and your interrupt code should run if it was meant to. If you do this often enough, your interrupt code will run just fine, with maybe a small delay depending on how long the code ran with interrupts disabled.

Good news  :)


ronaldo

Checking Gauntlet 2 is not so difficult. The code is relatively easy to follow if you look for specific parts. For instance, you can verify that it actually erases buffers using a routine similar to cpct_memset_f64, located at 0CF6h in the code (you only need to run the game, bring up the debugger and put a breakpoint there). However, this routine has several interesting differences:

  • It uses 32 consecutive PUSH DE instructions to clear 1 scanline at once (Gauntlet uses a 32-character-wide screen mode)
  • It disables and re-enables interrupts at the start and end of every scanline being cleared.
  • It takes around 25400 microseconds to clear the whole screen buffer (~1.27 VSYNCs)
  • It goes scanline by scanline, in order, because the spare bytes of video memory are used to store sprites, so they must not be erased.
The main game loop starts at 0206h. I guess that the first 4 calls set up drawing and update the score. The fifth call (to 04E8h) draws the tiles that are not on the borders (and so do not need clipping). The next calls draw what requires clipping, then entities, entities with clipping, the player and, at the end, the shots.

The actual drawing happens in several functions. The main one is at 0277h. It's a self-modifying function that can draw 4xYY-byte sprites (16xYY pixels), with 2 <= YY <= 16, either character-aligned or starting in the middle of a character, with some clever substitution of JP (IX) for 2 NOPs that lets the function return earlier or later. It uses JP (IX) to return because the stack pointer is used to read the sprite with POP DE. So it reads 2 bytes at a time from the sprite and writes them with LD (HL), E: INC L: LD (HL), D. Then it uses ADD HL, BC, with BC properly set to jump from one scanline to the next. It also uses LD DE, #C83D: ADD HL, DE to jump from the last scanline of a character to the first scanline of the next one.
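The pointer arithmetic described above can be checked with a small C model. This is only a sketch under assumptions that are not in the original post: the screen is taken to start at 0xC000, scanlines of one character row are 0x800 bytes apart, a 32-character-wide row is 64 bytes, and HL sits on the last of the 4 bytes written (3 INC L) when the jump is applied. Under those assumptions the constant #C83D comes out naturally.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout constants (not taken from the game's code). */
#define SCANLINE_STEP 0x0800u  /* bytes between two scanlines of one character row */
#define ROW_BYTES     64u      /* 32 characters x 2 bytes per character */
#define SPRITE_BYTES  4u       /* sprite width in bytes */

/* Value BC would need for ADD HL, BC: HL is on the last byte written,
   i.e. SPRITE_BYTES - 1 bytes past the start of the sprite line. */
uint16_t scanline_jump(void)
{
    return (uint16_t)(SCANLINE_STEP - (SPRITE_BYTES - 1u));
}

/* Jump from scanline 7 of one character row to scanline 0 of the next:
   go back 7 scanlines, forward one row, and undo the 3 increments.
   The 16-bit wrap-around turns the negative offset into 0xC83D. */
uint16_t char_row_jump(void)
{
    return (uint16_t)(ROW_BYTES - 7u * SCANLINE_STEP - (SPRITE_BYTES - 1u));
}

/* Apply a 16-bit jump to HL, as ADD HL, rr does. */
uint16_t add_hl(uint16_t hl, uint16_t rr)
{
    assert(SPRITE_BYTES >= 1u);
    return (uint16_t)(hl + rr);
}
```

For example, starting at 0xC000, HL is 0xC003 after one sprite line, and `add_hl(0xC003, scanline_jump())` gives 0xC800, the same column on the next scanline.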

It is decently fast and configurable through self-modifying code. Clipping at the sides is easy thanks to the game constraints: horizontal movements are byte-aligned, and vertical ones are 2-byte jumps that produce 2-pixel movements in either direction. The previous drawing function is easily reused for top/bottom clipping by putting the JP (IX) earlier or later, and then pointing to the start of the sprite or to its middle. A different function, with a loop at 04AFh, is used for left/right clipping.

All these functions rely heavily on self-modifying code. Each function has a setup stage in which it is modified, and is then used many times for all the sprites: 1 modification for many sprites. The loop needs at least 5 setups: general sprites, and sprites clipped in each of the 4 directions.
All this is built into a drawing routine that draws everything in approximately 27000-28000 microseconds. The whole game loop takes 59000-64000 microseconds, yielding approximately 15-17 FPS.
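The timing figures in this post can be cross-checked with simple arithmetic. The sketch below assumes a CPC frame of 312 scanlines of 64 microseconds each (19968 microseconds); the function names are hypothetical and integer math is used so results are exact.

```c
#include <assert.h>

/* Assumed duration of one frame: 312 scanlines x 64 microseconds. */
#define FRAME_US 19968u

/* Duration expressed in hundredths of a frame (127 means 1.27 VSYNCs). */
unsigned vsyncs_x100(unsigned duration_us)
{
    return duration_us * 100u / FRAME_US;
}

/* Frames per second, in tenths (169 means 16.9 FPS), for a given loop time. */
unsigned fps_x10(unsigned loop_us)
{
    assert(loop_us > 0u);
    return 10000000u / loop_us;
}
```

With these, `vsyncs_x100(25400)` gives 127, and `fps_x10(59000)` and `fps_x10(64000)` give 169 and 156, matching the ~1.27 VSYNCs and 15-17 FPS quoted above.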
To get there, the main points for you are as follows:

       
  • If you program in C, you need much more optimized code. Your drawing code will be very slow because the compiler has to use the stack/memory many times to save and retrieve your variables.
  • Even with great C code, direct assembler will always have an edge.
  • Using the standard video mode and drawing functions is slower than a 32-character-wide mode; however, you need your own drawing functions for that mode.
  • You may improve your drawing by not having to check whether tiles are background, keeping instead an array of tiles with their locations. Iterating over many tiles that are only background, just to check them and do nothing, is quite slow.
You may get an approximation without too much effort if you make movements character-aligned. With that you may get a first fast version in C, with standard CPCtelera (without adding new functions) and without modifying CRTC values (standard mode). You may start there and then design improvements.
After seeing the drawing routine, I personally think that Gauntlet can be made much faster on the CPC if you forget about easy portability to other platforms. In any case, it is actually quite good.
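One possible reading of the "tile array with locations" suggestion is sketched below in portable C. It is not taken from Gauntlet or CPCtelera: the `TileEntry` / `TileIndex` types, the function names and the value of `DGN_GROUND` are all assumptions for illustration. The index is built once per level, and each frame only the non-ground entries of the visible rows are visited.

```c
#include <assert.h>
#include <stdint.h>

#define MAP_CX 32       /* level width in tiles (from the first post) */
#define MAP_CY 32       /* level height in tiles (from the first post) */
#define DGN_GROUND 0    /* assumed id of the background tile */

typedef struct { uint8_t x; uint8_t tile; } TileEntry;

typedef struct {
    TileEntry entries[MAP_CX * MAP_CY];  /* worst case: no ground tile at all */
    uint16_t  row_start[MAP_CY + 1];     /* row y owns [row_start[y], row_start[y+1]) */
} TileIndex;

typedef void (*DrawFn)(uint8_t view_x, uint8_t view_y, uint8_t tile);

/* Build the index once per level: non-ground tiles only, sorted by x per row. */
void build_index(const uint8_t *map, TileIndex *idx)
{
    uint16_t n = 0;
    for (uint8_t y = 0; y < MAP_CY; y++) {
        idx->row_start[y] = n;
        for (uint8_t x = 0; x < MAP_CX; x++) {
            uint8_t t = map[y * MAP_CX + x];
            if (t != DGN_GROUND) {
                idx->entries[n].x = x;
                idx->entries[n].tile = t;
                n++;
            }
        }
    }
    idx->row_start[MAP_CY] = n;
}

/* Visit the tiles inside the view; returns how many were visited.
   'draw' may be null, in which case tiles are only counted. */
uint16_t draw_view(const TileIndex *idx, uint8_t vx, uint8_t vy,
                   uint8_t vw, uint8_t vh, DrawFn draw)
{
    assert(vx + vw <= MAP_CX && vy + vh <= MAP_CY);
    uint16_t drawn = 0;
    for (uint8_t y = vy; y < vy + vh; y++) {
        for (uint16_t i = idx->row_start[y]; i < idx->row_start[y + 1]; i++) {
            uint8_t x = idx->entries[i].x;
            if (x < vx) continue;        /* left of the view */
            if (x >= vx + vw) break;     /* entries are sorted: the rest is to the right */
            if (draw) draw((uint8_t)(x - vx), (uint8_t)(y - vy), idx->entries[i].tile);
            drawn++;
        }
    }
    return drawn;
}
```

The trade-off is memory: the worst-case index here takes about 2 KB, which is significant on a CPC, so a real implementation would probably size it to the densest level rather than to the full map.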

Arnaud

Thanks a lot @ronaldo for analysing the Gauntlet 2 code, it's really interesting.

As I guessed, there are big speed gains to achieve. I'll work on your suggested improvements.

And if I can write some specific assembly code, there is more speed to gain; I really have to think about it.

Arnaud

Quote from: ronaldo on 18:51, 25 February 18
For instance, you can verify that it actually erases buffers using a routine similar to cpct_memset_f64, located at 0CF6h in the code (you only need to run the game, bring up the debugger and put a breakpoint there). However, this routine has several interesting differences:

  • It uses 32 consecutive PUSH DE instructions to clear 1 scanline at once (Gauntlet uses a 32-character-wide screen mode)
  • It disables and re-enables interrupts at the start and end of every scanline being cleared.
  • It takes around 25400 microseconds to clear the whole screen buffer (~1.27 VSYNCs)
  • It goes scanline by scanline, in order, because the spare bytes of video memory are used to store sprites, so they must not be erased.
I tried to write a similar function and I have a question:

When interrupts are re-enabled after being disabled, should the new value of SP be saved? Could it have been modified during the interrupt?


DI
SAVE SP

WHILE
{
  ?? Need to store current value of SP ??

  DI
  USE STACK
  EI
}


Restore SP
Return


On the other hand, I don't really see a speed improvement over the original code  :( :

    u8 nLine;
    for (nLine = 0; nLine < 8; nLine++)
        cpct_memset_f64(vmem + nLine * 2048, 0, 0x640);
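Why 8 block fills of 0x640 bytes clear the top 20 character rows follows from the standard CPC screen layout, where each 2048-byte block holds one pixel line of every character row. The sketch below models this in portable C, with standard `memset` standing in for cpct_memset_f64; the function names are hypothetical and the layout constants assume the default 80-byte-wide screen.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BANK_SIZE  0x0800u  /* 2048 bytes: pixel line n of every character row */
#define LINE_BYTES 80u      /* bytes per pixel line in the standard screen mode */

/* Offset of byte x on pixel line y, for the standard (assumed) layout. */
uint16_t screen_offset(uint8_t x, uint8_t y)
{
    assert(x < LINE_BYTES && y < 200);
    return (uint16_t)((y % 8u) * BANK_SIZE + (y / 8u) * LINE_BYTES + x);
}

/* Clear the first char_rows character rows with 8 contiguous block fills. */
void clear_rows(uint8_t *vmem, uint8_t char_rows, uint8_t value)
{
    assert(char_rows <= 25);
    for (uint8_t bank = 0; bank < 8; bank++)
        memset(vmem + bank * BANK_SIZE, value, (size_t)char_rows * LINE_BYTES);
}
```

With 20 character rows, each fill is 20 x 80 = 1600 = 0x640 bytes, which is the size used in the loop above; pixel lines 0 to 159 end up cleared and line 160 onwards is untouched.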

Docent

Quote from: Arnaud on 19:20, 26 February 18
I tried to write a similar function and I have a question:

When interrupts are re-enabled after being disabled, should the new value of SP be saved? Could it have been modified during the interrupt?


DI
SAVE SP

WHILE
{
  ?? Need to store current value of SP ??

  DI
  USE STACK
  EI
}


Restore SP
Return


It should be more like this:

WHILE
{
  DI
  Save SP
  Clear line with SP
  Restore SP
  EI
  Calc next line offset etc.
}
Return
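The loop above can be modelled in C to see why SP is saved and restored inside each iteration. This is a simulation only, not Z80 code: the `Z80Model` struct and the function names are hypothetical, and the interrupt flag is just a variable that the model checks. The point it illustrates is that SP only points into video memory while interrupts are disabled, so an interrupt always finds the real stack.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t *mem;          /* simulated 64 KB address space */
    uint16_t sp;           /* stack pointer */
    int      int_enabled;  /* 1 after EI, 0 after DI */
} Z80Model;

/* Model of PUSH: high byte first, then low byte, SP moving downwards. */
static void push16(Z80Model *z, uint16_t v)
{
    /* An interrupt here would push its return address into video memory. */
    assert(!z->int_enabled);
    z->mem[--z->sp] = (uint8_t)(v >> 8);
    z->mem[--z->sp] = (uint8_t)(v & 0xFFu);
}

/* Clear 'lines' lines of 'line_bytes' bytes, 'line_step' bytes apart. */
void clear_lines(Z80Model *z, uint16_t first_line, uint16_t line_step,
                 uint8_t line_bytes, uint8_t lines, uint16_t fill)
{
    assert(line_bytes % 2 == 0);
    uint16_t line = first_line;
    for (uint8_t n = 0; n < lines; n++) {
        z->int_enabled = 0;                     /* DI */
        uint16_t saved_sp = z->sp;              /* Save SP */
        z->sp = (uint16_t)(line + line_bytes);  /* SP at the end of the line */
        for (uint8_t i = 0; i < line_bytes / 2; i++)
            push16(z, fill);                    /* Clear line with SP */
        z->sp = saved_sp;                       /* Restore SP */
        z->int_enabled = 1;                     /* EI: a pending interrupt may run now */
        line = (uint16_t)(line + line_step);    /* Calc next line offset */
    }
}
```

Because SP is restored before EI, an interrupt handler running between two lines pushes and pops on the normal stack and leaves SP unchanged on return, so nothing extra has to be stored.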


ronaldo

Quote from: Arnaud on 19:20, 26 February 18
On the other hand, I don't really see a speed improvement over the original code  :( :

    u8 nLine;
    for (nLine = 0; nLine < 8; nLine++)
        cpct_memset_f64(vmem + nLine * 2048, 0, 0x640);

If you are clearing the whole 16K of video memory, there is no speed gain in doing it by lines (in fact, it will be much slower). Gauntlet uses this technique for 2 reasons: they don't want to clear the whole 16K (their video memory takes less space, and they don't want to clear the scoreboard), and they need interrupts to keep working because of their split screen.

They may get a small gain from not clearing the whole screen and from having a 32-character-wide screen, but it isn't noticeable, because clearing the screen still takes them those enormous 25400 microseconds.

Arnaud

Hello,
I have improved my function so that it clears only part of the screen (I fill only the centered 72 bytes of each line).

Could someone take a look to see if it can be optimized? I think the dec sp instructions could be removed, but I have not found how  ::) .

...
    ;; Border Right 4-bytes
    dec sp            ;; [2]
    dec sp            ;; [2]
    dec sp            ;; [2]
    dec sp            ;; [2]
   
    ;; Fill Center View 72-bytes
    push de           ;; [4] Push 8-bytes
    push de           ;; [4] 
    push de           ;; [4]
    push de           ;; [4]
...


Thanks,
Arnaud

Docent

Quote from: Arnaud on 09:16, 18 March 18
Hello,
I have improved my function so that it clears only part of the screen (I fill only the centered 72 bytes of each line).

Could someone take a look to see if it can be optimized? I think the dec sp instructions could be removed, but I have not found how  ::) .

...
    ;; Border Right 4-bytes
    dec sp            ;; [2]
    dec sp            ;; [2]
    dec sp            ;; [2]
    dec sp            ;; [2]
   
    ;; Fill Center View 72-bytes
    push de           ;; [4] Push 8-bytes
    push de           ;; [4] 
    push de           ;; [4]
    push de           ;; [4]
...


Thanks,
Arnaud


Using ld bc, #0x0050-4 instead of ld bc, #0x0050 at the beginning should do the trick.
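The idea can be checked with a little arithmetic. The full routine is not shown in the thread, so this C sketch assumes that BC is the offset added to the start of a line to obtain the initial SP, and it takes the 2-microsecond cost of dec sp from the [2] annotations in the listing. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES   0x50u  /* 80 bytes per screen line */
#define BORDER_BYTES 4u     /* bytes left untouched on each side */
#define DEC_SP_US    2u     /* cost of one dec sp, per the [2] annotations */

/* Original approach: SP at the end of the line, then 4 x dec sp. */
uint16_t sp_start_with_dec(uint16_t line_start)
{
    uint16_t sp = (uint16_t)(line_start + LINE_BYTES);
    for (unsigned i = 0; i < BORDER_BYTES; i++)
        sp--;                                   /* dec sp */
    return sp;
}

/* Suggested approach: fold the border into the offset (0x0050 - 4). */
uint16_t sp_start_with_offset(uint16_t line_start)
{
    return (uint16_t)(line_start + (LINE_BYTES - BORDER_BYTES));
}

/* Microseconds saved per frame by dropping the dec sp instructions. */
unsigned saved_us(unsigned lines)
{
    assert(lines <= 200u);
    return lines * BORDER_BYTES * DEC_SP_US;
}
```

Both functions give the same starting SP (0xC04C for a line at 0xC000), and 36 pushes from there stop at 0xC004, leaving the 4-byte left border intact. For a 160-line view the saving would be 1280 microseconds per clear under these assumptions.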
