Partial Masked Sprite copy to another : optimization

Arnaud · 20:22, 13 November 15

Hello,
I have written a function to double-buffer part of my screen.

I draw the room background and its content into a back-buffer sprite, then I draw that sprite to the screen with a CPCtelera sprite function.
I also use the double buffer to scroll between two rooms.

Here is my function to copy a masked sprite to my back buffer. I call it once for each row of the sprite to be copied:

void BlitMasked(UCHAR* pDest, UCHAR* pSprite, UCHAR pWidth)
{
    UCHAR posPixDest;
    UCHAR byteMask, byteColor;
    
    for (posPixDest = 0; posPixDest < pWidth; posPixDest++)
    {
        byteMask = *pSprite++;
        byteColor = *pSprite++;

        pDest[posPixDest] = (pDest[posPixDest] & byteMask) | byteColor;
    }
}

How can I optimize this function to run faster?

Thanks.
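
For context, here is a minimal sketch of what the per-row calling pattern described above might look like. It assumes a linear back buffer whose rows are destStride bytes apart and sprite data stored as interleaved (mask, colour) byte pairs; BlitMaskedRow, BlitMaskedSprite, height and destStride are hypothetical names for illustration, not taken from the original program or from CPCtelera.

```c
#include <assert.h>

typedef unsigned char UCHAR;

/* Row blit as in the post: the source holds (mask, colour) byte pairs. */
static void BlitMaskedRow(UCHAR *pDest, const UCHAR *pSprite, UCHAR pWidth)
{
    UCHAR i;
    for (i = 0; i < pWidth; i++) {
        UCHAR byteMask = *pSprite++;
        UCHAR byteColor = *pSprite++;
        pDest[i] = (UCHAR)((pDest[i] & byteMask) | byteColor);
    }
}

/* Hypothetical caller: blits `height` rows into a linear back buffer
   whose rows are `destStride` bytes apart (an assumption of this sketch). */
static void BlitMaskedSprite(UCHAR *pDest, const UCHAR *pSprite,
                             UCHAR width, UCHAR height, UCHAR destStride)
{
    UCHAR row;
    assert(width <= destStride);   /* a row must fit inside the buffer row */
    for (row = 0; row < height; row++) {
        BlitMaskedRow(pDest, pSprite, width);
        pDest += destStride;       /* next row of the back buffer */
        pSprite += 2 * width;      /* two source bytes per destination byte */
    }
}
```

Each call costs parameter passing and loop setup, so the per-row overhead adds to the per-byte cost inside the row loop.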

arnoldemu · 18:16, 14 November 15

Converting it to asm would make it faster.

If you want to keep it in C:

void BlitMasked(UCHAR* pDest, UCHAR* pSprite, UCHAR pWidth)
{
    UCHAR byteMask, byteColor;
    int count = pWidth;
    UCHAR *pDestPtr = pDest; /* take initial copy */

    /* compare count against 0 */
    while (count!=0)
    {
        byteMask = *pSprite++;
        byteColor = *pSprite++;

        *pDestPtr = (*pDestPtr & byteMask) | byteColor;
        ++pDestPtr; /* increment it for each byte */
        --count; /* decrement count */
    }
}

Changes:
1. Count down. It is quicker to compare against 0 than against another number. An equal/not-equal comparison is also quicker than a more complex comparison such as less than or greater than.
2. Keep a local pointer (pDestPtr), initialise it and increment it.
It is quicker to use a pointer like this than pDest[index]. For a uchar pointer, pDest[index] is equivalent to *(pDest+index); by using a pointer you avoid that add. For a uword pointer the byte address is pDest+(index<<1), so there you avoid an add and a shift/multiply by two.
3. Sometimes it is quicker to use pre-increment/pre-decrement. I'm not sure whether that holds for SDCC on the Z80.
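
Before timing anything, it may be worth checking that the indexed and the pointer-based variants really write the same bytes. The sketch below is a host-side check under stated assumptions: BlitMaskedIndexed, BlitMaskedPointer and BlitVariantsAgree are hypothetical names, UCHAR is assumed to be unsigned char, and the 64-byte scratch size is arbitrary.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char UCHAR;

/* Original form: indexed destination, counting up. */
static void BlitMaskedIndexed(UCHAR *pDest, const UCHAR *pSprite, UCHAR pWidth)
{
    UCHAR i;
    for (i = 0; i < pWidth; i++) {
        UCHAR byteMask = *pSprite++;
        UCHAR byteColor = *pSprite++;
        pDest[i] = (UCHAR)((pDest[i] & byteMask) | byteColor);
    }
}

/* Suggested form: moving pointer, counting down to zero. */
static void BlitMaskedPointer(UCHAR *pDest, const UCHAR *pSprite, UCHAR pWidth)
{
    UCHAR count = pWidth;
    while (count != 0) {
        UCHAR byteMask = *pSprite++;
        UCHAR byteColor = *pSprite++;
        *pDest = (UCHAR)((*pDest & byteMask) | byteColor);
        ++pDest;
        --count;
    }
}

/* Returns 1 when both variants leave identical bytes in the destination. */
static int BlitVariantsAgree(const UCHAR *pInitial, const UCHAR *pSprite, UCHAR pWidth)
{
    UCHAR a[64], b[64];            /* scratch copies; size is arbitrary */
    assert(pWidth <= sizeof a);
    memcpy(a, pInitial, pWidth);
    memcpy(b, pInitial, pWidth);
    BlitMaskedIndexed(a, pSprite, pWidth);
    BlitMaskedPointer(b, pSprite, pWidth);
    return memcmp(a, b, pWidth) == 0;
}
```

This only confirms that the two forms behave the same; which one is actually faster depends on the code SDCC generates, so the compiler's asm output is the place to look for that.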

For asm the code looks a bit like this:

BC is your count set to width and decremented each time around the loop
DE is pDestPtr
HL is your pSprite

loop:
ld a,(de) ;; same as *pDestPtr
and (hl)  ;; same as & *pSprite
inc hl ;; same as ++pSprite;
or (hl) ;; same as | *pSprite
inc hl ;; same as ++pSprite;
ld (de),a ;; same as *pDestPtr = 
inc de ;; same as ++pDestPtr
dec bc ;; same as --count;
jp nz,loop ;; same as while (count!=0)

Arnaud · 18:59, 14 November 15

Thanks a lot, I'll study your solutions with care.

cpcitor · 10:58, 07 December 15

Quote from: arnoldemu on 18:16, 14 November 15
dec bc ;; same as --count;
jp nz,loop ;; same as while (count!=0)

Warning: dec bc does not set the Z flag.

So the loop will not work as expected.

References, including solutions:

Z80 Instruction Set - WikiTI says "dec Q" does not change any flag, Q means 16-bit register
Infinite Loop states the problem
16-bit counters offers a classical solution
Looping with 16 bit counter offers a faster variant (the principle is good, not sure their code is correct, though, seems to be missing some parts).

Executioner · 11:17, 07 December 15

Given that the original code had count as UCHAR (8 bit unsigned?), the count could just be in register B and DJNZ could be used.
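
In C, the closest shape to a DJNZ loop is a do/while that decrements an 8-bit counter and repeats while it is non-zero. The following is only a sketch: BlitMaskedDjnzStyle is a hypothetical name, and whether SDCC actually emits DJNZ for it is an assumption that should be checked in the generated asm.

```c
#include <assert.h>

typedef unsigned char UCHAR;

/* Mirrors a DJNZ loop: 8-bit counter, decrement, repeat while non-zero.
   Like DJNZ, a starting count of 0 would wrap and run 256 times,
   so this sketch rejects a width of 0. */
static void BlitMaskedDjnzStyle(UCHAR *pDest, const UCHAR *pSprite, UCHAR pWidth)
{
    UCHAR count = pWidth;
    assert(count != 0);
    do {
        UCHAR byteMask = *pSprite++;
        UCHAR byteColor = *pSprite++;
        *pDest = (UCHAR)((*pDest & byteMask) | byteColor);
        ++pDest;
    } while (--count != 0);
}
```

The test sits at the bottom of the loop, so there is one conditional branch per byte and no separate check on entry.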

Ast · 11:48, 07 December 15

Quote
BC is your count set to width and decremented each time around the loop

Use just B instead of BC, because I don't think the width will be greater than 255 bytes!
Then B will be your counter, and you can replace "dec bc : jr nz,xxxx" with a simple "djnz xxxx".

....

Urusergi · 16:15, 07 December 15

Quote from: cpcitor on 10:58, 07 December 15

Looping with 16 bit counter offers a faster variant (the principle is good, not sure their code is correct, though, seems to be missing some parts).

It works perfectly but is outdated. My optimization is here (in Variable length loops):
Z80 programming techniques - Loops

Executioner · 00:58, 08 December 15

Quote from: Urusergi on 16:15, 07 December 15
It works perfectly but is outdated. My optimization is here (in Variable length loops):
Z80 programming techniques - Loops

Interesting, but why use B and D instead of B and C? ld bc,#0a03 would loop #20a times. This does become a little tricky to convert from an initial value in BC, since you can't swap the registers B and C directly, so you'd have to do ld a,b:inc a:ld b,c:ld c,a or something.

I'm also not sure why you bother disabling interrupts before the self-modifying code. Even if your interrupt routine could corrupt the A register, you'd need the DI before you set it.
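
The iteration count of the B/C split discussed above can be checked with a small host-side C model. This is a sketch, not Z80 code: NestedLoopIterations and SplitCount are hypothetical names, and the model assumes the loop shape "djnz loop : dec c : jr nz,loop" with B as the inner and C as the outer counter.

```c
#include <assert.h>

/* Counts the iterations of the assumed Z80 pattern
       loop: ... : djnz loop : dec c : jr nz,loop
   An initial 0 in either register behaves as 256. */
static unsigned long NestedLoopIterations(unsigned char b, unsigned char c)
{
    unsigned long iterations = 0;
    do {
        do {
            ++iterations;
        } while (--b != 0);   /* djnz: B ends at 0, so later passes run 256 times */
    } while (--c != 0);       /* dec c : jr nz */
    return iterations;
}

/* Hypothetical helper: splits a count of 1..65535 into B (inner)
   and C (outer) so that the loop above runs exactly n times. */
static void SplitCount(unsigned int n, unsigned char *b, unsigned char *c)
{
    assert(n >= 1 && n <= 65535u);
    *b = (unsigned char)(n & 0xFF);
    *c = (unsigned char)(((n - 1) >> 8) + 1);
}
```

In this model B=#0a, C=#03 gives 10 + 256 + 256 = #20a iterations. Taking the high byte plus one for the outer counter matches that, except when the low byte of the count is zero, where it would run 256 iterations too many; the (n - 1) term in SplitCount covers that case.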
