Plus sprite upload tips

Cwiiis · 19:05, 14 January 25

Sprite upload on the Plus is slow. Currently, this is my sprite upload code-

        ld b,16       ; Copy 128 bytes
        ld sp,hl      ; Point stack at end of ASIC sprite
        ex de,hl
    .LoadSpriteCopyLoop
        repeat 8      ; Each iteration writes 1 byte
            ld a,(hl)   ; (2)
            ld d,a      ; (1)
            repeat 4: rrca: rend    ; (4)
            ld e,a      ; (1)
            push de     ; Write 2 nibbles (4)
            dec hl      ; (2)
        rend
        djnz .LoadSpriteCopyLoop

`de` points to the sprite address + `#7F`.

I could save cycles by storing sprites unpacked, or further unrolling the loop, but I'm wondering... Does anyone have any other smart ideas? Delta sprites are out of the question really, though I'm sure they'd potentially save a lot of time and memory. For the most part, program memory is not a premium, so I have space to spare as long as it isn't a ludicrous difference.

andycadley · 19:42, 14 January 25

Unpacking them is the way to go if you can justify the waste of RAM. Assuming your sprites are page aligned, you should also be able to replace DEC HL with just DEC L, since it will never cross a page boundary.

If they're in ROM, you might get away with RLD to fetch data although it will trash RAM underneath. Alternatively using a lookup table to do the rotate might save some time.

roudoudou · 22:05, 14 January 25

i use generated code to do that, it's approx 4 times faster (depending on data... sometimes more, sometimes less)

Cwiiis · 01:08, 15 January 25

Quote from: andycadley on 19:42, 14 January 25Unpacking them is the way to go if you can justify the waste of RAM. Assuming your sprites are page aligned, you should also be able to replace DEC HL with just DEC L, since it will never cross a page boundary.

If they're in ROM, you might get away with RLD to fetch data although it will trash RAM underneath. Alternatively using a lookup table to do the rotate might save some time.

The idea of a lookup table was interesting enough for me to have a play, but I couldn't figure out a way to use one that wasn't slower, even with self-modifying code... I ended up with this as the inner loop:

Code Select

            ld a,(de)                   ; (2)
            ld (@DoLookup+1),a          ; (2)
        @DoLookup ld hl,(PixelLookup)   ; (5)
            push hl     ; Write 2 nibbles (4)
            dec e       ; (1)

Which is actually 1 nop slower, and costs a 256 byte lookup table... Did I miss something? (very likely

)

Axelay · 08:46, 15 January 25

For a lookup table I think this might work and would be a little faster. You need to have the high byte of the lookup table loaded in C and I think you might need to have your pixels packed the other way round in the bytes to what you have in your first example.

Code Select

ld a,(de)
ld l,a
ld h,c
ld h,(hl)
push hl
dec e

Cwiiis · 12:29, 15 January 25

I've decided to go with generated code, though it makes the sprites about 3x bigger in memory on average, the speed benefit is undeniable!

My current generator is naive, it uses the 3 16-bit registers and loops through the bytes of the image in reverse order - if one of the registers is equal to the byte, it writes a push, if not it writes a load and a push and rotates the queue of registers for the next unloaded byte.

It occurs to me that this is essentially a very basic form of compression and that there are probably better strategies to find the optimal sequence of load/push. Does anyone have any good ideas in that regard?

I suppose it's such a small amount of data and an offline process that you could just brute-force to find the best series. At every encountering of a new byte, you have 3 choices you can make with regards to what register to load it into (and thusly, what previous byte to discard). 128 bytes per sprite, the pathological case would be about 2 million checks to brute-force it, I think? Which is probably pretty quick on modern hardware...

Edit: A quick note, the architecture of my program makes it undesirable to use shadow registers - but I'm still interested in hearing in other people's experiences and what they're doing, even without that constraint!

Cwiiis · 15:33, 15 January 25

Think I got my powers mixed up, you can't brute-force this

Well, maybe you could with a GPU, but that's getting extreme - I'm going to try to devise a few strategies and perhaps I can pick the best one depending on the input.

norecess464 · 16:12, 15 January 25

With Sonic GX, I noticed that there isn't a universal method for managing sprites on the GX-4000.

For example, in the Bonus Stages, I relied on DIFF (delta) because it was the fastest approach, with uninterrupted sprite animations.

However, during the main gameplay (the platformer section), the situation is different. Here, sprite animations can be interrupted at any moment, so we can't wait for the current animation to complete its sequence, making the Delta approach unsuitable.

For certain elements, I use generated code, as @roudoudou mentioned. However, it consumes memory quickly, so it must be used with care.

If you are living in France, make sure to have a look later this year at the next issue of CPC FANZ BZH #3, as I wrote an article for it about a convenient way of handling sprites in games. (Of course, I don't want to approach/spoil its content here in this post).

Cwiiis · 16:17, 15 January 25

I added the option in my sprite compiler to reserve 1 or 2 registers for the most frequently occurring pixels in a sprite, then I compare all 3 and choose the strategy that ends up with the smallest sprite in memory - with my dataset, that brings average sprite size down from 335 bytes to 287 bytes - a pretty decent saving

I expect a better algorithm would involve doing a forward-search to decide if saving a register is worthwhile or not, which I will attempt to eke out a few more bytes/nops...

edit: Implemented forward-search and fixed a bug that I was reserving registers before actually encountering the pixel - down from 287 average to 282. Every little counts, I suppose

roudoudou · 17:56, 15 January 25

Quote from: Cwiiis on 15:33, 15 January 25Think I got my powers mixed up, you can't brute-force this Well, maybe you could with a GPU, but that's getting extreme - I'm going to try to devise a few strategies and perhaps I can pick the best one depending on the input.

"you can't brut force this"

why?

Cwiiis · 18:01, 15 January 25

Quote from: roudoudou on 17:56, 15 January 25
Quote from: Cwiiis on 15:33, 15 January 25Think I got my powers mixed up, you can't brute-force this Well, maybe you could with a GPU, but that's getting extreme - I'm going to try to devise a few strategies and perhaps I can pick the best one depending on the input.
"you can't brut force this"

why?

Because of combinatorial explosion? I may have gotten my numbers wrong and/or my code wrong of course, but it seemed that way.

roudoudou · 18:09, 15 January 25

Quote from: Cwiiis on 18:01, 15 January 25
Quote from: roudoudou on 17:56, 15 January 25
Quote from: Cwiiis on 15:33, 15 January 25Think I got my powers mixed up, you can't brute-force this Well, maybe you could with a GPU, but that's getting extreme - I'm going to try to devise a few strategies and perhaps I can pick the best one depending on the input.
"you can't brut force this"

why?

Because of combinatorial explosion? I may have gotten my numbers wrong and/or my code wrong of course, but it seemed that way.

if you avoid creating absurd branches, the combinatory is small (and we have a tremendous amount of memory for that small sprites)

Cwiiis · 18:11, 15 January 25

Quote from: roudoudou on 18:09, 15 January 25
Quote from: Cwiiis on 18:01, 15 January 25
Quote from: roudoudou on 17:56, 15 January 25
Quote from: Cwiiis on 15:33, 15 January 25Think I got my powers mixed up, you can't brute-force this Well, maybe you could with a GPU, but that's getting extreme - I'm going to try to devise a few strategies and perhaps I can pick the best one depending on the input.
"you can't brut force this"

why?

Because of combinatorial explosion? I may have gotten my numbers wrong and/or my code wrong of course, but it seemed that way.

if you avoid creating absurd branches, the combinatory is small (and we have a tremendous amount of memory for that small sprites)

What do you consider an absurd branch? Obviously you only create branches when pixels change and aren't covered in existing registers, but even then the combinations are huge...

roudoudou · 18:13, 15 January 25

absurd branch : when you already have a register for the data, use the register, do not explore other cases
there is many cases when you do not have the register ready to use but you can be smart

and use some auto-purge if needed, you will gain speed+memory

Cwiiis · 18:18, 15 January 25

Quote from: roudoudou on 18:13, 15 January 25absurd branch : when you already have a register for the data, use the register, do not explore other cases
there is many cases when you do not have the register ready to use but you can be smart
and use some auto-purge if needed, you will gain speed+memory

Right, this isn't brute-force at this point, this is a heuristic

Limited depth, breadth-first and pruning the least successful branches might be a good way of limiting the search space. I wonder how much more efficient it would be vs what I'm doing at the moment though...

Cwiiis · 13:29, 16 January 25

After chatting with @roudoudou, he made me realise I wasn't considering 8-bit register loads in my compiler... Now the average is down from 282 bytes to 228 bytes(/612 nops) with the same dataset. I'm very pleased to have broken the 256 byte barrier - of course it depends on the input, but hopefully that ratio mostly holds and these compiled sprites are now smaller than unpacked sprite data in memory, as well as being several times faster

So my compiler now does this:
1. Hold a list of the value of hl, de, bc
2. Inspect the sprite as unpacked words backwards
3. Do any of the above registers hold the value? Write 'push <reg>', go to 8
4. Can we make up the value by doing 1 8-bit register load? Write 'ld x,x: push <reg>', go to 8
5. Can we make up the value by doing 1 8-bit value load? Write 'ld x,n: push <reg>', go to 8
6. Can we make up the value by doing 2 8-bit register loads? Write 'ld x,x: ld x,x: push <reg>', go to 8
7. Write 'ld xy,<value>: push <reg>'
8. Inspect next word

Registers that can be modified are kept in a list. When one is modified, it's removed and put at the back of the list. When modifications need to occur, the front of the list is favoured.

On top of that, there's a 'try harder mode' - This now introduces the concept of reserving registers for the most frequently occurring words. Before doing the above, it creates a list of the most frequently occurring words in the sprite. Before step 7, it inspects the value of the register and if it's one of the most frequently occurring, removes it from the list of registers that can be modified. Once there are no more of that frequently occurring pixel, it adds it back to the front of the can-modify list. The compiler will try compiling the sprite with 0, 1 and 2 reserved slots and use the best result.

I can see various avenues to shave off some more bytes, though I believe it's diminishing returns now and I'm happy to have a <256 byte average. For example, it may save some bytes to check how 'useful' the unreserved registers are to the data in the remainder of the sprite and favour modifying less useful registers. Similarly, the order of steps 5 and 6 may have an impact on how many instructions it takes to encode the rest of the sprite.

Cwiiis · 23:26, 17 January 25

Ok, I couldn't leave it alone without exploring a little further - first, I made the whole thing a bit more generic. Rather than a list of things it tries manually coded, I now have a list of functions that modify registers and record how many bytes/nops they take. To that list, as well as what I had before, I added 8-bit inc/dec and sra/sla. The former of those two new instructions shaved a couple of bytes, the latter nothing

Now I have a more generic way of both trying and adding instruction support, I added breadth-first brute-force searching. As the image is iterated, for each byte that isn't already in a register, it records every possible solution. Then before the next iteration, the current solutions are sorted by size and the list of solutions to search is truncated to the specified breadth.

Adding that brute-force search, a 10-wide search goes from 227 bytes to 213 bytes - a very decent saving! Of course it's a lot slower, but it executes in very reasonable time (a matter of seconds at that width with this dataset).

The tool doesn't combine multiple 1-byte instructions, but I don't think there's a combination that would do anything that an 8-bit value load wouldn't accomplish without trashing a register and it'd add some probably unnecessary complication (I manually added functions that do every valid combination of 2 8-bit register loads). I think I can probably leave it alone for now

Cwiiis · 12:11, 18 January 25

Quote from: Cwiiis on 23:26, 17 January 25The tool doesn't combine multiple 1-byte instructions, but I don't think there's a combination that would do anything that an 8-bit value load wouldn't accomplish without trashing a register and it'd add some probably unnecessary complication (I manually added functions that do every valid combination of 2 8-bit register loads). I think I can probably leave it alone for now

While I haven't done this, I did manually add some combinations that I realised would be effective (two inc/decs on the same 16-bit register, an inc/dec + a register load). I also added ordering of the reserved register list by remaining byte frequency. At a breadth of 20, I'm now down to 210 bytes/594 nops per sprite on average

The worst case in my current sprite sheet is 257 bytes/641 nops. I'm probably going to leave it there, and certainly stop posting about it unless I happen across a dramatic saving.

There are still instructions to integrate (shifts, res, set) but it's very much diminishing returns now, there are very few value loads in any of the generated instruction lists. I'll likely add the instruction combining next if I make any more changes, it'll be nice to reduce the code size and maybe find some of the combinations that I might've missed doing it manually.

zhulien · 22:47, 05 February 25

I was going to post another question about compiled sprites but the I saw this post too... maybe relevant.

Are there any good sprite compilers on cpc?

Or on another machine that targets the cpc?

Does it let you build variants of the sprite, e.g. mirrored or flipped or even 6128 plus, screen memory base, full overscan or vertical overscan, horizontal overscan

When compiled sprites are drawn into screen memory (not plus sprite memory), do they usually draw from bottom up or top down, do they do clipping vertically at all?

Has anyone considered a standardised spritebank api for compiled and even non compiled sprites on cpc... basically so you can dynamically load them into memory or extended memory, from that choose based on requirements immediate to use lists for faster immediate processing, and basically you can then execute each in turn as required. This could be extended for tilemaps also. There are lots of pros for spritebanks as they can have alternate sprites for standard cpc, plus machine and v9990.

andycadley · 23:19, 05 February 25

Generally you'd want a compiler tailored to the specific scenarios you need to consider. Things like clipping are expensive and complex for compiled sprites, so you're not going to want code doing that unless you absolutely have to.

You definitely wouldn't want any kind of abstraction layer, the primary reason for compiling sprites is to maximize performance and generalising things to work in any scenario is the absolute antithesis of that.

lightforce6128 · 03:04, 06 February 25

Quote from: andycadley on 23:19, 05 February 25Generally you'd want a compiler tailored to the specific scenarios you need to consider. ... You definitely wouldn't want any kind of abstraction layer ...

I agree. From a software engineering point of view it is always good to abstract from low level details. But from my experience, if you start from a well organized program with abstractions, you soon reach the point where you need to make the program faster or smaller or both, additionally there are some strange restrictions forced by hardware, and then slowly the well planned abstractions go away, leaving one with an unorganized mess of assembler code. Then I usually add a big bunch of comments to improve the situation slightly.

What I can imagine is some kind of sprite compiler library, extendable with plug-ins for dozens or hundreds of different scenarios. This would be a big software engineering project, what reminds me on the topic of model driven architecture with its pros and cons.

zhulien · 04:47, 06 February 25

Isn't that why you choose how you want them compiled from the compiler? I guess top or bottom clipping still involves a test. But still a good compilation option for when you need a spirite to e.g. die and vanish into the ground like a zombie or use the no clip version moving around the screen but use the clip version when moving up or down off the visible screen. The reason having the large spritebank, and creating the active list would speed things up a lot. To me the abstraction would be in different compiled versions not choosing at runtime (at least not in the action). No point having a v9990 srpitebank loaded when using crtc or plus sprites. But nothing wrong with a mixture if you need them all... all compiled as they may or may not need to be.

andycadley · 08:35, 06 February 25

You could move some things into the compiler and produce different target code. Vertical clipping is not too bad (though if sprites never clip vertically, it's still wasted effort), but horizontal clipping is hard. And depending upon how you intend to use them, they may be pointless (Plus hard sprites will naturally clip at screen edges for example but clipping "in screen" will get very complex)

Having one compiler to target software sprites, Plus sprites and v9990 might sound appealing, but the reality is that each has such drastically different use cases and capabilities that you're never realistically be using the same assets for different targets - software sprites are a very different kettle of fish in terms of performance and usage but even the Plus and v9990 hard sprites can each do things the other one can't get close to. It's not like the ST vs Amiga days where you could just target what was feasible on the ST and get away with the same asset usage on the Amiga.

News:

Plus sprite upload tips