Fast Sprites
Here are some tips for sprite and screen handling.
1. The best screen width to use is 32 MODE 1 characters (set in CRTC register 1). This is wide enough for most games, and its special property is that if your screen base address is on a 64 byte boundary, no pixel row will cross a 256 byte page boundary. Hence you can use INC reg8 to move to the next display byte rather than INC reg16 (where reg8 is an 8-bit register and reg16 is a 16-bit register). This only saves 1 microsecond, but in a tight loop every one counts.
2. Avoid using PUSH/POP inside your sprite loop. These are slow instructions. Similarly, avoid indexed addressing: (IX+n) and (IY+n) are slow. One case where it may be wise to use an index register is for a loop counter, so you can keep BC for something else.
3. If it doesn't need to be transparent, don't draw it that way. The quickest way to move the data in most cases is LDI. Draw solid sprites separately and, if possible, build a large sprite from smaller ones, some with transparent sections and some opaque.
4. Keep your sprite data from crossing page boundaries on pixel rows (or at all if possible). For the same reason as in tip 1 above, you may then be able to increment your data pointer using INC reg8.
5. Use mask tables to save memory. If you've got plenty of memory left, you can store the AND and OR masks for each sprite with the sprite data. This is the fastest approach for transparent sprites:
    LD A,(DE)       ;read the screen byte
    AND (HL)        ;apply the AND mask
    INC L
    OR (HL)         ;apply the OR mask
    INC L
    LD (DE),A       ;write the result back
    INC E           ;Total = 11 μs per byte
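But in most cases you don't have that much memory to store all your graphics. Instead, you can create a 256 byte AND mask table, and possibly also a 256 byte OR mask table, both indexed by the sprite byte. This is the quickest way to mask the data while saving some memory:
    LD A,(BC)       ;read the sprite byte
    INC C
    LD L,A          ;use it as the index into the tables
    LD A,(DE)       ;read the screen byte
    AND (HL)        ;AND mask table
    INC H
    OR (HL)         ;OR mask table, one page above the AND table
    DEC H
    LD (DE),A
    INC E           ;Total = 15 μs per byte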
The other advantage of using this method is that, with a small change to the above code, you can quite easily use different mask tables. In ZACK, I use a reversing table which has the two MODE 0 pixels reversed, essentially flipping the byte left to right. By then changing the INC C to a DEC C (and starting at a different offset in the sprite data) it's easy to flip a sprite, saving memory for sprites which can face both ways. Another table has the OR masks all set to ink 15 for every non-transparent pixel. This can then be masked again with a colour to create a solid colour version of a sprite.
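The tables themselves are not listed above. As a rough sketch of how they could be generated offline, the Python below builds MODE 0 tables on the assumption that ink 0 is the transparent colour and that the left and right pixels occupy the bit masks #AA and #55. The function and table names are made up for illustration; they are not taken from ZACK.

```python
LEFT, RIGHT = 0xAA, 0x55  # bits of the left and right MODE 0 pixel in a byte


def and_mask(b):
    """AND mask for sprite byte b: keep the background under ink 0 pixels."""
    mask = 0
    if (b & LEFT) == 0:
        mask |= LEFT
    if (b & RIGHT) == 0:
        mask |= RIGHT
    return mask


def flip(b):
    """Swap the two MODE 0 pixels of a byte (left <-> right)."""
    return ((b & LEFT) >> 1) | ((b & RIGHT) << 1)


# Each table is 256 bytes, indexed by the sprite byte (register L in the Z80 code).
AND_TABLE = bytes(and_mask(b) for b in range(256))
OR_TABLE = bytes(range(256))                               # sprite byte unchanged
FLIP_AND = bytes(and_mask(flip(b)) for b in range(256))
FLIP_OR = bytes(flip(b) for b in range(256))               # mirrored sprite byte
SOLID_OR = bytes(~and_mask(b) & 0xFF for b in range(256))  # ink 15 where opaque


def plot(screen, sprite, and_table=AND_TABLE, or_table=OR_TABLE):
    """Same arithmetic as AND (HL):INC H:OR (HL) on a single byte."""
    return (screen & and_table[sprite]) | or_table[sprite]
```

Each pair of tables would be written out as two consecutive 256 byte pages, so that INC H moves from the AND table to the matching OR table.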
Out of registers?
As you can see with the above code, all the 8 bit registers are used. What's left for loop counters etc?
1. One method is to use the alternate register set (beware if the OS is still being used). The EXX instruction will swap BC, DE and HL with their alternates in 1 microsecond.
2. Another method (if you're not worried about clipping and all your sprites are the same size) is to unroll the loop n times, where n is the width in bytes of the sprite, e.g. for 4 bytes wide:
    .byte0
    LD A,(BC):INC C
    LD L,A:LD A,(DE)
    AND (HL):INC H:OR (HL):DEC H
    LD (DE),A:INC E
    .byte1
    LD A,(BC):INC C
    LD L,A:LD A,(DE)
    AND (HL):INC H:OR (HL):DEC H
    LD (DE),A:INC E
    .byte2
    LD A,(BC):INC C
    LD L,A:LD A,(DE)
    AND (HL):INC H:OR (HL):DEC H
    LD (DE),A:INC E
    .byte3
    LD A,(BC):INC C
    LD L,A:LD A,(DE)
    AND (HL):INC H:OR (HL):DEC H
    LD (DE),A:INC E
This is by far the fastest method if you make sure your sprites never need to clip. If they do need to clip, you can use some tricky code to enter part way through this code, e.g. store the address (byte0, byte1, byte2 or byte3) in IX and do PUSH IX:RET to continue the loop. Non-indexed operations on IX, i.e. those not using (IX+n), are only one microsecond slower than the same operations on HL.
3. Use the undocumented 8 bit operations on the high and low halves of IX and IY (HX, LX, HY and LY). You could use HX for your loop counter, for example:
    .dobyte
    LD A,(BC):INC C
    LD L,A:LD A,(DE)
    AND (HL):INC H:OR (HL):DEC H
    LD (DE),A:INC E
    DEC HX
    JR NZ,dobyte
The same of course applies to the unrolled version, where you could use HX for the row counter.
Incrementing the row
How do we start drawing the next screen line? Since we've already used all the registers, this can be quite tricky. Each scan line is offset by #800 bytes, except that every 8 scan lines the offset is actually (CRTC Register 1 * 2) - #3800. Also, we've been incrementing the screen address each byte by 1, so now our offset is also out by 4 bytes if the sprite is 4 bytes wide. Clipping the sprite makes this even more of a challenge, but I won't get into that :)
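To make the offsets concrete, here is a small Python model of the screen layout described above. It is only a sketch: the function names are hypothetical, and it assumes CRTC register 9 is 7 (8 pixel lines per character row), a 200 line screen and no hardware scrolling. It also checks the claim from tip 1 that a 32 character wide screen on a 64 byte boundary never has a pixel row crossing a page.

```python
def line_address(y, base=0xC000, r1=32):
    """Address of the first byte of pixel line y (8 lines per character row)."""
    return base + (y % 8) * 0x800 + (y // 8) * r1 * 2


def line_offsets(r1=32, lines=200):
    """Set of address steps between consecutive pixel lines."""
    return {
        line_address(y + 1, r1=r1) - line_address(y, r1=r1)
        for y in range(lines - 1)
    }


def rows_stay_in_page(base=0xC000, r1=32, lines=200):
    """True if no pixel line crosses a 256 byte page boundary."""
    width = r1 * 2  # bytes per pixel line
    return all(
        line_address(y, base, r1) >> 8
        == (line_address(y, base, r1) + width - 1) >> 8
        for y in range(lines)
    )
```

For example, line_offsets(32) contains only #800 and 64 - #3800, while rows_stay_in_page(0xC000, 40) is false for the standard 80 byte wide screen.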
With the unrolled version above, at only 4 bytes wide, the easiest method would be to change the last INC E to three DEC E's, i.e.
    .byte3
    LD A,(BC):INC C
    LD L,A:LD A,(DE)
    AND (HL):INC H:OR (HL):DEC H
    LD (DE),A:DEC E:DEC E:DEC E
E is then back at its original value, so all that remains is to move to the next line. If your sprite is more than 4 bytes wide, or you're not using unrolled loops, you'll need to either subtract the width from E, or make sure you add the appropriate amount when moving to the next line (e.g. #7FC rather than #800 for a 4 byte wide sprite). The latter involves 16 bit addition, which is slower, and we're short of registers as usual.
The technique I normally use is to either use some self modifying code, eg.
    LD A,E
    .sprwid equ $+1     ;address of the operand of SUB, patched with the width
    SUB 4
    LD E,A
Or to store the width in one of those undocumented 8 bit registers (HX, HY, LX or LY). eg.
    LD A,E
    SUB LX
    LD E,A
Calculating the next screen address is usually relatively simple then, using 8 bit registers:
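    LD A,D
    ADD 8           ;next pixel line is #800 further on
    LD D,A
    JR NC,nowrap    ;no carry: still inside the same character row
    LD A,#40        ;wrapped: add #C040 to reach the next character row
    ADD E
    LD E,A
    LD A,#C0
    ADC D
    LD D,A
    .nowrap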
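As a sanity check on the 8 bit arithmetic above, the following Python sketch mirrors it register by register and compares it with the offsets given earlier. It assumes the screen is at #C000 with a 32 character width, and both function names are invented for this example.

```python
def next_line_8bit(d, e):
    """Step D (high byte) and E (low byte) to the next pixel line, as the Z80 code does."""
    d += 8                          # LD A,D : ADD 8 : LD D,A
    if d <= 0xFF:                   # JR NC,nowrap
        return d, e
    d &= 0xFF
    e += 0x40                       # LD A,#40 : ADD E : LD E,A
    carry = e >> 8
    e &= 0xFF
    d = (0xC0 + d + carry) & 0xFF   # LD A,#C0 : ADC D : LD D,A
    return d, e


def next_line_16bit(addr):
    """The same step written with the offsets from the text."""
    if (addr & 0x3800) == 0x3800:   # last pixel line of a character row
        return addr + 32 * 2 - 0x3800
    return addr + 0x800
```

For instance, next_line_8bit(0xF8, 0xC0) gives (0xC1, 0x00), the same address that next_line_16bit(0xF8C0) returns.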
There are of course numerous other ways of doing this, maybe even faster ones. One example is to have the stack pointing to a table of screen addresses for each row, and hold an offset in a register (with interrupts disabled, or interrupt safe timing), eg.
    POP DE          ;next row address from the table
    LD A,I          ;offset held in I
    ADD E
    LD E,A
The only thing left to do now is continue the loop for each row, depending on where you stored your loop counter(s), this could be as simple as:
    DEC HY
    JR NZ,dobyte
A complete sprite routine
Here's a pretty fast transparent sprite drawing routine in full. Note that it assumes the screen width is 32 MODE 1 characters and the screen base is #C000 with CRTC register 9 set to 7, and that it has no clipping. Please note, I've also changed the loop so that it exits with RET Z before calculating the next address if it's the last row. This also allowed me to use JR NC,rowloop to continue the loop if the address doesn't wrap. Please also note that I haven't actually tried to assemble this code yet :) it may need a couple of tweaks to run!
    ;Entry: BC = Sprite address, D = width, E = height, HL = screen address
    LD LX,D             ;LX = width
    LD HY,E             ;HY = height
    EX DE,HL            ;DE = screen address
    LD H,andmasks / 256 ;H = page of the AND table (OR table in the next page)
    .rowloop
    LD HX,LX            ;HX <= width
    .dobyte
    LD A,(BC):INC C
    LD L,A:LD A,(DE)
    AND (HL):INC H:OR (HL):DEC H
    LD (DE),A:INC E
    DEC HX
    JR NZ,dobyte
    DEC HY
    RET Z               ;last row done
    LD A,E              ;move E back to the start of the row
    SUB LX
    LD E,A
    LD A,D              ;next pixel line
    ADD 8
    LD D,A
    JR NC,rowloop
    LD A,E              ;wrapped: move to the next character row
    ADD #40
    LD E,A
    LD A,D
    ADC #C0
    LD D,A
    JR rowloop
Further Optimisation
This code could be further improved by unrolling the loop, but that would involve using some self modifying code to patch the rowloop jumps. If you have a number of predetermined widths for your sprites, you could rewrite the routine unrolled for each available size. Note that this could make it difficult later if you want to clip the sprites.
If you don't need to either paint the sprite in a solid colour or flip the sprite left to right (or you've got enough memory to store a flipped version), then you can optimise the routine to plot a single byte even further. The sprite byte itself then serves as the OR mask, so only the AND table is needed:
    LD A,(BC)
    INC C
    LD L,A
    LD A,(DE)
    AND (HL)
    OR L
    LD (DE),A
    INC E           ;Total = 12 μs per byte
As you can see, this is now down to 12 microseconds per byte, only 1 microsecond slower than storing the mask with the data, but the sprite data is half the size, so you can store twice as much graphics data in your left-over 2K of memory once you've got the double-buffered screens, music and sound fx code and game logic in place.
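The trade-off between the three variants can be worked through with a few lines of Python. This is only illustrative arithmetic using the per-byte timings quoted above; it ignores loop and next-line overhead, and the names are hypothetical.

```python
# (microseconds per screen byte, bytes stored per screen byte, bytes of tables)
VARIANTS = {
    "masks stored with data": (11, 2, 0),
    "AND and OR tables": (15, 1, 512),
    "AND table with OR L": (12, 1, 256),
}


def sprite_cost(width, height, sprites=1):
    """Return {variant: (inner-loop microseconds per draw, total bytes of storage)}."""
    screen_bytes = width * height
    return {
        name: (us * screen_bytes, sprites * per_byte * screen_bytes + tables)
        for name, (us, per_byte, tables) in VARIANTS.items()
    }
```

With 32 sprites of 4 by 16 bytes, sprite_cost(4, 16, sprites=32) gives 704 μs and 4096 bytes for stored masks, 960 μs and 2560 bytes for two tables, and 768 μs and 2304 bytes for the OR L version.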
Clipping
Depending on the routine you use, clipping can be quite difficult or quite simple.
Clipping vertically is simply a matter of adjusting (a) the start offset in the sprite and (b) the number of rows (passed in E in the example code).
Clipping horizontally presents more of a problem. Simply adjusting the start offset and the width (passed in D in the example) won't do the job, because the INC C won't happen enough times to increment the sprite offset to the next row. To get around this, either store the value of C in a spare register at the start of the loop (I think there's 2 left in the example: LY and I), then at the end of the horizontal loop add the width to the value (remember the width passed in is not the width of the data!), or precalculate the difference between the actual sprite width and the displayed sprite width, and add this value. The latter is probably the preferred method, e.g.
At the start of the code, pass in the clipped width in E and the sprite width in A, for example:
    SUB E:LD LY,A       ;LY = sprite width - displayed width
Then at the end of the loop (just after RET Z):
    LD A,C:ADD LY:LD C,A
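The bookkeeping for horizontal clipping is easy to get wrong by one, so here is a hedged Python model of it. It assumes a 64 byte wide screen, measures x in bytes from the left edge, and uses hypothetical function names; skip is the value the Z80 code keeps in LY.

```python
def clip_columns(x, sprite_width, screen_width=64):
    """Return (start_offset, shown_width, skip) for a sprite whose left edge is at byte x."""
    left = min(sprite_width, max(0, -x))                                 # hidden on the left
    right = min(sprite_width, max(0, x + sprite_width - screen_width))   # hidden on the right
    shown = max(0, sprite_width - left - right)
    return left, shown, sprite_width - shown


def visible_bytes(data, sprite_width, height, x, screen_width=64):
    """Walk the sprite data the way the clipped loop does and collect the bytes it reads."""
    offset, shown, skip = clip_columns(x, sprite_width, screen_width)
    rows = []
    for _ in range(height):
        rows.append(data[offset:offset + shown])  # inner loop: INC C once per shown byte
        offset += shown + skip                    # after RET Z: LD A,C:ADD LY:LD C,A
    return rows
```

For a 4 byte wide sprite at x = -1 this yields a start offset of 1, a displayed width of 3 and a skip of 1, so each row starts exactly 4 bytes after the previous one.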
One last thought
Earlier on in this document, I mentioned not using index registers because they are slow. There could, however, be some merit in using them, especially in unrolled loops, to replace the BC register above. Using the index register may remove the need for the INC C and the LD L,A above, for example:
    LD L,(IY+0)
    LD A,(DE)
    AND (HL)
    OR L
    LD (DE),A
    INC E           ;Total = 13 μs per byte
This code is only 1 microsecond slower than the previous, but unrolled (using (IY+1), (IY+2) etc) it won't destroy the value in IY, and it also leaves BC free for loop counters or perhaps extra masks. Not destroying the value in IY means you can simply add the width in bytes to LY even when clipping to ensure the sprite data points to the next valid byte, but remember that adding a value to LY will take at least 5 microseconds, plus one extra microsecond per byte in the loop....
Executioner 19:55, 12 July 2006 (CDT)
If you have enough memory, the fastest way to plot sprites is to use the "direct addressing" method. Here is a small example:
    LD HL,screen address
    LD BC,#800+[most used byte]
    LD DE,#C050
    LD (HL),byte1:INC HL          ;plot line 1
    LD (HL),byte2:INC HL
    LD (HL),C
    LD A,H:ADD B:LD H,A
    JR NC,line2
    ADD HL,DE
    .line2
    LD (HL),byte6:DEC HL          ;plot line 2
    LD (HL),C :DEC HL
    LD (HL),byte4
    LD A,H:ADD B:LD H,A
    JR NC,line3
    ADD HL,DE
    .line3
    LD A,(HL):AND #55:OR byte7:LD (HL),A:INC HL ;plot line 3 (contains transparent areas)
    LD (HL),byte8:INC HL
    LD A,(HL):AND #AA:OR byte9:LD (HL),A
    RET
In mode 0 you will need two routines for every sprite, but I think there is no faster way to plot sprites on the CPC. Prodatron
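Routines like the one above are usually produced by a tool rather than typed by hand. As a hedged sketch only (the function names are hypothetical and this is not Prodatron's generator), the Python below emits Maxam-style source for a small MODE 0 sprite. It assumes ink 0 is transparent, an 80 byte wide screen at #C000 as in the example, and HL holding the screen address on entry. It leaves out the trick of keeping the most used byte in C.

```python
def and_mask(b):
    """AND mask for a MODE 0 sprite byte, treating ink 0 as transparent."""
    mask = 0
    if (b & 0xAA) == 0:
        mask |= 0xAA
    if (b & 0x55) == 0:
        mask |= 0x55
    return mask


def compile_sprite(rows):
    """Return source lines that plot rows (equal-length lists of bytes) in a zig-zag."""
    out = ["LD B,8", "LD DE,#C050"]
    for y, row in enumerate(rows):
        forward = y % 2 == 0
        step = "INC HL" if forward else "DEC HL"
        data = row if forward else row[::-1]   # odd lines are plotted right to left
        if y:
            out += ["LD A,H:ADD B:LD H,A", f"JR NC,line{y}", "ADD HL,DE", f".line{y}"]
        for i, b in enumerate(data):
            mask = and_mask(b)
            if mask == 0:                      # both pixels opaque
                out.append(f"LD (HL),#{b:02X}")
            elif mask != 0xFF:                 # one pixel transparent
                out.append(f"LD A,(HL):AND #{mask:02X}:OR #{b:02X}:LD (HL),A")
            if i < len(data) - 1:              # fully transparent bytes are just skipped
                out.append(step)
    out.append("RET")
    return out
```

Calling print("\n".join(compile_sprite(rows))) on a list of rows gives text that can be pasted into the assembler source.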