News:

Printed Amstrad Addict magazine announced, check it out here!

Main Menu

UFD Tecnnology (Ultra-fast drawing)

Started by AHack, 16:31, 31 March 19

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

AHack

I have to say Pinball Dreams looks fantastic and really shows what the Amstrad can do.


Can anybody from the Batman Group expain what UFD is? I presume it's a form of compiled sprites. An explanation of this would be great for the community.


I'm asking, because I'm working on something myself which will push the Amstrad in the direction of a variable speed horozontal scrolling playfield. So far everything is going to plan and maybe in a couple of months I will reveal what I'm working on. But any new knowledge on how to speed up drawing would be welcome... Hence my cheeky ask ;)

GUNHED

If you want to see ultrafast drawing then look at games without hardware scroll and with lots of (software) sprites.

http://futureos.de --> Get the revolutionary FutureOS (Update: 2023.11.30)
http://futureos.cpc-live.com/files/LambdaSpeak_RSX_by_TFM.zip --> Get the RSX-ROM for LambdaSpeak :-) (Updated: 2021.12.26)

Rhino

Thanks!, lately I'm a little busy preparing a preview of our next project, but as soon as I have time I'll upload an example here.

ervin

Quote from: Rhino on 13:32, 02 April 19
Thanks!, lately I'm a little busy preparing a preview of our next project, but as soon as I have time I'll upload an example here.

I'm *very* interested in both parts of that sentence!
;D

AHack

Looking forward to your explanation.

AHack

I think I may have an idea on how Rhino could be doing this after thinking how you could speed up compiled sprites. The idea I came up with was to think of a sprite line as a span. The worst case for a span is if the ends need to mask with the background screen. I have not really tested this or worked out proper timings to see if if is quicker than a normal compiled sprite routine but by using stack abuse you fill the middle parts of a span with 16bit writes. Efectivly you can swap in and out to achieve holes and such in a span.


This is all untested and is basically an idea on what Ultra Fast Drawing could be and considers one case with masking at both ends of the span and a 16bit fill in the middle:


hl = screen address
de = add to get to next span
bc = wrap add


add hl,de will get poked with add hl,bc... Rhino discussed all this is in a thread about compiled sprites.




ld a,(hl)
and n
or n
ld (hl),a


dec hl
ld sp,hl
exx


ld bc,nn
ld de,nn
ld hl,nn
push hl
push de
push bc


exx
ld hl,0
add hl,sp



ld a,(hl)
and n
or n
ld (hl),a


add hl,de ; next line




Just an idea on what this could be :)






AHack

#6
Thinking about this some more you can improve things a lot. Here's an example of a compiled sprite routine that handles masks and caters for transparent holes. It also stores common bytes for quicker loads. All theory and untested.


hl = start screen address
de = next line add
bc = next line add with wrap


; The common bytes setup gets called at the beginning of the compiled sprite routine
exx
ld de,nn ;common bytes
ld bc,nn ;common bytes
exx


; if no masking needed you can just do ld (hl),n
ld a,(hl)
and n
or n
ld (hl),a


ld sp,hl
dec sp ;preserves hl address by decreasing the SP instead of hl to get the correct address. That way hl never changes until the next line.
exx


ld l,e
ld h,d
push hl
ld hl,nn
push hl
ld l,c
ld h,b
push hl
ld hl,nn
push hl


ld hl,0
add hl,sp


ld a,(hl)
and n
or n
ld (hl),a
dec hl
dec hl


ld sp,hl
ld hl,nn
push hl
dec sp
ld hl,nn
push hl
ld hl,nn
push hl


exx
; you can poke add hl,bc here to get correct next line
add hl,de




I really wish Z80 had this instruction:


ld hl,sp because ld hl,0, add hl,sp is pretty expensive to get the SP back into hl.




Depending on the span lengths and transparent bits it may be quicker to do a traditional compiled sprite sequence for some lines. Also if you leave out masking and do what Edge Grinder does for compiled sprites you can speed that up dramatically by using stack abuse. Anyway, this is what I think Rhino could be doing. It would be nice for him to confirm this ;)




AHack

If you were just doing transfers to the screen then things would get guicker. For example in Pinball Dreams all the flashing lights on the pinball table could be blitted this way without masking and transparencies and would be super quick. You could also do quick saving of the backgrounds for quick restores when sprites are in use. Abusing the stack is pretty powerful for fast graphics.


The blitting is faster in these cases because you can drop this sequence:


ld hl,0
add hl,sp

Rhino

Hi!,
I would like to make a more detailed explanation, but seeing that you are pointing in the right direction and I still don't have much time, here we go...

Some preliminary considerations:

* This method is efficient for medium/large sprites/gfxs (and especially in wide sprites), for small sprites the classic compiled method can be more efficient.

* In Pinball Dreams it is only used for the larger sprites/gfxs, such as flipper and some lights. In this case, the line jump is also specially optimized as it is always drawn in fixed positions, but here we will talk about a general implementation. In my current project I am making more extensive use of this technique.

In the attached file you can see an example of UFD sprite (for WinAPE assembler):



It is 24x50 = 1200 bytes and takes 1980 cycles to draw (1.65c per byte on average), although this is a particularly favorable case as it has many transparent bytes at the borders.

Well, to compare we will take as classic compiled sprite method the following:

ld (hl),x ; write unmasked byte

inc/dec l

* Note that the classic compiled sprites work worse when the screen width is different from 32 characters (256 pixels of mode 1), since it is necessary to do inc/dec HL instead of L. This makes it more interesting to use alternative methods such as optimized RLE in such cases, whereas UFD works for any screen width.

* The use of the stack forces to disable the interruptions during the drawing of the sprite, nevertheless, with a small variation it can work with the interruptions working with a maximum delay of 33 cycles (which makes possible line precision interruptions while drawing) to a cost of an additional 5% of speed.

* In classic compiled sprites the line jump is usually done with pop bc: add hl,bc , which requires 6c for all cases but in return the jump occurs in X aligned positions (at the beginning or end of the sprite) and it is necessary to make some L incs/decs if the first bytes are not used. In the generic implementation of UFD, line breaks are made by characters with: ld hl,x: add hl,sp , which are also 6 cycles, but they put the pointer in the exact place where to draw, and in 1 of every 8 lines it is necessary to check the carry flag obtaining as global result a better performance in line jumps for medium/large sprites.

* I cache the most repeated values in DE and during a line I load BC and HL with the data to write, this makes possible many optimizations (see example). All these optimizations must be solved through the tool that converts the graphics, although in my current tool all optimization possibilities are not supported. And the use of the stack should only be implemented in the lines where it is more efficient than using HL.

* Reducing the number of bytes with mask to the essential (adding pixels of background color in some cases) helps to increase performance considerably. If they are removed completely, A can also be used to cache data. The ideal is that the graphic designer knows the characteristics of optimization of this technique and take them into account when making the graphics and in the design of them.

* Classic compiled sprites to draw 2 bytes without mask need 4 additional bytes of code: ld (hl),x: inc l: ld (hl),x: inc l , UFD only needs 2 bytes of code for every 2 bytes of gfx: ld bc,x: push bc, on the other hand, the exchange of values between HL and SP and line jumps with carry check add additional bytes, but it require less memory in general.


Btw, congratulations AHack for imagining it before explaining it!

Regards

AHack

#9
Thanks for confirming this Rhino :) I was just considering a caching scheme as well but it's nice to know from your code that is a good direction to go down. Also, I never considered using added offsets into the ld hl,nn, add hls,sp sequence but I may of got there in the end ;) I guess once you start down this route you start to think what you can do to speed code up in special cases.


It's funny with low level optimisation coding (I've been coding games for 30 years plus) that if someone hints that something is possible then you can usually work out what it could be. But the secret is knowing that it is possible - it was fun working out what UFD was :)


I'm actually enjoying doing Z80 again... My own Amstrad CPC project is starting to get there now. It's basically test stuff at the momment but I'm astonished by what I'm getting this old 8bit computer to do. I think I can get a multi-speed horizontal scroller working at 50 fps with lots of sprites on the screen. I have a system in place where the last 3 interrupts are disabled where I can do lots of stack abuse - I call this area my stack abuse kernal which is highly optimised code that handles screen edge draw and sprites. Because the screen edge update is known timings I can do my CRTC changes at positions where I want. The first 3 interrupts work as normal for my CRTC changes and on the 4th I set a flag so the game loop knows the starting position on when I can do the stack abuse. And game code can just run as normal without interleaving halts all over the place. I really wish I had this knowledge back in the 80s :D


I'm also doing an added challenge to get all this to work for 64k machines.

AHack

@Rhino, I noticed you did not use the extra registors with the exx instruction. I think the way I imagined it with my untested code I wrote is that by using exx the next line calculations would be faster to do.


With scrolling you could do this which misses the jr instruction


exx
add hl,de ;poke add hl,bc here for getting the right line if needed for the current line
res 3,h


Also, the other set can hold common bytes in de and bc or for what ever you want those registors for.

AHack

I need to think about the next line stuff more because your method can get you anywhere on the next line and my method works like a traditional compiled sprite routine. But I still think there are gains by using the alternative registors. Anyway, all good fun thinking about it!

Rhino

Wow! You seem to have a great project in your hands, I hope everything goes well and you surprise us all!

I completely agree with your thought, the key is to question everything that is taken for granted. Premises such as "this already known technique is as fast as possible", without questioning anything, only make nobody devote themselves to seeking alternatives. And that is the philosophy that I try to apply in CPC since BF.

Yes, the use of extra registers to cache more data is pending in my tool. Although I noticed that in practice it doesn't apply as much as it might seem, if you only use it to cache values like this: exx: push bc: exx, you save 1c compared to ld bc,x: push bc, and the cached value must appear more than 3 times to compensate the initialization. You can do an exercise by applying that optimization manually in the example to see how many cycles you win. In addition, you should also deduct if you are taking advantage of the extra registers in the routine that calls to draw.

For line jumps I don't use it, since the slow code for this only appears 8 times per sprite max and in only one case the carry occurs. Eg, of the 50 lines of example sprite, 42 lines needs only 6c (SPRITE_NEXT_CHAR macro), while 7 lines needs 9c (SPRITE_NEXT_LINE macro) and only one needs 15c, 30 additional cycles in total, which is more than compensated by the precise pointer jumps. And, on the other hand, automodify the code by poke in the adds, you will usually needs more than 30c for the code that poke/restores the code.

AHack

Thanks for the heads up on the next line code considerations... When I get my own tool up and running I need to play around with all the different combinations of code sequences and see what gives the best timings. Yep, I agree with questioning everything... and also never be afraid to ask stupid questions to get better understandings. I basically started coding the Amstrad again after seeing Pinball Dreams running on real hardware and was taken aback with what you managed to do. I also curious on what your next project is :)


Hopefully I will have something to show soon. I will do a proof of concept demo to show that I'm serious about all this. The demo's purpose is to get people to notice and contribute to graphics and music. Basically I need a demo to show my intentions are serious. Looking forward to your next project!

SpDizzy

Quote from: AHack on 14:05, 27 April 19
I have a system in place where the last 3 interrupts are disabled where I can do lots of stack abuse - I call this area my stack abuse kernal which is highly optimised code that handles screen edge draw and sprites. Because the screen edge update is known timings I can do my CRTC changes at positions where I want. The first 3 interrupts work as normal for my CRTC changes and on the 4th I set a flag so the game loop knows the starting position on when I can do the stack abuse. And game code can just run as normal without interleaving halts all over the place.

First of all, excuse me if this is considered offtopic. I'm just learning, and some concepts are completely new to me, so maybe an stupid question arrives...
I think UFD it's a great technique, sadly beyond my knowledge (thanks @Rhino for the detailed explanation and @AHack for imagine the way it should work).
Not working with UFD but working with interruptions for different tasks, so I'm interested on the 'stack abuse' concept.

I know I can disable interrupts, doing the stack abuse, and then reenable interrumpts after Vsync, but what's going on during this 'stack abuse'? What are the benefits for doing it? It's a way to write faster / safer to screen during that stack abuse?
In my case some code during interrupts may take more than 52 scanlines, so I guess I'm on the risk of losing timings. May I benefit for doing that part of code during the 'stack abuse'?


I think this could be somehow related to the thread "Better understanding of interrupts. Help needed" there:
http://www.cpcwiki.eu/forum/programming/better-understanding-of-interrupts-help-needed/


Please excuse me if that's not the case, and this is offtopic, so I may delete it.
Thanks so much,




AHack

#15
I guess because UFD uses the stack pointer then a discussion on how to safely use it with interrupts is probably a must.


Rhino has an option to disable and enable interrupts in his UFD code example at the cost of cycles so that interrupts can work freely with it.


The reason why changing the stack pointer is dangerous is because an interrupt may suddenly get called and this could cause corruption or crashes so it's best to turn off interrupts and be safe when using stack abuse. I read your thread about interrupts and that is the wrong way to go about it (I did that mistake years ago in the 80s when I first programmed the Amstrad and it becomes a horrid mess to stop code overflowing cycles and such) and I propose a better way.


The best way is to only use interrupts for changes like CRTC, colour or screen mode changes. Then your code is free to not worry about how long code takes. Hopefully this code makes it clear for anybody wanting to stack abuse in a safe way.

org &4000
run start
;;write direct "a:output.bin"
nolist


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Data area
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;


.Int4Reached
defb 0


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Interrupt area
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;


; Use the interrupts to do any CRTC changes. Doing it here removes halts for the mainloop


int1:
push hl
ld hl,int2
ld (&0039),hl
pop hl
ei
ret


int2:
push hl
ld hl,int3
ld (&0039),hl
pop hl
ei
ret


int3:
push hl
ld hl,int4
ld (&0039),hl
pop hl
ei
ret


int4:
push af
ld a,#1
ld (Int4Reached),a
pop af
; Notice the interrupts are not turned on here
ret


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Program area
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.start


; Set interupt handler
di
im 1
ld a,&c3
ld (&0038),a


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; MainLoop
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
.MainLoop
; Wait for the start of the vsync
ld b,&f5
.vsync
in a,(c)
rra
jr nc,vsync


; Set interrupt 1 and enable it
ld hl,int1
ld (&0039),hl
ei


; If this halt was commented back in you could in theory stack abuse for 52 scanlines
; But never over run or the interrupts will mess with stack
;halt
;ld (savedSP),sp
; Do stack abuse for approx 52 scanlines
;ld sp,(savedSP)


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; From here on you have 52 * 3 scanlines to process the game logic minus any stack abuse code
; if enabled
call Gamelogic
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;


; Wait for interrupt 4
ld hl,Int4Reached
.WaitInt4
bit 0,(hl)
jr z,WaitInt4
ld (hl),0


; Save stack and abuse from here on
ld (savedSP),sp


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Code here that can use the stack pointer with no restrictions or fear of an interrupt
; You have basically 52*3 scanlines to use the stack how you want
;
; For my project that I'm working on now, this is my Stack Abuse kernal that draws graphics
; as fast as possible using the stack pointer. There is no fear of an interrupt screwing things up
;
; I can still time position for CRTC changes because some code blocks have known timings
;
; From here on using call will be bad but you can still call functions by jumping
; Use IY as the return address but never use it in the code called
ld iy,return1
jp TestRoutine
.return1


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;


; Restore stack
ld sp,(savedSP)
jp MainLoop




.Gamelogic
ret


.TestRoutine
jp (iy) ;Return back

SpDizzy

Thanks so much @AHack for the concrete explanation and the piece of code, really clear and clever.
So I'm revisiting your 80's code errors, I guess it's never too late to make mistakes  :picard:
Hope your project will be successful, all the best wishes on the development.
Thanks for your support

ronaldo

#17
Quote from: Rhino on 15:18, 27 April 19
I completely agree with your thought, the key is to question everything that is taken for granted. Premises such as "this already known technique is as fast as possible", without questioning anything, only make nobody devote themselves to seeking alternatives. And that is the philosophy that I try to apply in CPC since BF.
Completely agree too. I keep constantly telling my students that there is no such thing as "the best algorithm", that everything depends always on constraints and goals. An algorithm that is "as fast as possible" in a given context probably is not in a different one, or does not meet other requirements such as memory constraints, compactability, usable from ROM...

It is always the mark of good engineers being thoughtful and critic, and facing each problem knowing previous solutions, but never assuming that it is "already solved". Even many well-known algorithms for 50 years, included in standard libraries, are improvable in hundreds of contexts.

BTW, nice work and discussion @Rhino and @AHack . These bits of knowledge you share are really valuable.

GUNHED

Quote from: Rhino on 15:18, 27 April 19
... the key is to question everything that is taken for granted. Premises such as "this already known technique is as fast as possible", without questioning anything, only make nobody devote themselves to seeking alternatives. And that is the philosophy that I try to apply in CPC since BF.
Very true.  :)  That was also the approach when I started coding FutureOS in about 1989. Search the best way of handling things while ignoring how others do it. And it's of course the most fun to create own routines and watch then running faster than anything else before.  :)
http://futureos.de --> Get the revolutionary FutureOS (Update: 2023.11.30)
http://futureos.cpc-live.com/files/LambdaSpeak_RSX_by_TFM.zip --> Get the RSX-ROM for LambdaSpeak :-) (Updated: 2021.12.26)

BSC

Quote from: AHack on 15:40, 27 April 19Hopefully I will have something to show soon. I will do a proof of concept demo to show that I'm serious about all this. The demo's purpose is to get people to notice and contribute to graphics and music. Basically I need a demo to show my intentions are serious. Looking forward to your next project!
I wonder what happened to your project, it sounded very intriguing. 
** My SID player/tracker AYAY Kaeppttn! on github **  Some CPC music and experiments ** Other music ** More music on scenestream (former nectarine) ** Some shaders ** Some Soundtrakker tunes ** Some tunes in Javascript

My hardware: ** Schneider CPC 464 with colour screen, 64k extension, 3" and 5,25 drives and more ** Amstrad CPC 6128 with M4 board, GreaseWeazle.

Nich

Quote from: BSC on 22:57, 16 January 24I wonder what happened to your project, it sounded very intriguing.
I assume the project is this Dropzone clone. @AHack hasn't logged on to the forum since March 2020 so I guess it's been abandoned, which is a shame.

BSC

Too bad, would liked to have seen the final game(s).
I hope AHack is ok, he quietly disappeared in early 2020 ..
** My SID player/tracker AYAY Kaeppttn! on github **  Some CPC music and experiments ** Other music ** More music on scenestream (former nectarine) ** Some shaders ** Some Soundtrakker tunes ** Some tunes in Javascript

My hardware: ** Schneider CPC 464 with colour screen, 64k extension, 3" and 5,25 drives and more ** Amstrad CPC 6128 with M4 board, GreaseWeazle.

Powered by SMFPacks Menu Editor Mod