drawing lines with cpcTelera / SDCC (V1.0 RELEASE)

ronaldo · 20:13, 16 August 16

@ervin whenever you compile a CPCtelera project, the obj/ folder is generated. Inside you can find compiled assembly for all your code, with detailed information on sizes and performance. You can look into .map file, which has all the generated global symbols. Using it is really easy to calculate sizes of every function from the code or bigger blocks. You also can use individal .rel files which show their binary size in their header (they are text files).

ervin · 00:27, 17 August 16

Quote from: ronaldo on 20:13, 16 August 16
@ervin whenever you compile a CPCtelera project, the obj/ folder is generated. Inside you can find compiled assembly for all your code, with detailed information on sizes and performance. You can look into .map file, which has all the generated global symbols. Using it is really easy to calculate sizes of every function from the code or bigger blocks. You also can use individal .rel files which show their binary size in their header (they are text files).

Thanks @ronaldo.
That's really good to know.

ervin · 16:36, 18 August 16

Optimisation of all the tight loops is nearly done now.

It takes around 5.2 seconds to draw the eye 3 times in mode 0, which I'm *very* happy with!
(It used to take 18!)
It will go under 5 seconds once finished.

In mode 1, the eye takes 7.7 seconds to draw 3 times, while in mode 2 it takes 11.8 seconds.
However modes 1 and 2 are one step behind mode 0 in optimisation (though I am being very careful to make sure all optimisations are kept in sync between the 3 modes).

Clipping optimisation has come a long way as well, and clipping is very fast now.

Once the speed optimisation is done, it'll be time to work on size.
The generated binary is too big, but there are lots of things I can do to change that.

ervin · 02:33, 19 August 16

The mode 0 program is now down to 7,113 bytes (including test code).
A few days ago it was 8,845 bytes!

Still got some more to do...

ervin · 16:07, 19 August 16

Alrighty, speed optimisation is pretty much done!
I may find minor things as I work on shrinking the program, but any further speed improvements I might discover probably won't be noticeable or measureable.

The following times represent the number of seconds to draw the eye 3 times.
Mode 0: 4.7
Mode 1: 6.4
Mode 2: 10.0

Very, very happy with the speed now!

Also, the size of the produced binaries continues to come down.
(Remember, the binaries also contain a lot of testing code).
Mode 0: 7,027 bytes
Mode 1: 7,063
Mode 2: 6,977

ervin · 13:56, 20 August 16

OOPS! I accidentally edited this post and erased everything in it!
The post that was here previously, mentioned a time of 4.4 seconds for the mode 0 eye to be drawn 3 times.

ervin · 01:52, 22 August 16

Alrighty, I changed the code around a bit, and removed some stuff that wasn't really needed.
The mode 0 binary is now down to 6,580 bytes, including test code.

But the really cool thing is that I found a couple more optimisations!
I had 3 little ideas over the weekend that I implemented carefully into one of the (4 main) routines.
And now the mode 0 eye is drawn 3 times in 3.9 seconds!!!

Now I need to put the same optimisations into the other 3 routines, and perhaps we'll get close to 1 second per-eye!
That would be mindblowing!

I never imagined that I would figure out how to make it run anything like this fast!

[EDIT] Hmmm... apart from giving away the source code, I think I should start thinking about making a game with this...

[EDIT#2] Oh well... looks like 3.6 seconds (for drawing the mode 0 eye 3 times) is as fast as it will get. Still, it's *very* fast, so I'm chuffed.
Now it's time to apply 2 of these optimisations to the clipping routines.

Docent · 16:43, 23 August 16

Quote from: ervin on 01:52, 22 August 16
Alrighty, I changed the code around a bit, and removed some stuff that wasn't really needed.
The mode 0 binary is now down to 6,580 bytes, including test code.

But the really cool thing is that I found a couple more optimisations!
I had 3 little ideas over the weekend that I implemented carefully into one of the (4 main) routines.
And now the mode 0 eye is drawn 3 times in 3.9 seconds!!!

Now I need to put the same optimisations into the other 3 routines, and perhaps we'll get close to 1 second per-eye!
That would be mindblowing!

I never imagined that I would figure out how to make it run anything like this fast!

Hi Ervin,
Do you have the compiled test code available somewhere? I could run it through my z80 profiler and provide results to help you with optimizations.

SRS · 19:45, 23 August 16

can we include this profiler into i.e. cpctelera toolchain ?

ervin · 23:58, 23 August 16

Quote from: Docent on 16:43, 23 August 16
Hi Ervin,
Do you have the compiled test code available somewhere? I could run it through my z80 profiler and provide results to help you with optimizations.

That sounds very interesting!
I'd love to see the results from something like that.
(I'm also very curious how such a profiler would work).

I've added a new download (bres_2.zip) to the first post in this thread.

Docent · 04:06, 25 August 16

Quote from: ervin on 23:58, 23 August 16

That sounds very interesting!
I'd love to see the results from something like that.
(I'm also very curious how such a profiler would work).

I've added a new download (bres_2.zip) to the first post in this thread.

Here's the result. I have to break it at #8041 after one loop. The numbers are in machine cycles, first goes the timing of executing one instruction, followed by total execution time.
The obvious place for optimization is l85c7 - it is called 4800 times and consumes 96% of execution time. Inside it there are l9689 and l91f0 taking the major amount of time. Send me a cdb file generated by sdcc for this code and I'll generate more meaningful disassembly with source and names.
The disassembly is generated by The Profiler - call-graph profiler and debugging simulator. It basically simulates execution of each instruction of a cpu and during it tracks calls, generates labels and counts the execution time, memory accesses, coverage etc.
btw - the coverage is around 47%, so there is a lot of unused code inside.

Docent · 04:17, 25 August 16

Quote from: SRS on 19:45, 23 August 16
can we include this profiler into i.e. cpctelera toolchain ?

It is windows ui application only, so probably unlikely...

ervin · 05:02, 25 August 16

Quote from: Docent on 04:06, 25 August 16
Here's the result. I have to break it at #8041 after one loop. The numbers are in machine cycles, first goes the timing of executing one instruction, followed by total execution time.
The obvious place for optimization is l85c7 - it is called 4800 times and consumes 96% of execution time. Inside it there are l9689 and l91f0 taking the major amount of time. Send me a cdb file generated by sdcc for this code and I'll generate more meaningful disassembly with source and names.
The disassembly is generated by The Profiler - call-graph profiler and debugging simulator. It basically simulates execution of each instruction of a cpu and during it tracks calls, generates labels and counts the execution time, memory accesses, coverage etc.
btw - the coverage is around 47%, so there is a lot of unused code inside.

Hmmm... very very interesting.
Thanks for taking the time to look into this.

The version I uploaded contains other tests, but access to them has been disabled while I concentrate on speeding up the routines used to draw the eye. That's probably why you only found 47% coverage.

Now, the routine that is called 4800 times... that's very likely the main drawLine routine, which goes on to call other routines (based on the octant the line sits in etc.)

The code called by drawLine is absolutely the busiest part of the code.
I've got it set to draw the eye 30 times per test, and each eye contains many dozens of lines.
So the bulk of the test is concerned with traversing Bresenham's algorithm.

Curious, what's a cdb file?
I'm not sure where to find it.

[EDIT] I found information about cdb files.
I'll hopefully have time tonight to generate a new DSK file, along with the CDB file.

Docent · 13:41, 25 August 16

Quote from: ervin on 05:02, 25 August 16
[EDIT] I found information about cdb files.
I'll hopefully have time tonight to generate a new DSK file, along with the CDB file.

Please just send the binary (or rel) and cdb file if possible - there is no need to put them on dsk, in such case I have to extract them back from dsk.

TFM · 15:13, 25 August 16

Quote from: ervin on 05:02, 25 August 16
[EDIT] I found information about cdb files.
I'll hopefully have time tonight to generate a new DSK file, along with the CDB file.

A DSK would be very nice! Thank you!

ervin · 15:25, 25 August 16

Quote from: TFM on 15:13, 25 August 16

A DSK would be very nice! Thank you!

To see how fast the mode 0 eye is drawn now, download bres_2.zip from the first post in this thread.

ervin · 15:30, 25 August 16

Quote from: Docent on 13:41, 25 August 16
Please just send the binary (or rel) and cdb file if possible - there is no need to put them on dsk, in such case I have to extract them back from dsk.

Took a few moments to figure out how to create cdb files from cpctelera's build_config.mk file.
Anyway, the attached file contains the cdb, rel and bin files.

TFM · 15:35, 25 August 16

Quote from: ervin on 15:25, 25 August 16
To see how fast the mode 0 eye is drawn now, download bres_2.zip from the first post in this thread.

That is really good! Now the CPC would have needed that for BASIC or for the demonstration tape/disc coming with the computer.

ervin · 15:36, 25 August 16

Quote from: TFM on 15:35, 25 August 16

That is really good! Now the CPC would have needed that for BASIC or for the demonstration tape/disc coming with the computer.

Thanks.

I just wish I could make it a bit faster... I've taken it to the limit of my z80 abilities.

SRS · 18:22, 25 August 16

Quote from: ervin on 15:36, 25 August 16
I just wish I could make it a bit faster... I've taken it to the limit of my z80 abilities.

Did you try this ? for Mode 0 it is "enough"

http://www.edepot.com/lineex.html

Code Select


// Extremely Fast Line Algorithm Var E (Addition Fixed Point PreCalc ModeX) // Copyright 2001-2, By Po-Han Lin

 // Freely useable in non-commercial applications as long as credits // to Po-Han Lin and link to [url=http://www.edepot.com]http://www.edepot.com[/url] is provided in // source code and can been seen in compiled executable. // Commercial applications please inquire about licensing the algorithms. // // Lastest version at [url=http://www.edepot.com/phl.html]Technology Depot[/url] // Note: This version is for small displays like on cell phones. // with 256x256 max resolution.  For displays up to 65536x65536 // please visit [url=http://www.edepot.com/linee.html]http://www.edepot.com/linee.html[/url] // used by myLine void myPixel(SURFACE* surface, int x,int y) { // PLOT x,y point on surface } // THE EXTREMELY FAST LINE ALGORITHM Variation E (Addition Fixed Point PreCalc Small Display) // Small Display (256x256) resolution. void myLine(SURFACE* surface, int x, int y, int x2, int y2) { bool yLonger=false; int shortLen=y2-y; int longLen=x2-x; if (abs(shortLen)>abs(longLen)) { int swap=shortLen; shortLen=longLen; longLen=swap; yLonger=true; } int decInc; if (longLen==0) decInc=0; else decInc = (shortLen << 8) / longLen; if (yLonger) { if (longLen>0) { longLen+=y; for (int j=0x80+(x<<8);y<=longLen;++y) { myPixel(surface,j >> 8,y); j+=decInc; } return; } longLen+=y; for (int j=0x80+(x<<8);y>=longLen;--y) { myPixel(surface,j >> 8,y); j-=decInc; } return; } if (longLen>0) { longLen+=x; for (int j=0x80+(y<<8);x<=longLen;++x) { myPixel(surface,x,j >> 8); j+=decInc; } return; } longLen+=x; for (int j=0x80+(y<<8);x>=longLen;--x) { myPixel(surface,x,j >> 8); j-=decInc; } }

ervin · 23:57, 25 August 16

Quote from: SRS on 18:22, 25 August 16
Did you try this ? for Mode 0 it is "enough"

http://www.edepot.com/lineex.html

Code Select Expand
// Extremely Fast Line Algorithm Var E (Addition Fixed Point PreCalc ModeX) // Copyright 2001-2, By Po-Han Lin // Freely useable in non-commercial applications as long as credits // to Po-Han Lin and link to [url=http://www.edepot.com]http://www.edepot.com[/url] is provided in // source code and can been seen in compiled executable. // Commercial applications please inquire about licensing the algorithms. // // Lastest version at [url=http://www.edepot.com/phl.html]Technology Depot[/url] // Note: This version is for small displays like on cell phones. // with 256x256 max resolution. For displays up to 65536x65536 // please visit [url=http://www.edepot.com/linee.html]http://www.edepot.com/linee.html[/url] // used by myLine void myPixel(SURFACE* surface, int x,int y) { // PLOT x,y point on surface } // THE EXTREMELY FAST LINE ALGORITHM Variation E (Addition Fixed Point PreCalc Small Display) // Small Display (256x256) resolution. void myLine(SURFACE* surface, int x, int y, int x2, int y2) { bool yLonger=false; int shortLen=y2-y; int longLen=x2-x; if (abs(shortLen)>abs(longLen)) { int swap=shortLen; shortLen=longLen; longLen=swap; yLonger=true; } int decInc; if (longLen==0) decInc=0; else decInc = (shortLen << / longLen; if (yLonger) { if (longLen>0) { longLen+=y; for (int j=0x80+(x<<;y<=longLen;++y) { myPixel(surface,j >> 8,y); j+=decInc; } return; } longLen+=y; for (int j=0x80+(x<<;y>=longLen;--y) { myPixel(surface,j >> 8,y); j-=decInc; } return; } if (longLen>0) { longLen+=x; for (int j=0x80+(y<<;x<=longLen;++x) { myPixel(surface,x,j >>; j+=decInc; } return; } longLen+=x; for (int j=0x80+(y<<;x>=longLen;--x) { myPixel(surface,x,j >>; j-=decInc; } }

Hmmm... that's an interesting algorithm.
Looks very simple too.
Thanks!

I'll put together a test version shortly...

Docent · 03:56, 28 August 16

Quote from: ervin on 15:30, 25 August 16
Took a few moments to figure out how to create cdb files from cpctelera's build_config.mk file.
Anyway, the attached file contains the cdb, rel and bin files.

Thanks, here's the file.
Unfortunately, your cdb doesn't contain address bindings for telera calls - you probably need to compile it together to generate cdb with all symbols.
Also If you decide to make the source of main.c available, send it on and I'll generate profiling file in mixed c source/disassembly mode that is much easier to analyze.
I have a quick look at your code and there is some space for optimization left

In the main draw loop you do not use the AF' register. The following code in drawSteepLineDownMode0

Code Select


l959d:
	inc iy						;10T	    715320T	FD 23 
	exx						;4T	    238440T	D9 
	ld a,ixh					;0T	    476880T	DD 7C 
	dec a						;4T	    238440T	3D 
	jr nz,l95b2					;7/12T	    598320T	20 0D 
	rlc c						;8T	    234000T	CB 01 
	inc hl						;6T	    234000T	23 
	ld ixh,#00					;0T	    351000T	DD 26 00 
	rrc b						;8T	    234000T	CB 08 
	exx						;4T	    117000T	D9 
	djnz l9572					;8/13T	    466320T	10 C2 
	jr l95c1					;12T	      2520T	18 0F 
l95b2:
	rrc c						;8T	    242880T	CB 09 
	inc ixh						;0T	    242880T	DD 24 
	rrc b						;8T	    242880T	CB 08 
	exx						;4T	    121440T	D9 
	djnz l9572					;8/13T	    485760T	10 B7

can be replaced with :

Code Select


(old code commented out; you also need to setup a' earlier)
l959d:
	inc iy						;10T	    715320T	FD 23 
	exx						;4T	    238440T	D9 
ex af,af'                                    ;4T       238440T
;	ld a,ixh					;0T	    476880T	DD 7C 
	dec a						;4T	    238440T	3D 
	jr nz,l95b2					;7/12T	    598320T	20 0D 
	rlc c						;8T	    234000T	CB 01 
	inc hl						;6T	    234000T	23 
xor a                                        ;4T       117000T
;	ld ixh,#00					;0T	    351000T	DD 26 00 
	rrc b						;8T	    234000T	CB 08 
l95ad:
ex af,af'                                    ;4T       117000T
	exx						;4T	    117000T	D9 
	djnz l9572					;8/13T	    466320T	10 C2 
	jr l95c1					;12T	      2520T	18 0F 
l95b2:
	rrc c						;8T	    242880T	CB 09 
inc a                                         ;4T               125440T
inc a                                         ;4T               125440T
;	inc ixh						;0T	    242880T	DD 24 
	rrc b						;8T	    242880T	CB 08 
jr l95ad                                     ;12T             364320T
; 	exx						;4T	    121440T	D9 
;	djnz l9572					;8/13T	    485760T	10 B7

you'll gain 590320 cycles and ~2% speedup of this routine. Applying this to drawSteepLineUpMode0, drawGentleLineUpMode0: and drawGentleLineDownMode0 can bring you up to 8% gain.
Perhaps discardLineMode0 and initMasksMode0 could be also moved out of the drawing loop?

ervin · 04:38, 28 August 16

Of course!!! I totally forgot about AF' !!!
Thanks for the tip.

Fabulous! Very happy with that suggestion!
I'll put those in shortly.

discardLine and initMasks are only run once per line; they aren't inside a loop per se, so I think they'll be ok.
However, I can probably make them a little more efficient, once the main tight loop(s) are finished.

Docent · 14:30, 28 August 16

Quote from: ervin on 04:38, 28 August 16
Of course!!! I totally forgot about AF' !!!
Thanks for the tip.

Fabulous! Very happy with that suggestion!
I'll put those in shortly.

discardLine and initMasks are only run once per line; they aren't inside a loop per se, so I think they'll be ok.
However, I can probably make them a little more efficient, once the main tight loop(s) are finished.

Also operations on ixl are expensive - you could try to replace it with a' for eg.

Code Select


; setup a' earlier
l9572:
	exx						;4T	    757680T	D9 
	ld a,(hl)					;7T	   1515360T	7E 
	and b						;4T	    757680T	A0 
	or c						;4T	    757680T	B1 
	ld (hl),a					;7T	   1515360T	77 
ex af,af'	;4T 757680T	
;	ld a,ixl					;0T	   1515360T	DD 7D 
	cp #07						;7T	   1515360T	FE 07 
	jr nz,l9589					;7/12T	   2178360T	20 0C 
	ld de,#c850					;10T	    284040T	11 50 C8 
	add hl,de					;11T	    284040T	19 
	ld de,#0800					;10T	    284040T	11 00 08 
xor a	         ;4T         94680
;	ld ixl,#00	 				;0T	    284040T	DD 2E 00 
	jr l958c					;12T	    284040T	18 03 
l9589:
	add hl,de					;11T	   1989000T	19 
inc a		 ;4T       663000
;	inc ixl						;0T	   1326000T	DD 2C 
l958c:
ex af,af' ;4T	    757680T
	exx						;4T	    757680T	D9

You'll get the main loop cost down by 852360 cycles, which is 5.5% of the main loop.
The best approach would be to completely replace operation on ixl with another register - for example replace de' with sp, and use it instead af' I suggested above. As sp is used later, replace it there with bc. This also require more changes in other parts of the loop - you'll need to replace djnz with another check, for example using a' or ixl as a loop counter.

for example:

Code Select


; init de' to 7
l9572:
	exx						;4T	    757680T	D9 
	ld a,(hl)					;7T	   1515360T	7E 
	and b						;4T	    757680T	A0 
	or c						;4T	    757680T	B1 
	ld (hl),a					;7T	   1515360T	77 
ld a,d		;4T	    757680T
cp e		;4T	    757680T
;	ld a,ixl					;0T	   1515360T	DD 7D 
;	cp #07						;7T	   1515360T	FE 07 
	jr nz,l9589					;7/12T	   2178360T	20 0C 
ld sp, #c850
;	ld de,#c850					;10T	    284040T	11 50 C8 
add hl,sp
;	add hl,de					;11T	    284040T	19 
ld sp,#800
;	ld de,#0800					;10T	    284040T	11 00 08 
ld d,0	;8T 	189360
;	ld ixl,#00					;0T	    284040T	DD 2E 00 
	jr l958c					;12T	    284040T	18 03 
l9589:
add hl,sp
;	add hl,de					;11T	   1989000T	19 
inc d	;4T 663000
;	inc ixl						;0T	   1326000T	DD 2C 
l958c:
	exx						;4T	    757680T	D9 
add hl,bc
;	add hl,sp					;11T	   2273040T	39 
	add hl,de					;11T	   2273040T	19

This change will speed up this loop by 2367720 cycles - over 15%!

Another thing to do is branch optimization - try changing the code to not take the branch in main loop, for eg. replace

Code Select


l9572:
	exx						;4T	    757680T	D9 
	ld a,(hl)					;7T	   1515360T	7E 
	and b						;4T	    757680T	A0 
	or c						;4T	    757680T	B1 
	ld (hl),a					;7T	   1515360T	77 
	ld a,ixl					;0T	   1515360T	DD 7D 
	cp #07						;7T	   1515360T	FE 07 
	jr nz,l9589					;7/12T	   2178360T	20 0C 
	ld de,#c850					;10T	    284040T	11 50 C8 
	add hl,de					;11T	    284040T	19 
	ld de,#0800					;10T	    284040T	11 00 08 
	ld ixl,#00					;0T	    284040T	DD 2E 00 
	jr l958c					;12T	    284040T	18 03 
l9589:
	add hl,de					;11T	   1989000T	19 
	inc ixl						;0T	   1326000T	DD 2C 
l958c:

with

Code Select


l9572:
	exx						;4T	    757680T	D9 
	ld a,(hl)					;7T	   1515360T	7E 
	and b						;4T	    757680T	A0 
	or c						;4T	    757680T	B1 
	ld (hl),a					;7T	   1515360T	77 
	ld a,ixl					;0T	   1515360T	DD 7D 
	cp #07						;7T	   1515360T	FE 07 
jr z, nextblock
^^^
l9589:
	add hl,de					;11T	   1989000T	19 
	inc ixl						;0T	   1326000T	DD 2C 
l958c:

...
nextblock
 	ld de,#c850					;10T	    284040T	11 50 C8 
	add hl,de					;11T	    284040T	19 
	ld de,#0800					;10T	    284040T	11 00 08 
	ld ixl,#00					;0T	    284040T	DD 2E 00 
	jr l958c					;12T	    284040T	18 03

the cost of not taking the branch in jr is 7 cycles against 12 when the condition is met. In this case ixl is a counter from 0 to 7 so in your original code you are taking the branch 7 times and not taking 1. The total cost for one full iteration is 92 cpc cycles. When you invert the check the cost will be 68 cpc cycles - 27% gain! Only this small code reorganization will speed up your main loop by another ~2%

ervin · 15:24, 28 August 16

That's tremendous stuff!
Thank you!

News:

drawing lines with cpcTelera / SDCC (V1.0 RELEASE)