News:

Printed Amstrad Addict magazine announced, check it out here!

Main Menu
avatar_ervin

drawing lines with cpcTelera / SDCC (V1.0 RELEASE)

Started by ervin, 12:57, 19 July 16

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

ronaldo

@ervin whenever you compile a CPCtelera project, the obj/ folder is generated. Inside you can find compiled assembly for all your code, with detailed information on sizes and performance. You can look into .map file, which has all the generated global symbols. Using it is really easy to calculate sizes of every function from the code or bigger blocks. You also can use individal .rel files which show their binary size in their header (they are text files).

ervin

Quote from: ronaldo on 20:13, 16 August 16
@ervin whenever you compile a CPCtelera project, the obj/ folder is generated. Inside you can find compiled assembly for all your code, with detailed information on sizes and performance. You can look into .map file, which has all the generated global symbols. Using it is really easy to calculate sizes of every function from the code or bigger blocks. You also can use individal .rel files which show their binary size in their header (they are text files).


Thanks @ronaldo.
That's really good to know.
8)


ervin

Optimisation of all the tight loops is nearly done now.

It takes around 5.2 seconds to draw the eye 3 times in mode 0, which I'm *very* happy with!
(It used to take 18!)
It will go under 5 seconds once finished.
8)

In mode 1, the eye takes 7.7 seconds to draw 3 times, while in mode 2 it takes 11.8 seconds.
However modes 1 and 2 are one step behind mode 0 in optimisation (though I am being very careful to make sure all optimisations are kept in sync between the 3 modes).

Clipping optimisation has come a long way as well, and clipping is very fast now.

Once the speed optimisation is done, it'll be time to work on size.
The generated binary is too big, but there are lots of things I can do to change that.

ervin

The mode 0 program is now down to 7,113 bytes (including test code).
A few days ago it was 8,845 bytes!
8)

Still got some more to do...

ervin

Alrighty, speed optimisation is pretty much done!
I may find minor things as I work on shrinking the program, but any further speed improvements I might discover probably won't be noticeable or measureable.

The following times represent the number of seconds to draw the eye 3 times.
Mode 0: 4.7
Mode 1: 6.4
Mode 2: 10.0

Very, very happy with the speed now!

Also, the size of the produced binaries continues to come down.
(Remember, the binaries also contain a lot of testing code).
Mode 0: 7,027 bytes
Mode 1: 7,063
Mode 2: 6,977

ervin

#55
OOPS! I accidentally edited this post and erased everything in it!
The post that was here previously, mentioned a time of 4.4 seconds for the mode 0 eye to be drawn 3 times.


ervin

#56
Alrighty, I changed the code around a bit, and removed some stuff that wasn't really needed.
The mode 0 binary is now down to 6,580 bytes, including test code.

But the really cool thing is that I found a couple more optimisations!
I had 3 little ideas over the weekend that I implemented carefully into one of the (4 main) routines.
And now the mode 0 eye is drawn 3 times in 3.9 seconds!!!

Now I need to put the same optimisations into the other 3 routines, and perhaps we'll get close to 1 second per-eye!
That would be mindblowing!
8)

I never imagined that I would figure out how to make it run anything like this fast!

[EDIT] Hmmm... apart from giving away the source code, I think I should start thinking about making a game with this...

[EDIT#2] Oh well... looks like 3.6 seconds (for drawing the mode 0 eye 3 times) is as fast as it will get. Still, it's *very* fast, so I'm chuffed.
Now it's time to apply 2 of these optimisations to the clipping routines.

Docent

Quote from: ervin on 01:52, 22 August 16
Alrighty, I changed the code around a bit, and removed some stuff that wasn't really needed.
The mode 0 binary is now down to 6,580 bytes, including test code.

But the really cool thing is that I found a couple more optimisations!
I had 3 little ideas over the weekend that I implemented carefully into one of the (4 main) routines.
And now the mode 0 eye is drawn 3 times in 3.9 seconds!!!

Now I need to put the same optimisations into the other 3 routines, and perhaps we'll get close to 1 second per-eye!
That would be mindblowing!
8)

I never imagined that I would figure out how to make it run anything like this fast!
Hi Ervin,
Do you have the compiled test code available somewhere? I could run it through my z80 profiler and provide results to help you with optimizations.

SRS

can we include this profiler into i.e. cpctelera toolchain ?

ervin

#59
Quote from: Docent on 16:43, 23 August 16
Hi Ervin,
Do you have the compiled test code available somewhere? I could run it through my z80 profiler and provide results to help you with optimizations.


That sounds very interesting!
I'd love to see the results from something like that.
(I'm also very curious how such a profiler would work).

I've added a new download (bres_2.zip) to the first post in this thread.
8)

Docent

#60
Quote from: ervin on 23:58, 23 August 16

That sounds very interesting!
I'd love to see the results from something like that.
(I'm also very curious how such a profiler would work).

I've added a new download (bres_2.zip) to the first post in this thread.
8)

Here's the result. I have to break it at #8041 after one loop. The numbers are in machine cycles, first goes the timing of  executing one instruction, followed by total execution time.
The obvious place for optimization is l85c7 - it is called 4800 times and consumes 96% of execution time. Inside it there are l9689 and l91f0 taking the major amount of time. Send me a cdb file generated by sdcc for this code and I'll generate more meaningful disassembly with source and names.
The disassembly is generated by The Profiler - call-graph profiler and debugging simulator. It basically simulates execution of each instruction of a cpu and during it tracks calls, generates labels and counts the execution time, memory accesses, coverage etc.
btw - the coverage is around 47%, so there is a lot of unused code inside.

Docent

Quote from: SRS on 19:45, 23 August 16
can we include this profiler into i.e. cpctelera toolchain ?

It is windows ui application only, so probably unlikely...

ervin

#62
Quote from: Docent on 04:06, 25 August 16
Here's the result. I have to break it at #8041 after one loop. The numbers are in machine cycles, first goes the timing of  executing one instruction, followed by total execution time.
The obvious place for optimization is l85c7 - it is called 4800 times and consumes 96% of execution time. Inside it there are l9689 and l91f0 taking the major amount of time. Send me a cdb file generated by sdcc for this code and I'll generate more meaningful disassembly with source and names.
The disassembly is generated by The Profiler - call-graph profiler and debugging simulator. It basically simulates execution of each instruction of a cpu and during it tracks calls, generates labels and counts the execution time, memory accesses, coverage etc.
btw - the coverage is around 47%, so there is a lot of unused code inside.

Hmmm... very very interesting.
Thanks for taking the time to look into this.

The version I uploaded contains other tests, but access to them has been disabled while I concentrate on speeding up the routines used to draw the eye. That's probably why you only found 47% coverage.

Now, the routine that is called 4800 times... that's very likely the main drawLine routine, which goes on to call other routines (based on the octant the line sits in etc.)

The code called by drawLine is absolutely the busiest part of the code.
I've got it set to draw the eye 30 times per test, and each eye contains many dozens of lines.
So the bulk of the test is concerned with traversing Bresenham's algorithm.

Curious, what's a cdb file?
I'm not sure where to find it.

[EDIT] I found information about cdb files.
I'll hopefully have time tonight to generate a new DSK file, along with the CDB file.

Docent

Quote from: ervin on 05:02, 25 August 16
[EDIT] I found information about cdb files.
I'll hopefully have time tonight to generate a new DSK file, along with the CDB file.

Please just send the binary (or rel) and cdb file if possible - there is no need to put them on dsk, in such case I have to extract them back from dsk.

TFM

Quote from: ervin on 05:02, 25 August 16
[EDIT] I found information about cdb files.
I'll hopefully have time tonight to generate a new DSK file, along with the CDB file.


A DSK would be very nice! Thank you!  :)
TFM of FutureSoft
Also visit the CPC and Plus users favorite OS: FutureOS - The Revolution on CPC6128 and 6128Plus

ervin

Quote from: TFM on 15:13, 25 August 16

A DSK would be very nice! Thank you!  :)

To see how fast the mode 0 eye is drawn now, download bres_2.zip from the first post in this thread.
:)

ervin

Quote from: Docent on 13:41, 25 August 16
Please just send the binary (or rel) and cdb file if possible - there is no need to put them on dsk, in such case I have to extract them back from dsk.

Took a few moments to figure out how to create cdb files from cpctelera's build_config.mk file.
Anyway, the attached file contains the cdb, rel and bin files.

TFM

Quote from: ervin on 15:25, 25 August 16
To see how fast the mode 0 eye is drawn now, download bres_2.zip from the first post in this thread.
:)


That is really good! Now the CPC would have needed that for BASIC or for the demonstration tape/disc coming with the computer.  :)
TFM of FutureSoft
Also visit the CPC and Plus users favorite OS: FutureOS - The Revolution on CPC6128 and 6128Plus

ervin

Quote from: TFM on 15:35, 25 August 16

That is really good! Now the CPC would have needed that for BASIC or for the demonstration tape/disc coming with the computer.  :)

Thanks.
;D

I just wish I could make it a bit faster... I've taken it to the limit of my z80 abilities.

SRS

Quote from: ervin on 15:36, 25 August 16
I just wish I could make it a bit faster... I've taken it to the limit of my z80 abilities.

Did you try this ? for Mode 0 it is "enough"

http://www.edepot.com/lineex.html


// Extremely Fast Line Algorithm Var E (Addition Fixed Point PreCalc ModeX) // Copyright 2001-2, By Po-Han Lin

// Freely useable in non-commercial applications as long as credits // to Po-Han Lin and link to [url=http://www.edepot.com]http://www.edepot.com[/url] is provided in // source code and can been seen in compiled executable. // Commercial applications please inquire about licensing the algorithms. // // Lastest version at [url=http://www.edepot.com/phl.html]Technology Depot[/url] // Note: This version is for small displays like on cell phones. // with 256x256 max resolution.  For displays up to 65536x65536 // please visit [url=http://www.edepot.com/linee.html]http://www.edepot.com/linee.html[/url] // used by myLine void myPixel(SURFACE* surface, int x,int y) { // PLOT x,y point on surface } // THE EXTREMELY FAST LINE ALGORITHM Variation E (Addition Fixed Point PreCalc Small Display) // Small Display (256x256) resolution. void myLine(SURFACE* surface, int x, int y, int x2, int y2) { bool yLonger=false; int shortLen=y2-y; int longLen=x2-x; if (abs(shortLen)>abs(longLen)) { int swap=shortLen; shortLen=longLen; longLen=swap; yLonger=true; } int decInc; if (longLen==0) decInc=0; else decInc = (shortLen << 8) / longLen; if (yLonger) { if (longLen>0) { longLen+=y; for (int j=0x80+(x<<8);y<=longLen;++y) { myPixel(surface,j >> 8,y); j+=decInc; } return; } longLen+=y; for (int j=0x80+(x<<8);y>=longLen;--y) { myPixel(surface,j >> 8,y); j-=decInc; } return; } if (longLen>0) { longLen+=x; for (int j=0x80+(y<<8);x<=longLen;++x) { myPixel(surface,x,j >> 8); j+=decInc; } return; } longLen+=x; for (int j=0x80+(y<<8);x>=longLen;--x) { myPixel(surface,x,j >> 8); j-=decInc; } }

ervin

Quote from: SRS on 18:22, 25 August 16
Did you try this ? for Mode 0 it is "enough"

http://www.edepot.com/lineex.html


// Extremely Fast Line Algorithm Var E (Addition Fixed Point PreCalc ModeX) // Copyright 2001-2, By Po-Han Lin

// Freely useable in non-commercial applications as long as credits // to Po-Han Lin and link to [url=http://www.edepot.com]http://www.edepot.com[/url] is provided in // source code and can been seen in compiled executable. // Commercial applications please inquire about licensing the algorithms. // // Lastest version at [url=http://www.edepot.com/phl.html]Technology Depot[/url] // Note: This version is for small displays like on cell phones. // with 256x256 max resolution.  For displays up to 65536x65536 // please visit [url=http://www.edepot.com/linee.html]http://www.edepot.com/linee.html[/url] // used by myLine void myPixel(SURFACE* surface, int x,int y) { // PLOT x,y point on surface } // THE EXTREMELY FAST LINE ALGORITHM Variation E (Addition Fixed Point PreCalc Small Display) // Small Display (256x256) resolution. void myLine(SURFACE* surface, int x, int y, int x2, int y2) { bool yLonger=false; int shortLen=y2-y; int longLen=x2-x; if (abs(shortLen)>abs(longLen)) { int swap=shortLen; shortLen=longLen; longLen=swap; yLonger=true; } int decInc; if (longLen==0) decInc=0; else decInc = (shortLen << / longLen; if (yLonger) { if (longLen>0) { longLen+=y; for (int j=0x80+(x<<;y<=longLen;++y) { myPixel(surface,j >> 8,y); j+=decInc; } return; } longLen+=y; for (int j=0x80+(x<<;y>=longLen;--y) { myPixel(surface,j >> 8,y); j-=decInc; } return; } if (longLen>0) { longLen+=x; for (int j=0x80+(y<<;x<=longLen;++x) { myPixel(surface,x,j >>; j+=decInc; } return; } longLen+=x; for (int j=0x80+(y<<;x>=longLen;--x) { myPixel(surface,x,j >>; j-=decInc; } }


Hmmm... that's an interesting algorithm.
Looks very simple too.
Thanks!

I'll put together a test version shortly...

Docent

#71
Quote from: ervin on 15:30, 25 August 16
Took a few moments to figure out how to create cdb files from cpctelera's build_config.mk file.
Anyway, the attached file contains the cdb, rel and bin files.

Thanks, here's the file.
Unfortunately, your cdb doesn't contain address bindings for telera calls - you probably need to compile it together to generate cdb with all symbols.
Also If you decide to make the source of main.c available, send it on and I'll generate profiling file in mixed c source/disassembly mode that is much easier to analyze.
I have a quick look at your code and there is some space for optimization left :)
In the main draw loop you do not use the AF' register. The following code in drawSteepLineDownMode0


l959d:
inc iy ;10T     715320T FD 23
exx ;4T     238440T D9
ld a,ixh ;0T     476880T DD 7C
dec a ;4T     238440T 3D
jr nz,l95b2 ;7/12T     598320T 20 0D
rlc c ;8T     234000T CB 01
inc hl ;6T     234000T 23
ld ixh,#00 ;0T     351000T DD 26 00
rrc b ;8T     234000T CB 08
exx ;4T     117000T D9
djnz l9572 ;8/13T     466320T 10 C2
jr l95c1 ;12T       2520T 18 0F
l95b2:
rrc c ;8T     242880T CB 09
inc ixh ;0T     242880T DD 24
rrc b ;8T     242880T CB 08
exx ;4T     121440T D9
djnz l9572 ;8/13T     485760T 10 B7

can be replaced with :

(old code commented out; you also need to setup a' earlier)
l959d:
inc iy ;10T     715320T FD 23
exx ;4T     238440T D9
ex af,af'                                    ;4T       238440T
; ld a,ixh ;0T     476880T DD 7C
dec a ;4T     238440T 3D
jr nz,l95b2 ;7/12T     598320T 20 0D
rlc c ;8T     234000T CB 01
inc hl ;6T     234000T 23
xor a                                        ;4T       117000T
; ld ixh,#00 ;0T     351000T DD 26 00
rrc b ;8T     234000T CB 08
l95ad:
ex af,af'                                    ;4T       117000T
exx ;4T     117000T D9
djnz l9572 ;8/13T     466320T 10 C2
jr l95c1 ;12T       2520T 18 0F
l95b2:
rrc c ;8T     242880T CB 09
inc a                                         ;4T               125440T
inc a                                         ;4T               125440T
; inc ixh ;0T     242880T DD 24
rrc b ;8T     242880T CB 08
jr l95ad                                     ;12T             364320T
; exx ;4T     121440T D9
; djnz l9572 ;8/13T     485760T 10 B7


you'll gain 590320 cycles and ~2% speedup of this routine. Applying this to drawSteepLineUpMode0, drawGentleLineUpMode0: and drawGentleLineDownMode0 can bring you up to 8% gain.
Perhaps discardLineMode0 and initMasksMode0 could be also moved out of the drawing loop?

ervin

Of course!!! I totally forgot about AF' !!!
Thanks for the tip.

Fabulous! Very happy with that suggestion!
I'll put those in shortly.

discardLine and initMasks are only run once per line; they aren't inside a loop per se, so I think they'll be ok.
However, I can probably make them a little more efficient, once the main tight loop(s) are finished.
8)

Docent

#73
Quote from: ervin on 04:38, 28 August 16
Of course!!! I totally forgot about AF' !!!
Thanks for the tip.

Fabulous! Very happy with that suggestion!
I'll put those in shortly.

discardLine and initMasks are only run once per line; they aren't inside a loop per se, so I think they'll be ok.
However, I can probably make them a little more efficient, once the main tight loop(s) are finished.
8)

Also operations on ixl are expensive - you could try to replace it with a' for eg.

; setup a' earlier
l9572:
exx ;4T     757680T D9
ld a,(hl) ;7T    1515360T 7E
and b ;4T     757680T A0
or c ;4T     757680T B1
ld (hl),a ;7T    1515360T 77
ex af,af' ;4T 757680T
; ld a,ixl ;0T    1515360T DD 7D
cp #07 ;7T    1515360T FE 07
jr nz,l9589 ;7/12T    2178360T 20 0C
ld de,#c850 ;10T     284040T 11 50 C8
add hl,de ;11T     284040T 19
ld de,#0800 ;10T     284040T 11 00 08
xor a          ;4T         94680
; ld ixl,#00 ;0T     284040T DD 2E 00
jr l958c ;12T     284040T 18 03
l9589:
add hl,de ;11T    1989000T 19
inc a ;4T       663000
; inc ixl ;0T    1326000T DD 2C
l958c:
ex af,af' ;4T     757680T
exx ;4T     757680T D9


You'll get the main loop cost down by 852360 cycles, which is 5.5% of the main loop.
The best approach would be to completely replace operation on ixl with another register - for example replace de' with sp, and use it instead af' I suggested above. As sp is used later, replace it there with bc. This also require more changes in other parts of the loop - you'll need to replace djnz with another check, for example using a' or ixl as a loop counter.

for example:

; init de' to 7
l9572:
exx ;4T     757680T D9
ld a,(hl) ;7T    1515360T 7E
and b ;4T     757680T A0
or c ;4T     757680T B1
ld (hl),a ;7T    1515360T 77
ld a,d ;4T     757680T
cp e ;4T     757680T
; ld a,ixl ;0T    1515360T DD 7D
; cp #07 ;7T    1515360T FE 07
jr nz,l9589 ;7/12T    2178360T 20 0C
ld sp, #c850
; ld de,#c850 ;10T     284040T 11 50 C8
add hl,sp
; add hl,de ;11T     284040T 19
ld sp,#800
; ld de,#0800 ;10T     284040T 11 00 08
ld d,0 ;8T 189360
; ld ixl,#00 ;0T     284040T DD 2E 00
jr l958c ;12T     284040T 18 03
l9589:
add hl,sp
; add hl,de ;11T    1989000T 19
inc d ;4T 663000
; inc ixl ;0T    1326000T DD 2C
l958c:
exx ;4T     757680T D9
add hl,bc
; add hl,sp ;11T    2273040T 39
add hl,de ;11T    2273040T 19


This change will speed up this loop by 2367720 cycles - over 15%!

Another thing to do is branch optimization - try changing the code to not take the branch in main loop, for eg. replace


l9572:
exx ;4T     757680T D9
ld a,(hl) ;7T    1515360T 7E
and b ;4T     757680T A0
or c ;4T     757680T B1
ld (hl),a ;7T    1515360T 77
ld a,ixl ;0T    1515360T DD 7D
cp #07 ;7T    1515360T FE 07
jr nz,l9589 ;7/12T    2178360T 20 0C
ld de,#c850 ;10T     284040T 11 50 C8
add hl,de ;11T     284040T 19
ld de,#0800 ;10T     284040T 11 00 08
ld ixl,#00 ;0T     284040T DD 2E 00
jr l958c ;12T     284040T 18 03
l9589:
add hl,de ;11T    1989000T 19
inc ixl ;0T    1326000T DD 2C
l958c:


with

l9572:
exx ;4T     757680T D9
ld a,(hl) ;7T    1515360T 7E
and b ;4T     757680T A0
or c ;4T     757680T B1
ld (hl),a ;7T    1515360T 77
ld a,ixl ;0T    1515360T DD 7D
cp #07 ;7T    1515360T FE 07
jr z, nextblock
^^^
l9589:
add hl,de ;11T    1989000T 19
inc ixl ;0T    1326000T DD 2C
l958c:

...
nextblock
ld de,#c850 ;10T     284040T 11 50 C8
add hl,de ;11T     284040T 19
ld de,#0800 ;10T     284040T 11 00 08
ld ixl,#00 ;0T     284040T DD 2E 00
jr l958c ;12T     284040T 18 03


the cost of not taking the branch in jr is 7 cycles against 12 when the condition is met. In this case ixl is a counter from 0 to 7 so in your original code you are taking the branch 7 times and not taking 1. The total cost for one full iteration is 92 cpc cycles. When you invert the check the cost will be 68 cpc cycles - 27% gain! Only this small code reorganization will speed up your main loop by another ~2%

ervin

That's tremendous stuff!
Thank you!

Powered by SMFPacks Menu Editor Mod