Author Topic: the fast division  (Read 4202 times)

0 Members and 1 Guest are viewing this topic.

Offline litwr

  • CPC6128
  • ****
  • Posts: 162
  • Country: ru
    • lidovski's www page
  • Liked: 131
  • Likes Given: 131
the fast division
« on: 17:36, 14 November 15 »
I've just added a new fast 32/16->32:16 division to CPCwiki - Programming:Integer Division - CPCWiki
It is 3-5 times faster than the older one presented there - Programming:Integer Division - CPCWiki
It is slightly odd that the new fast division is written purely in Intel 8080 codes.  :o   So I hope somebody may find several z80 features which may add more acceleration to 8080.
BTW the divisions at SourceForge.net Repository - [z88dk] Index of are incompatible with AMSDOS and miss 32/16 bit division.  :(

Offline AMSDOS

  • Supporter
  • 6128 Plus
  • *
  • Posts: 3.900
  • Country: au
    • index.php?action=treasury
    • Programs for Turbo Pascal 3
  • Liked: 1112
  • Likes Given: 1879
Re: the fast division
« Reply #1 on: 22:57, 14 November 15 »
BTW the divisions at SourceForge.net Repository - [z88dk] Index of are incompatible with AMSDOS and miss 32/16 bit division.  :(


Don't seem to be getting out of that link (appears to be broken).


Not sure what's happening with that z88dk and can only presume anyone who was using it, is now using CPCtelera?


EDIT: Just checked SourceForge, looks like their doing some upgrades to their site.
« Last Edit: 22:59, 14 November 15 by AMSDOS »
* Using the old Amstrad Languages :D   * with the Firmware :P
* I also like to problem solve code in BASIC :)   * And type-in Type-Ins! :D

Home Computing Weekly Programs
Popular Computing Weekly Programs
Your Computer Programs
Updated Other Program Links on Profile Page (Update April 16/15 phew!)
Programs for Turbo Pascal 3

Offline reidrac

  • Supporter
  • 6128 Plus
  • *
  • Posts: 905
  • Country: gb
  • Trying to gamedev!
    • index.php?action=treasury
    • usebox.net
  • Liked: 1667
  • Likes Given: 896
Re: the fast division
« Reply #2 on: 23:08, 14 November 15 »
Not sure what's happening with that z88dk and can only presume anyone who was using it, is now using CPCtelera?

I use it for ZX Spectrum development (and it targets other platforms too). I started to use it for CPC cross-development, but the CPC support is even more hairy than the Speccy one so I moved everything to use SDCC (although I don't use cpctelera). AFAIK next version will support SDCC as backend (don't really understand how it works).

In mi experience SDCC is not brilliant either, and in some points it's been worse than Z88DK (as in generating broken code; Z88DK tends to segfault instead!); but when it works as expected, it is OK and for me is better than plain ASM!

EDIT: sorry, I digress. My point was that Z88DK is not completely dead yet and it is used to cross-develop for other platforms (eg, ZX Spectrum; it includes SP1 that is a quite nice tile/sprite library).
« Last Edit: 23:10, 14 November 15 by reidrac »
Released The Return of Traxtor, Golden Tail, Magica, The Dawn of Kernel, Kitsune`s Curse and Brick Rick for the CPC.

If you like my games and want to show some appreciation, you can always buy me a coffee.

Offline AMSDOS

  • Supporter
  • 6128 Plus
  • *
  • Posts: 3.900
  • Country: au
    • index.php?action=treasury
    • Programs for Turbo Pascal 3
  • Liked: 1112
  • Likes Given: 1879
Re: the fast division
« Reply #3 on: 23:28, 14 November 15 »
I generally keep a rule that if a language is being used, it's not really dead.  :)


z88dk also has an advantage of being translated from CPC BASIC 3:)
* Using the old Amstrad Languages :D   * with the Firmware :P
* I also like to problem solve code in BASIC :)   * And type-in Type-Ins! :D

Home Computing Weekly Programs
Popular Computing Weekly Programs
Your Computer Programs
Updated Other Program Links on Profile Page (Update April 16/15 phew!)
Programs for Turbo Pascal 3

Offline ervin

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.356
  • Country: au
    • index.php?action=treasury
  • Liked: 1016
  • Likes Given: 1201
Re: the fast division
« Reply #4 on: 09:10, 15 November 15 »
I generally keep a rule that if a language is being used, it's not really dead.  :)

z88dk also has an advantage of being translated from CPC BASIC 3:)

I think it's ccz80 that you're referring to, not z88dk.
 ;)
My (cancelled) entry for the CPCRetroDev 2017 Competition http://www.cpcwiki.eu/forum/programming/my-cpcretrodev-2017-entry/
FAST line drawing in CPCtelera http://www.cpcwiki.eu/forum/programming/drawing-lines-with-cpctelera-sdcc/
RUNCPC My entry for the CPCRetroDev 2015 Competition http://www.cpc-power.com/index.php?page=detail&num=12494

Offline AMSDOS

  • Supporter
  • 6128 Plus
  • *
  • Posts: 3.900
  • Country: au
    • index.php?action=treasury
    • Programs for Turbo Pascal 3
  • Liked: 1112
  • Likes Given: 1879
Re: the fast division
« Reply #5 on: 11:03, 15 November 15 »
I think it's ccz80 that you're referring to, not z88dk.
 ;)


* Using the old Amstrad Languages :D   * with the Firmware :P
* I also like to problem solve code in BASIC :)   * And type-in Type-Ins! :D

Home Computing Weekly Programs
Popular Computing Weekly Programs
Your Computer Programs
Updated Other Program Links on Profile Page (Update April 16/15 phew!)
Programs for Turbo Pascal 3

Offline litwr

  • CPC6128
  • ****
  • Posts: 162
  • Country: ru
    • lidovski's www page
  • Liked: 131
  • Likes Given: 131
Re: the fast division
« Reply #6 on: 18:55, 16 November 15 »
I've just added z80 features and other improvements.  I hope it is one of the best z80 division.  ;D

Offline Alcoholics Anonymous

  • CPC664
  • ***
  • Posts: 51
  • Country: ca
  • Liked: 44
  • Likes Given: 0
Re: the fast division
« Reply #7 on: 06:01, 20 November 15 »
It is slightly odd that the new fast division is written purely in Intel 8080 codes.  :o   So I hope somebody may find several z80 features which may add more acceleration to 8080.
BTW the divisions at SourceForge.net Repository - [z88dk] Index of are incompatible with AMSDOS and miss 32/16 bit division.  :(

Once the operands are large enough, you need those extra registers on the z80 for best performance.  In C, 32/16 division does not occur because the compiler will promote the int to long type before the division.  Those routines are meant to be the small integer math implementation - there is a fast integer math implementation that costs about 1k (not including the unrolled version) here:

SourceForge.net Repository - [z88dk] Index of

That one is a lot messier to look at but what it does is first try to reduce the operation, ie change 32x32 mult to 8x8 or all points between if it can.  Then instead of jumping into the loop, it eliminates leading zero bits first.  Then it sets up the loop as if the first loop operation has already occurred and then the general loop is entered.  Because it does all sorts of operand sizes in 8-bit increments, it can use specialized code for those sizes.  Then there is the option to unroll loops but this is very expensive in terms of memory.

It can be a lot faster - you can have a look at some benchmarking results at z88dk.org .  That program computes pi to 800 decimal digits using an algorithm that contains a lot of 32-bit divisions and some multiplies.  It's not the fastest algorithm for computing pi but it is really good for measuring 32-bit integer performance.

From the benchmark you can see the fast integer library is about 2.5 times faster than the small integer math library.  Also take a look at the sdcc result:  on its own, sdcc is 6 times slower than sdcc in combination with z88dk and the fast library.. it's the cost of sdcc's library being implemented in C rather than assembler.


Offline Alcoholics Anonymous

  • CPC664
  • ***
  • Posts: 51
  • Country: ca
  • Liked: 44
  • Likes Given: 0
Re: the fast division
« Reply #8 on: 06:24, 20 November 15 »
Not sure what's happening with that z88dk and can only presume anyone who was using it, is now using CPCtelera?
EDIT: Just checked SourceForge, looks like their doing some upgrades to their site.

It's very active gents :)  (I'm one of the z88dk devs you see)  There's a lot of activity it's just that it's not on the sf page and the download at sf is deprecated in favour of the nightly builds.  New documentation with front page is under construction:

temp:front [z88dk]


sdcc provides a standards-compliant optimizing C compiler with a partial library implemented in C.
z88dk provides a small-C derived compiler with a comprehensive library written in assembly language.

We put the two together :)  But we did more than that -- we improve the code that sdcc generates so that it is 5-15% smaller and faster (just the C code) and we link sdcc to the z88dk libraries so it has a much more complete c library written in assembly language.  The combination makes it seem very similar to 32-bit compilers in features.  We're just missing the disk i/o which is coming soon.

sdcc also gets access to z88dk's crts.  This means you can just say "build for cpm" or "build for zx spectrum" and it will do so.  There are only three targets completed with sdcc compatibility -- embedded, cpm and zx no cpc yet, sorry.  The crts are also very flexible -- it can build using a ram model, for instance, which is most appropriate for the home computers that run programs from ram.  A ram build strips sdcc's initializer stuff out which also leads to smaller binaries.  You can choose heap size, code origin, data origin, stack location, etc with pragmas in the C source code, though the crt will have common defaults for the selected target.  You also get to customize the c library, for instance you can eliminate printf or scanf convertes your program does not use in order to reduce binary size further or you can choose between fast and slow integer math, special functions for ascii<->text conversion and so on.

You can list as many c source, asm files or object files on the compile line that you want and the compiler will magically put it all together for you.  Since the back end is not asz80 but rather z80asm, assembly language uses Zilog mnemonics.  You can create as many sections as you want to map to bankswitched memory and the linker will generate independent binaries for each section so you can place that in memory.  No hex2bin.  Within the next few days it will also be generating relocate maps so that output binaries can be dynamically patched to run at any address.  We're doing this specifically for symbos but there are other targets and applications that could use this too.

And, of course, it works just as well on windows.

It sounds a bit ad-ish, sorry, but historically z88dk has been very bad at documenting things so people tend to be unaware of what's in there.

In mi experience SDCC is not brilliant either, and in some points it's been worse than Z88DK (as in generating broken code; Z88DK tends to segfault instead!); but when it works as expected, it is OK and for me is better than plain ASM!

Under z88dk we've fixed a couple of the worst bugs with the peephole optimizer.  They have to do with loading and storing values to static memory.  With that in place I haven't run into any (silent) sdcc faults of late and I've been doing a lot of compiling lately to check where the code quality can be improved. 

z88dk's compiler (sccz80) is generally compiling to smaller code.  We've started to do a few comparisons.  It used to almost always be smaller (and sometimes 20% smaller!) but we've added many peephole rules that improve sdcc's code when dealing with statics in particular and now it's not always the case that sccz80 produces smaller code.   The two are generally not too far apart now.

Quote
AFAIK next version will support SDCC as backend (don't really understand how it works).

It's easy:

zcc +zx -vn -SO3 -clib=sdcc_ix --reserve-regs-iy --max-allocs-per-node200000 prog.c -o prog
for zx spectrum build using sdcc (including sp1)  The output will be a binary "prog_CODE.bin" with ORG 32768.

zcc +cpm -vn -SO3 -clib=sdcc_iy --max-allocs-per-node200000 prog.c -o prog
for cpm build using sdcc.  The output will be "prog_CODE.bin" which can be renamed "prog.com"

After that you can look at choosing an appropriate crt (with -startup=n) and select appropriate lib and crt options to get rid of stuff you don't need.  Especially on the zx target you can shave quite a bit off executables with appropriate selections.

You need a special sdcc build for this to work and if you're on windows, the "zsdcc" exe is available for download.  Just copy that to sdcc's bin directory and you're good to go.
« Last Edit: 07:24, 20 November 15 by Alcoholics Anonymous »

Offline AMSDOS

  • Supporter
  • 6128 Plus
  • *
  • Posts: 3.900
  • Country: au
    • index.php?action=treasury
    • Programs for Turbo Pascal 3
  • Liked: 1112
  • Likes Given: 1879
Re: the fast division
« Reply #9 on: 10:01, 20 November 15 »
It's very active gents :)  (I'm one of the z88dk devs you see)  There's a lot of activity it's just that it's not on the sf page and the download at sf is deprecated in favour of the nightly builds.  New documentation with front page is under construction:


No chance of a CPC BASIC 3<->Z88DK Joint Venture or visa-versa?  8)
* Using the old Amstrad Languages :D   * with the Firmware :P
* I also like to problem solve code in BASIC :)   * And type-in Type-Ins! :D

Home Computing Weekly Programs
Popular Computing Weekly Programs
Your Computer Programs
Updated Other Program Links on Profile Page (Update April 16/15 phew!)
Programs for Turbo Pascal 3

Offline Alcoholics Anonymous

  • CPC664
  • ***
  • Posts: 51
  • Country: ca
  • Liked: 44
  • Likes Given: 0
Re: the fast division
« Reply #10 on: 20:05, 20 November 15 »
No chance of a CPC BASIC 3<->Z88DK Joint Venture or visa-versa?  8)

Well it's not hard to connect the two projects.  CPC BASIC 3 can generate asm output which z88dk can assemble and link against its libraries.  But it's really up to the author of CPC BASIC 3 if it's advantageous to make use of the z88dk libraries.  He could get a lot of code for free -- floating point math, eg, without going through rsx.  But we don't have a cpc crt that can cooperate with the firmware yet.  The library freely uses all z80 features, which means it's using the EXX registers for example and CPC BASIC seems to be built on using the firmware to generate output.  One of the main reasons we don't have an "invisible" cpc crt is because we (the developers) are not terribly familiar with the cpc so we would have to do a lot of research into what parts of memory we're allowed to use, what we need to do to cooperate with firmware, if it makes sense to ignore the firmware in some projects and so on.  Using the z88dk libraries would mean solving some of those issues.

I think it's Dinoneno that is behind CPC BASIC 3 isn't it?  He's also behind ccz80.  I did mention to him when ccz80 came out that we have stuff like C++ containers in z88dk (vectors, linked lists, etc) that might fit will in a class language but he wasn't too interested.  I think he's more interested in keeping things simple and clean.



Offline AMSDOS

  • Supporter
  • 6128 Plus
  • *
  • Posts: 3.900
  • Country: au
    • index.php?action=treasury
    • Programs for Turbo Pascal 3
  • Liked: 1112
  • Likes Given: 1879
Re: the fast division
« Reply #11 on: 07:14, 21 November 15 »
I think because CPC BASIC 3 takes existing code and then transforms it into assembly, code is simply written in assembly and if it can optimise the code if it can be done. It's bounded in some extent to which maybe implemented into a future release, though it's simply gets by, by having no library, the assembly simply shows a whole line of Firmware the program addresses. I'm guessing half the battle with having the Firmware included means including the IndeX register along with it, which carries a compromise on Speed. CPC BASIC 3 simply gets around this by using the Registers necessarily for each Firmware address in it's direct approach. Small-C itself has an AMSDOS library (CPCIOLIB.C) which consists of lots of operations the firmware, though to get to that requires use of the Index Register.
* Using the old Amstrad Languages :D   * with the Firmware :P
* I also like to problem solve code in BASIC :)   * And type-in Type-Ins! :D

Home Computing Weekly Programs
Popular Computing Weekly Programs
Your Computer Programs
Updated Other Program Links on Profile Page (Update April 16/15 phew!)
Programs for Turbo Pascal 3

Offline litwr

  • CPC6128
  • ****
  • Posts: 162
  • Country: ru
    • lidovski's www page
  • Liked: 131
  • Likes Given: 131
Re: the fast division
« Reply #12 on: 20:59, 22 November 15 »
In C, 32/16 division does not occur because the compiler will promote the int to long type before the division.
It doesn't occur in the plain translation but it always possible to use optimization for the special cases.  ;)
It can be a lot faster - you can have a look at some benchmarking results at z88dk.org .  That program computes pi to 800 decimal digits using an algorithm that contains a lot of 32-bit divisions and some multiplies.  It's not the fastest algorithm for computing pi but it is really good for measuring 32-bit integer performance.
Are these results for Amstrad CPC6128?
I've just made an assembler program with Basic interface for AMSDOS.  It can calculate the number pi with up to 3000 digits and all digits are right.  It uses 32/16 unrolled division and the multiplication with 16 KB tables.   ;D It calculates 800 digits during 259 sec = 4 min 19 sec
However the multiplication is the minor time consumer.  The division takes most of the CPU cycles.
Just RUN"pi at the first time and just RUN for other times.
Should somebody announce this program as application?  :)
« Last Edit: 21:08, 22 November 15 by litwr »

Offline Alcoholics Anonymous

  • CPC664
  • ***
  • Posts: 51
  • Country: ca
  • Liked: 44
  • Likes Given: 0
Re: the fast division
« Reply #13 on: 00:19, 23 November 15 »
It doesn't occur in the plain translation but it always possible to use optimization for the special cases.  ;)

That's what the fast integer library does.  Step one is trying to reduce the division by byte-size increments so although a division might start at 32/32 at can be reduced all the way down to 8/8.

Quote
Are these results for Amstrad CPC6128?

They are for a generic 4MHz z80 target.  The timing is done by a program called "ticks" which simulates the z80 code and counts executed cycles exactly.  You get an exact execution time but it assumes there is no, eg, cycle stretching or wait states.  So on a CPC6128 you may need to derate those numbers a bit.

Quote
I've just made an assembler program with Basic interface for AMSDOS.  It can calculate the number pi with up to 3000 digits and all digits are right.  It uses 32/16 unrolled division and the multiplication with 16 KB tables.   ;D It calculates 800 digits during 259 sec = 4 min 19 sec
However the multiplication is the minor time consumer.  The division takes most of the CPU cycles.

I'm surprised actually.  But the 16k cost is very high :)

I ran the program again this time on max settings which includes unrolling.  The unrolling is always 8 times, no more.  The result I got was 5min23s.  This is in comparison to the not-unrolled version at 6min07s.  For those not following, the pi benchmark can be found here.  The executable size increased to 8822 bytes from 6154 bytes for the small integer divide so the cost of this unrolled fast division and the unrolled multiplication is about 2700 bytes.  Of the total size, 5602 bytes are for an array.

And the C code being timed is this:

Code: [Select]
int main()
{
static uint16_t r[2800 + 1];


static uint16_t i, k;
static uint16_t b;
static uint32_t d;
static uint16_t c;

static ldivu_t res;

#asm
ticks_start:
#endasm


c = 0;

for (i = 0; i < 2800; ++i)
r[i] = 2000;

for (k = 2800; k > 0; k -= 14)
{
d = 0;
i = k;

while (1)
{
d += (uint32_t)(r[i]) * 10000UL;
b = i * 2 - 1;


_ldivu__callee(&res,d,(uint32_t)(b));

r[i] = res.rem;
d = res.quot;

if (--i == 0) break;

d *= (uint32_t)(i);
}


_ldivu__callee(&res,d,10000UL);
c = res.rem;
}

#asm
ticks_end:
#endasm

return 0;
}

Timing is done between the labels "ticks_start" and "ticks_end".


z88dk's fast integer divide for 32/16 starts on line 420 of this file .  This is not the unrolled version which is further up as that one is hard to see what is going on.  The bonus of trying to reduce the division in the first step is that it is known that the dividend is exactly 32 bits (or more precisely 25-32) and the divisor is exactly 16 bits (or more precisely 9-16).  This means it is known the quotient will be 24 bits and the remainder 16 and you can write special code for that.  The first consequence is the loop is only iterated 24 times rather than 32 and the divide loop becomes this:

Code: [Select]

   ; general divide loop

loop_11:

   exx
   adc a,a
   adc hl,hl
   exx
   
   adc hl,hl
   jr c, loop_03

loop_01:

   sbc hl,de
   jr nc, loop_02
   add hl,de

loop_02:

   djnz loop_11
   
   ; hl = remainder, hl'a=~quotient with one more shift left

Just above this there is a short loop that runs to eliminate leading zero bits.  This is why I'm a bit surprised that your routine is faster :)  This optimization is done all the way down to 8/8 division but there is some overhead involved in reducing the division and shuffling dividend and divisior into appropriate registers which you don't have to worry about in an asm program using a single division routine.

It's a bit hard to read but the correct answer can be found at the top of this page:
libnew:examples:pi [z88dk]
« Last Edit: 00:33, 23 November 15 by Alcoholics Anonymous »

Offline Alcoholics Anonymous

  • CPC664
  • ***
  • Posts: 51
  • Country: ca
  • Liked: 44
  • Likes Given: 0
Re: the fast division
« Reply #14 on: 23:33, 29 November 15 »
Someone pointed out this website to me which may be of interest:

Math - z80 Heaven

You have to be careful though -- some of those routines are bugged and a few are relying on undocumented instructions that only work on nmos z80s (most z80 programmers don't realize many of the undocumented instructions stopped working after the z80 was made cmos and in chips derived from the original z80).


Offline litwr

  • CPC6128
  • ****
  • Posts: 162
  • Country: ru
    • lidovski's www page
  • Liked: 131
  • Likes Given: 131
Re: the fast division
« Reply #15 on: 17:24, 30 November 15 »
However they missed 32/16 divisions.  ;D  So does this cmos z80 miss instruction like LD IXL,A? They are the part of the Amstrad CPC documentation - it is very sad.  :(

Offline Alcoholics Anonymous

  • CPC664
  • ***
  • Posts: 51
  • Country: ca
  • Liked: 44
  • Likes Given: 0
Re: the fast division
« Reply #16 on: 20:27, 30 November 15 »
However they missed 32/16 divisions.  ;D  So does this cmos z80 miss instruction like LD IXL,A? They are the part of the Amstrad CPC documentation - it is very sad.  :(

Using the IX/IY halves is ok.  These instructions are also documented on some z80 derivatives.  The specific instruction I spotted was "sll" which does not work on all cmos z80s.

Offline litwr

  • CPC6128
  • ****
  • Posts: 162
  • Country: ru
    • lidovski's www page
  • Liked: 131
  • Likes Given: 131
Re: the fast division
« Reply #17 on: 08:42, 08 January 16 »
The code for the ulrafast division.
Code: [Select]
DIV320:; BCDE = HLDE/(-BC), HL = HLDE%(-BC)
   ld a,d
   add a,a
 adc hl,hl
 add hl,bc
 jr c, $+5
 sbc hl,bc

repeat 7
   adc a,a
 adc hl,hl
 add hl,bc
 jr c, $+5
 sbc hl,bc
endm

   adc a,a
   ld d,a
   ld a,e

   add a,a
 adc hl,hl
 add hl,bc
 jr c, $+5
 sbc hl,bc

repeat 7
   adc a,a
 adc hl,hl
 add hl,bc
 jr c, $+5
 sbc hl,bc
endm

   adc a,a
   ld e,a
   ld bc,0
   ret
It works only for BC < $8000 and HL < BC but it works very fast.  BTW I am not author of this code.
« Last Edit: 08:44, 08 January 16 by litwr »