CPCWiki forum

General Category => Programming => Topic started by: litwr on 09:07, 18 October 15

Title: the fast multiplication
Post by: litwr on 09:07, 18 October 15
I'm seeking the fastest way to make a unsigned byte to byte multiplication for z80.  The amazing routine at
Z80 Assembly (http://sgate.emt.bme.hu/patai/publications/z80guide/part4.html)
takes 344-400 cycles (372 on average) and occupies 12 bytes.  Its unrolled version takes only 216-272 cycles (244 on average) and occupies only 38 bytes! It was a pleasant surprize that z80 is so fast with arithmetic.  BTW the best known similar 6502 routine
base:8bit_multiplication_16bit_product [Codebase 64 wiki] (http://codebase64.org/doku.php?id=base:8bit_multiplication_16bit_product)
takes 25-217 cycles (166.1 on average). It occupies 27 bytes. Its unrolled version takes about 148 clocks and occupies 144 bytes.  However z80 has at least twice clocks.  ;D
The precalculated tables may increase the speed.  The best 6502's algorithm
base:seriously_fast_multiplication [Codebase 64 wiki] (http://codebase64.org/doku.php?id=base:seriously_fast_multiplication)
takes 47-51 cycles (49 on average) but it uses 2048 bytes in the tables!  So I try to use only 512 bytes in the tables and I've made the next code. It uses formula
x*y = ((x+y)/2)^2 - ((x-y)/2)^2, if x+y is even
= ((x+y-1)/2)^2 - ((x-y-1)/2)^2 + y, if x+y is odd and x>=y
Code: [Select]
`    org \$9c00start    ld a,101   ;for the tests    ld b,153    call mul8x8to16    ld (rhi),a    ld a,e    ld (rlo),a    retrlo  db 0rhi  db 0    org \$9d00sqrlo ;low(x*x)    db 0,1,4,9,\$10,\$19,\$24,\$31,\$40,\$51,\$64,\$79,\$90,\$a9,\$c4,\$e1    db 0,\$21,\$44,\$69,\$90,\$b9,\$e4,\$11,\$40,\$71,\$a4,\$d9,\$10,\$49,\$84,\$c1    db 0,\$41,\$84,\$c9,\$10,\$59,\$a4,\$f1,\$40,\$91,\$e4,\$39,\$90,\$e9,\$44,\$a1    db 0,\$61,\$c4,\$29,\$90,\$f9,\$64,\$d1,\$40,\$b1,\$24,\$99,\$10,\$89,4,\$81    db 0,\$81,4,\$89,\$10,\$99,\$24,\$b1,\$40,\$d1,\$64,\$f9,\$90,\$29,\$c4,\$61    db 0,\$a1,\$44,\$e9,\$90,\$39,\$e4,\$91,\$40,\$f1,\$a4,\$59,\$10,\$c9,\$84,\$41    db 0,\$c1,\$84,\$49,\$10,\$d9,\$a4,\$71,\$40,\$11,\$e4,\$b9,\$90,\$69,\$44,\$21    db 0,\$e1,\$c4,\$a9,\$90,\$79,\$64,\$51,\$40,\$31,\$24,\$19,\$10,9,4,\$1    db 0,1,4,9,\$10,\$19,\$24,\$31,\$40,\$51,\$64,\$79,\$90,\$a9,\$c4,\$e1    db 0,\$21,\$44,\$69,\$90,\$b9,\$e4,\$11,\$40,\$71,\$a4,\$d9,\$10,\$49,\$84,\$c1    db 0,\$41,\$84,\$c9,\$10,\$59,\$a4,\$f1,\$40,\$91,\$e4,\$39,\$90,\$e9,\$44,\$a1    db 0,\$61,\$c4,\$29,\$90,\$f9,\$64,\$d1,\$40,\$b1,\$24,\$99,\$10,\$89,4,\$81    db 0,\$81,4,\$89,\$10,\$99,\$24,\$b1,\$40,\$d1,\$64,\$f9,\$90,\$29,\$c4,\$61    db 0,\$a1,\$44,\$e9,\$90,\$39,\$e4,\$91,\$40,\$f1,\$a4,\$59,\$10,\$c9,\$84,\$41    db 0,\$c1,\$84,\$49,\$10,\$d9,\$a4,\$71,\$40,\$11,\$e4,\$b9,\$90,\$69,\$44,\$21    db 0,\$e1,\$c4,\$a9,\$90,\$79,\$64,\$51,\$40,\$31,\$24,\$19,\$10,9,4,\$1sqrhi ;high(x*x)    db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0    db 1,1,1,1,1,1,1,2,2,2,2,2,3,3,3,3    db 4,4,4,4,5,5,5,5,6,6,6,7,7,7,8,8    db 9,9,9,\$a,\$a,\$a,\$b,\$b,\$c,\$c,\$d,\$d,\$e,\$e,\$f,\$f    db \$10,\$10,\$11,\$11,\$12,\$12,\$13,\$13,\$14,\$14,\$15,\$15,\$16,\$17,\$17,\$18    db \$19,\$19,\$1a,\$1a,\$1b,\$1c,\$1c,\$1d,\$1e,\$1e,\$1f,\$20,\$21,\$21,\$22,\$23    db \$24,\$24,\$25,\$26,\$27,\$27,\$28,\$29,\$2a,\$2b,\$2b,\$2c,\$2d,\$2e,\$2f,\$30    db \$31,\$31,\$32,\$33,\$34,\$35,\$36,\$37,\$38,\$39,\$3a,\$3b,\$3c,\$3d,\$3e,\$3f    db \$40,\$41,\$42,\$43,\$44,\$45,\$46,\$47,\$48,\$49,\$4a,\$4b,\$4c,\$4d,\$4e,\$4f    db \$51,\$52,\$53,\$54,\$55,\$56,\$57,\$59,\$5a,\$5b,\$5c,\$5d,\$5f,\$60,\$61,\$62    db \$64,\$65,\$66,\$67,\$69,\$6a,\$6b,\$6c,\$6e,\$6f,\$70,\$72,\$73,\$74,\$76,\$77    db \$79,\$7a,\$7b,\$7d,\$7e,\$7f,\$81,\$82,\$84,\$85,\$87,\$88,\$8a,\$8b,\$8d,\$8e    db \$90,\$91,\$93,\$94,\$96,\$97,\$99,\$9a,\$9c,\$9d,\$9f,\$a0,\$a2,\$a4,\$a5,\$a7    db \$a9,\$aa,\$ac,\$ad,\$af,\$b1,\$b2,\$b4,\$b6,\$b7,\$b9,\$bb,\$bd,\$be,\$c0,\$c2    db \$c4,\$c5,\$c7,\$c9,\$cb,\$cc,\$ce,\$d0,\$d2,\$d4,\$d5,\$d7,\$d9,\$db,\$dd,\$df    db \$e1,\$e2,\$e4,\$e6,\$e8,\$ea,\$ec,\$ee,\$f0,\$f2,\$f4,\$f6,\$f8,\$fa,\$fc,\$femul8x8to16 proc   ;IN: A, B;  OUT: A (HIGH), E (LOW); 44 bytes; 136-172 cycles on CPC6128 (154 on average)    local l1,l2    cp b    jr nc,l1    ld e,a    ld a,b    ld b,el1  ld c,a    sub b    rra    ld d,a    ld a,c    add a,b    rra    ld l,a    ld h,high(sqrlo)    ld a,(hl)    ld e,l    ld l,d    jr nc,l2    sub (hl)   ;odd    ld l,e    ld e,a    inc h       ;sets high(sqrhi)    ld a,(hl)    ld l,d    sbc a,(hl)    ld d,a    ld a,e    add a,b    ld e,a    ld a,d    adc a,0    retl2  sub (hl)   ;even    ld l,e    ld e,a    inc h    ld a,(hl)    ld l,d    sbc a,(hl)    ret    endp`The calculations takes 136-172 cycles on CPC6128 (154 on average).  Its 6502 equivalent has 67.5 clocks on average.  I'm not sure that my z80 code is perfect - I couldn't use 16-bit operations.  Any suggestions, improvements will be pleased. :-)
BTW I couldn't find fast multiplication algorithms with tables for z80...
Title: Re: the fast multiplication
Post by: Axelay on 10:29, 18 October 15
BTW I couldn't find fast multiplication algorithms with tables for z80...

Title: Re: the fast multiplication
Post by: litwr on 13:26, 18 October 15
Thank you!  I begin to edit this wiki.  I've just added information about clocks and size to the several routines.  My routine is similar to
Programming:Integer Multiplication - CPCWiki (http://cpcwiki.eu/index.php/Programming:Integer_Multiplication#Very_fast_8bit_.2A_8bit_Unsigned_with_only_1K_of_tables)
it is slower (160 vs 124 cycles), cannot naturally be converted to a macro, but it uses only 512 bytes for the tables.
Should I add my routine to the wiki page?
Title: Re: the fast multiplication
Post by: AMSDOS on 11:03, 19 October 15

So the book I have has a range of Multiplication routines depending on what one needs. It was written back in 1985/6, so they may have been improved upon.

For Simple 8-bit multiplication:

Code: [Select]
`; Unsigned 8 bit multiplication.begin          equ &7000.top          equ begin+&100.number          equ top.mult          equ top+1.prod          equ top+2          org begin          ld a,(number)          ld e,a          ld a,(mult)          ld hl,0          ld d,l          ld b,8.shift          add hl,hl          rla          jr nc,over          add hl,de.over          djnz shift          ld (prod),hl          ret`

Entry : two 8bit numbers in &7100, &7101
Result in &7102

Code: [Select]
`; Unsigned 16bit multiplication.begin          equ &7000.top          equ begin+&100.number          equ top.mult          equ top+2.prod          equ top+4          org begin          sub a          ld c,a          ld e,a          ld d,a          ld b,16          ld hl,(mult).shift          sla e          rl d          rl c          rla          add hl,hl          jr nc,over          push hl          ld hl,(number)          add hl,de          ex de,hl          pop hl          jr nc,over          inc c          jr nz,over          inc a.over          djnz shift          ld b,a          ld (prod),de          ld (prod+2),bc          ret`

Entry : two 16bit numbers in &7100, &7101 and &7102, &7103
Result in &7104, &7105

Code: [Select]
`;; Signed 16 bit multiplication.begin          equ &7000.top          equ begin+&100.number          equ top.mult          equ top+2.prod          equ top+4.flag          equ top+6          org begin          ld c,0          ld hl,(number)          bit 7,h          jr z,second          inc c          ex de,hl          ld hl,0          and a          sbc hl,de          ld (number),hl.second          ld hl,(mult)          bit 7,h          jr z,start          dec c          ex de,hl          ld hl,0          and a          sbc hl,de.start          ld de,0          ld b,16.shift          sla e          rl d          bit 7,d          jr nz,error          add hl,hl          jr nc,over          ld a,(number)          add a,e          ld e,a          ld a,(number+1)          adc a,d          ld d,a          bit 7,d          jr nz,error.over          djnz shift          ld a,c          and a          jr z,plus          ld hl,0          sbc hl,de          ex de,hl.plus          ld (prod),de          sub a.finish          ld (flag),a          ret.error          ld a,&ff          jr finish`Entry : two 16bit numbers in &7100, &7101 and &7102, &7103
Result in &7104, &7105
Title: Re: the fast multiplication
Post by: litwr on 17:13, 19 October 15
So the book I have has a range of Multiplication routines depending on what one needs. It was written back in 1985/6, so they may have been improved upon.
Amstrad wiki pages already contain the improved versions at least for unsigned multiplication.  ;) The mentioned pages contain perfect or close to perfect algorithms. I did not check the signed variants yet.

Title: Re: the fast multiplication
Post by: Alcoholics Anonymous on 22:05, 19 October 15
I'm seeking the fastest way to make a unsigned byte to byte multiplication for z80.  The amazing routine at
Z80 Assembly (http://sgate.emt.bme.hu/patai/publications/z80guide/part4.html)
takes 344-400 cycles (372 on average) and occupies 12 bytes.  Its unrolled version takes only 216-272 cycles (244 on average) and occupies only 38 bytes! It was a pleasant surprize that z80 is so fast with arithmetic.  BTW the best known similar 6502 routine
base:8bit_multiplication_16bit_product [Codebase 64 wiki] (http://codebase64.org/doku.php?id=base:8bit_multiplication_16bit_product)
takes 25-217 cycles (166.1 on average). It occupies 27 bytes. Its unrolled version takes about 148 clocks and occupies 144 bytes.  However z80 has at least twice clocks.  ;D

When you go to larger multiplies there are bigger gaps in performance.

You can see some of the small multiplication and division routines we use in z88dk here:
SourceForge.net Repository - [z88dk] Index of (http://z88dk.cvs.sourceforge.net/viewvc/z88dk/z88dk/libsrc/_DEVELOPMENT/math/integer/small/)

We go as high as 32 bit by 32 bit into a 64 bit result since sdcc (a c compiler) supports longlong types which are 64-bit integers.  All these routines keep operands in registers which makes them fast and compact.

We also do a fast integer library which can be seen here:
SourceForge.net Repository - [z88dk] Index of (http://z88dk.cvs.sourceforge.net/viewvc/z88dk/z88dk/libsrc/_DEVELOPMENT/math/integer/fast/)

The code is necessarily messy so it can be hard to follow but it does several things which can be enabled via options:

* It tries to reduce multiplications and divisions.  It does this by testing leading bytes for zero and jumps to a faster operation taking smaller arguments if it can.
* It tests for leading zero bits and runs a reduced loop if it finds them.  If step 1 above passes, operands can have up to 7 leading zero bits
* It enters the loop with values set up as if one iteration has already occurred.
* Loops can be unrolled.  But I wouldn't do this in typical code as it greatly increases code size.

Using this fast integer library we were able to run a program that computed pi almost three times faster than the same program using the small integer library (we didn't run with loop unrolling enabled as we don't regard that as a realistic option for most programs; incidentally the sdcc times are in combination with the z88dk library -- sdcc on its own is several times slower as most of its library is written in C, including 32-bit math).

libnew:examples:pi [z88dk] (http://www.z88dk.org/wiki/doku.php?id=libnew:examples:pi)

Quote
BTW I couldn't find fast multiplication algorithms with tables for z80...

The tables are too expensive.  I can see your code is not optimal so I may have a go later if someone else doesn't sooner :)

Quote
So the book I have has a range of Multiplication routines depending on what one needs. It was written back in 1985/6, so they may have been improved upon.

If it's got ix in it for 16-bit math and it's using statics, it's not fast :)  Also there's no difference between signed and unsigned multiplication unless you are checking for overflow.

The small 16x16 multiply used in z88dk is this one:
http://z88dk.cvs.sourceforge.net/viewvc/z88dk/z88dk/libsrc/_DEVELOPMENT/math/integer/small/l_small_mul_16_16x16.asm?revision=1.2&content-type=text%2Fplain (http://z88dk.cvs.sourceforge.net/viewvc/z88dk/z88dk/libsrc/_DEVELOPMENT/math/integer/small/l_small_mul_16_16x16.asm?revision=1.2&content-type=text%2Fplain)

The main difference is it's trying to reduce the multiplication to 16x8 and if it can this means the loop only has to be executed 8 times rather than 16.  This speeds up common cases quite a bit.  sdcc does something similar for 16x16 multiply.

The main loop:

Code: [Select]
`loop_0:    ; ac = 16-bit multiplicand    ; de = 16-bit multiplicand    ;  b = iterations    add hl,hl    rl c    rla    jr nc, loop_1    add hl,de loop_1:    djnz loop_0`
is a little bit less optimal than the most common 16x16s which replace "rl c; rla" with a single cycle shorter 16-bit add but this loop allows the 16x8 optimization.
Title: Re: the fast multiplication
Post by: litwr on 23:57, 19 October 15
Quote
We go as high as 32 bit by 32 bit into a 64 bit result
Sorry, I missed this. 64-bit arithmetic with z80 - It is really astonishing.  :o The pi computation...  :picard:
Quote
I can see your code is not optimal so I may have a go later if someone else doesn't sooner
Thank you in advance.  :)  However do you know the published code which is faster than mine and using only 512 bytes in the tables?  ;)
Title: Re: the fast multiplication
Post by: Urusergi on 02:55, 22 October 15
Code: [Select]
`l1  srl a    ld d,a    ld a,c    add a,b    rr a    ld l,a    ld h,high(sqrlo)    ld a,(hl)    ld e,l    ld l,d    jr nc,l2`
If I remember correctly "rra" is smaller and faster than "rr a"  ;)

and in this case I think "srl a" can be replaced with "rra" safely

Title: Re: the fast multiplication
Post by: Executioner on 04:25, 22 October 15
I've just added information about clocks and size to the several routines.

Those cycles don't appear to relate to the CPC timings for the routines. Most of the routines I've included there have microsecond timing in-line with the code. If you're going to put the T-States, please label it as T-States (non-CPC timing).
Title: Re: the fast multiplication
Post by: Fessor on 11:53, 22 October 15
Code: [Select]
`l1  srl a    ld d,a    ld a,c    add a,b    rr a    ld l,a    ld h,high(sqrlo)    ld a,(hl)    ld e,l    ld l,d    jr nc,l2`
If I remember correctly "rra" is smaller and faster than "rr a"  ;)

and in this case I think "srl a" can be replaced with "rra" safely

beware of the carry flag as its rotates into the 7th bit...
Title: Re: the fast multiplication
Post by: Urusergi on 17:31, 22 October 15
beware of the carry flag as its rotates into the 7th bit...

Yes, you are right! the instruction "neg" affects the carry flag (is "high" most of the time)

a possible solution could be:
Code: [Select]
`      CP B      JR NC,L1      LD E,A      LD A,B      LD B,E      ;SWAP A<->BL1    LD C,A      SUB B      RRA      LD D,A      LD A,C      ADD A,B      RRA      ....`[/code]
Title: Re: the fast multiplication
Post by: litwr on 23:19, 22 October 15
Quote
If I remember correctly "rra" is smaller and faster than "rr a"  ;)
Yes, it looks like my typo - it is fixed.  So the routine takes one byte and 4 cycles less.  ;D Thanks.  :)
Quote
a possible solution could be:
It is two bytes shorter and takes 4 cycles less for the case A<B and the same cycles if A>=B.  So it gives 2 cycles less on average!  Thanks again!  :)
The routine occupies only 44 bytes and executes during 136-172 cycles (154 on average) now!  :) I've just fixed the code in the beginning of this topic.  Is it ready for CPCwiki?  ;)
Quote
Those cycles don't appear to relate to the CPC timings for the routines. Most of the routines I've included there have microsecond timing in-line with the code. If you're going to put the T-States, please label it as T-States (non-CPC timing).
Sorry, maybe I do not understand something.  :( IMHO CPC has 4MHz CPU and every instruction takes 4*n cycles = n microseconds.   I didn't use microseconds because it is specific for CPC only.  For example, 4 CPU cycles (T-states) of z80 corresponds to 4 CPU cycles of 8088, 6502, ... My point is to provide data which are easier to compare with other architectures.  If you want you may change units.  I only hope that with any units is better than total absence of data.
Title: Re: the fast multiplication
Post by: andycadley on 23:35, 22 October 15
Sorry, maybe I do not understand something.  :( IMHO CPC has 4MHz CPU and every instruction takes 4*n cycles = n microseconds.   I didn't use microseconds because it is specific for CPC only.  For example, 4 CPU cycles (T-states) of z80 corresponds to 4 CPU cycles of 8088, 6502, ... My point is to provide data which are easier to compare with other architectures.  If you want you may change units.  I only hope that with any units is better than total absence of data.

The problem is that T-states are all but useless as a measure of timing on the CPC, because of the way it's contention model stretches out certain instructions (so, for example, instructions that officially take 6 or 7 T-states may actually last 8us). This can alter the timing such that a routine that is "optimal" based on generic Z80 timings is sub-optimal on the CPC and vice versa.

Because all instructions end up as multiples of 4 T-states, it's more common on the CPC to measure timings directly in us (or NOP cycles as they're commonly known)
Title: Re: the fast multiplication
Post by: litwr on 10:55, 24 October 15
I agree T-states is too z80 internal so it is better to use the system cycles which is given by the system quartz.  For CPC/PCW 4 cycles = 1 usec.  I've added timing to multiplication wiki-page.  I've also added my multiplication routine.  ;)
Title: Re: the fast multiplication
Post by: TFM on 20:20, 27 October 15
On the CPC everything is a multiple of 1 us. So yes, let's just use microseconds and it's all good.  :)
Title: Re: the fast multiplication
Post by: Ast on 23:01, 28 October 15
I'm seeking the fastest way to make a unsigned byte to byte multiplication for z80.  The amazing routine at
Z80 Assembly (http://sgate.emt.bme.hu/patai/publications/z80guide/part4.html)
takes 344-400 cycles (372 on average) and occupies 12 bytes.  Its unrolled version takes only 216-272 cycles (244 on average) and occupies only 38 bytes! It was a pleasant surprize that z80 is so fast with arithmetic.  BTW the best known similar 6502 routine
base:8bit_multiplication_16bit_product [Codebase 64 wiki] (http://codebase64.org/doku.php?id=base:8bit_multiplication_16bit_product)
takes 25-217 cycles (166.1 on average). It occupies 27 bytes. Its unrolled version takes about 148 clocks and occupies 144 bytes.  However z80 has at least twice clocks.  ;D
The precalculated tables may increase the speed.  The best 6502's algorithm
base:seriously_fast_multiplication [Codebase 64 wiki] (http://codebase64.org/doku.php?id=base:seriously_fast_multiplication)
takes 47-51 cycles (49 on average) but it uses 2048 bytes in the tables!  So I try to use only 512 bytes in the tables and I've made the next code. It uses formula
x*y = ((x+y)/2)^2 - ((x-y)/2)^2, if x+y is even
= ((x+y-1)/2)^2 - ((x-y-1)/2)^2 + y, if x+y is odd and x>=y
Code: [Select]
`    org \$9c00start    ld a,101   ;for the tests    ld b,153    call mul8x8to16    ld (rhi),a    ld a,e    ld (rlo),a    retrlo  db 0rhi  db 0    org \$9d00sqrlo ;low(x*x)    db 0,1,4,9,\$10,\$19,\$24,\$31,\$40,\$51,\$64,\$79,\$90,\$a9,\$c4,\$e1    db 0,\$21,\$44,\$69,\$90,\$b9,\$e4,\$11,\$40,\$71,\$a4,\$d9,\$10,\$49,\$84,\$c1    db 0,\$41,\$84,\$c9,\$10,\$59,\$a4,\$f1,\$40,\$91,\$e4,\$39,\$90,\$e9,\$44,\$a1    db 0,\$61,\$c4,\$29,\$90,\$f9,\$64,\$d1,\$40,\$b1,\$24,\$99,\$10,\$89,4,\$81    db 0,\$81,4,\$89,\$10,\$99,\$24,\$b1,\$40,\$d1,\$64,\$f9,\$90,\$29,\$c4,\$61    db 0,\$a1,\$44,\$e9,\$90,\$39,\$e4,\$91,\$40,\$f1,\$a4,\$59,\$10,\$c9,\$84,\$41    db 0,\$c1,\$84,\$49,\$10,\$d9,\$a4,\$71,\$40,\$11,\$e4,\$b9,\$90,\$69,\$44,\$21    db 0,\$e1,\$c4,\$a9,\$90,\$79,\$64,\$51,\$40,\$31,\$24,\$19,\$10,9,4,\$1    db 0,1,4,9,\$10,\$19,\$24,\$31,\$40,\$51,\$64,\$79,\$90,\$a9,\$c4,\$e1    db 0,\$21,\$44,\$69,\$90,\$b9,\$e4,\$11,\$40,\$71,\$a4,\$d9,\$10,\$49,\$84,\$c1    db 0,\$41,\$84,\$c9,\$10,\$59,\$a4,\$f1,\$40,\$91,\$e4,\$39,\$90,\$e9,\$44,\$a1    db 0,\$61,\$c4,\$29,\$90,\$f9,\$64,\$d1,\$40,\$b1,\$24,\$99,\$10,\$89,4,\$81    db 0,\$81,4,\$89,\$10,\$99,\$24,\$b1,\$40,\$d1,\$64,\$f9,\$90,\$29,\$c4,\$61    db 0,\$a1,\$44,\$e9,\$90,\$39,\$e4,\$91,\$40,\$f1,\$a4,\$59,\$10,\$c9,\$84,\$41    db 0,\$c1,\$84,\$49,\$10,\$d9,\$a4,\$71,\$40,\$11,\$e4,\$b9,\$90,\$69,\$44,\$21    db 0,\$e1,\$c4,\$a9,\$90,\$79,\$64,\$51,\$40,\$31,\$24,\$19,\$10,9,4,\$1sqrhi ;high(x*x)    db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0    db 1,1,1,1,1,1,1,2,2,2,2,2,3,3,3,3    db 4,4,4,4,5,5,5,5,6,6,6,7,7,7,8,8    db 9,9,9,\$a,\$a,\$a,\$b,\$b,\$c,\$c,\$d,\$d,\$e,\$e,\$f,\$f    db \$10,\$10,\$11,\$11,\$12,\$12,\$13,\$13,\$14,\$14,\$15,\$15,\$16,\$17,\$17,\$18    db \$19,\$19,\$1a,\$1a,\$1b,\$1c,\$1c,\$1d,\$1e,\$1e,\$1f,\$20,\$21,\$21,\$22,\$23    db \$24,\$24,\$25,\$26,\$27,\$27,\$28,\$29,\$2a,\$2b,\$2b,\$2c,\$2d,\$2e,\$2f,\$30    db \$31,\$31,\$32,\$33,\$34,\$35,\$36,\$37,\$38,\$39,\$3a,\$3b,\$3c,\$3d,\$3e,\$3f    db \$40,\$41,\$42,\$43,\$44,\$45,\$46,\$47,\$48,\$49,\$4a,\$4b,\$4c,\$4d,\$4e,\$4f    db \$51,\$52,\$53,\$54,\$55,\$56,\$57,\$59,\$5a,\$5b,\$5c,\$5d,\$5f,\$60,\$61,\$62    db \$64,\$65,\$66,\$67,\$69,\$6a,\$6b,\$6c,\$6e,\$6f,\$70,\$72,\$73,\$74,\$76,\$77    db \$79,\$7a,\$7b,\$7d,\$7e,\$7f,\$81,\$82,\$84,\$85,\$87,\$88,\$8a,\$8b,\$8d,\$8e    db \$90,\$91,\$93,\$94,\$96,\$97,\$99,\$9a,\$9c,\$9d,\$9f,\$a0,\$a2,\$a4,\$a5,\$a7    db \$a9,\$aa,\$ac,\$ad,\$af,\$b1,\$b2,\$b4,\$b6,\$b7,\$b9,\$bb,\$bd,\$be,\$c0,\$c2    db \$c4,\$c5,\$c7,\$c9,\$cb,\$cc,\$ce,\$d0,\$d2,\$d4,\$d5,\$d7,\$d9,\$db,\$dd,\$df    db \$e1,\$e2,\$e4,\$e6,\$e8,\$ea,\$ec,\$ee,\$f0,\$f2,\$f4,\$f6,\$f8,\$fa,\$fc,\$femul8x8to16 proc   ;IN: A, B;  OUT: A (HIGH), E (LOW); 44 bytes; 136-172 cycles on CPC6128 (154 on average)    local l1,l2    cp b    jr nc,l1    ld e,a    ld a,b    ld b,el1  ld c,a    sub b    rra    ld d,a    ld a,c    add a,b    rra    ld l,a    ld h,high(sqrlo)    ld a,(hl)    ld e,l    ld l,d    jr nc,l2    sub (hl)   ;odd    ld l,e    ld e,a    inc h       ;sets high(sqrhi)    ld a,(hl)    ld l,d    sbc a,(hl)    ld d,a    ld a,e    add a,b    ld e,a    ld a,d    adc a,0    retl2  sub (hl)   ;even    ld l,e    ld e,a    inc h    ld a,(hl)    ld l,d    sbc a,(hl)    ret    endp`The calculations takes 136-172 cycles on CPC6128 (154 on average).  Its 6502 equivalent has 67.5 clocks on average.  I'm not sure that my z80 code is perfect - I couldn't use 16-bit operations.  Any suggestions, improvements will be pleased. :-)
BTW I couldn't find fast multiplication algorithms with tables for z80...
I like your work. Well done!
Title: Re: the fast multiplication
Post by: litwr on 20:20, 13 November 15
Thanks.  :)
I am starting to seek a way to the fastest possible 16x16=32 multiplication.  I am using Programming:Integer Multiplication - CPCWiki (http://cpcwiki.eu/index.php/Programming:Integer_Multiplication#Fastest.2C_accurate_8bit_.2A_8bit_Unsigned_with_16_KB_tables) as the base.  So I can get
Code: [Select]
`product dw 0,0mul16x16 proc    ;ix,iy  -> a,hl,product         ld c,ixl         ld a,iyl         ld l,a         mul8x8   ;l*c -> de, uses bc,hl,a         ld (product),de         ld c,ixh         ld a,iyh         ld l,a         mul8x8         ld (product+2),de         ld c,ixh         ld a,iyl         ld l,a         mul8x8         ld hl,(product+1)         add hl,de         push hl         ld a,(product+3)         adc a,0         ld ixh,a         ld c,ixl         ld a,iyh         ld l,a         mul8x8         pop hl         xor a         add hl,de         adc a,ixh         ret         endp`It takes 548 cycles or 137 usec plus RET.  Could anybody try to help to find something better?  ;) BTW CPCwiki missed 16x16->32 multiplication.  :(