Print Page - the fast multiplication

Title: the fast multiplication
Post by: litwr on 07:07, 18 October 15

I'm seeking the fastest way to make a unsigned byte to byte multiplication for z80. The amazing routine at
Z80 Assembly (http://sgate.emt.bme.hu/patai/publications/z80guide/part4.html)
takes 344-400 cycles (372 on average) and occupies 12 bytes. Its unrolled version takes only 216-272 cycles (244 on average) and occupies only 38 bytes! It was a pleasant surprize that z80 is so fast with arithmetic. BTW the best known similar 6502 routine
base:8bit_multiplication_16bit_product [Codebase 64 wiki] (http://codebase64.org/doku.php?id=base:8bit_multiplication_16bit_product)
takes 25-217 cycles (166.1 on average). It occupies 27 bytes. Its unrolled version takes about 148 clocks and occupies 144 bytes. However z80 has at least twice clocks. ;D
The precalculated tables may increase the speed. The best 6502's algorithm
base:seriously_fast_multiplication [Codebase 64 wiki] (http://codebase64.org/doku.php?id=base:seriously_fast_multiplication)
takes 47-51 cycles (49 on average) but it uses 2048 bytes in the tables! So I try to use only 512 bytes in the tables and I've made the next code. It uses formula
x*y = ((x+y)/2)^2 - ((x-y)/2)^2, if x+y is even
= ((x+y-1)/2)^2 - ((x-y-1)/2)^2 + y, if x+y is odd and x>=y

Code Select


    org $9c00

start
    ld a,101   ;for the tests
    ld b,153
    call mul8x8to16
    ld (rhi),a
    ld a,e
    ld (rlo),a
    ret

rlo  db 0
rhi  db 0

    org $9d00
sqrlo ;low(x*x)
    db 0,1,4,9,$10,$19,$24,$31,$40,$51,$64,$79,$90,$a9,$c4,$e1
    db 0,$21,$44,$69,$90,$b9,$e4,$11,$40,$71,$a4,$d9,$10,$49,$84,$c1
    db 0,$41,$84,$c9,$10,$59,$a4,$f1,$40,$91,$e4,$39,$90,$e9,$44,$a1
    db 0,$61,$c4,$29,$90,$f9,$64,$d1,$40,$b1,$24,$99,$10,$89,4,$81
    db 0,$81,4,$89,$10,$99,$24,$b1,$40,$d1,$64,$f9,$90,$29,$c4,$61
    db 0,$a1,$44,$e9,$90,$39,$e4,$91,$40,$f1,$a4,$59,$10,$c9,$84,$41
    db 0,$c1,$84,$49,$10,$d9,$a4,$71,$40,$11,$e4,$b9,$90,$69,$44,$21
    db 0,$e1,$c4,$a9,$90,$79,$64,$51,$40,$31,$24,$19,$10,9,4,$1
    db 0,1,4,9,$10,$19,$24,$31,$40,$51,$64,$79,$90,$a9,$c4,$e1
    db 0,$21,$44,$69,$90,$b9,$e4,$11,$40,$71,$a4,$d9,$10,$49,$84,$c1
    db 0,$41,$84,$c9,$10,$59,$a4,$f1,$40,$91,$e4,$39,$90,$e9,$44,$a1
    db 0,$61,$c4,$29,$90,$f9,$64,$d1,$40,$b1,$24,$99,$10,$89,4,$81
    db 0,$81,4,$89,$10,$99,$24,$b1,$40,$d1,$64,$f9,$90,$29,$c4,$61
    db 0,$a1,$44,$e9,$90,$39,$e4,$91,$40,$f1,$a4,$59,$10,$c9,$84,$41
    db 0,$c1,$84,$49,$10,$d9,$a4,$71,$40,$11,$e4,$b9,$90,$69,$44,$21
    db 0,$e1,$c4,$a9,$90,$79,$64,$51,$40,$31,$24,$19,$10,9,4,$1
sqrhi ;high(x*x)
    db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
    db 1,1,1,1,1,1,1,2,2,2,2,2,3,3,3,3
    db 4,4,4,4,5,5,5,5,6,6,6,7,7,7,8,8
    db 9,9,9,$a,$a,$a,$b,$b,$c,$c,$d,$d,$e,$e,$f,$f
    db $10,$10,$11,$11,$12,$12,$13,$13,$14,$14,$15,$15,$16,$17,$17,$18
    db $19,$19,$1a,$1a,$1b,$1c,$1c,$1d,$1e,$1e,$1f,$20,$21,$21,$22,$23
    db $24,$24,$25,$26,$27,$27,$28,$29,$2a,$2b,$2b,$2c,$2d,$2e,$2f,$30
    db $31,$31,$32,$33,$34,$35,$36,$37,$38,$39,$3a,$3b,$3c,$3d,$3e,$3f
    db $40,$41,$42,$43,$44,$45,$46,$47,$48,$49,$4a,$4b,$4c,$4d,$4e,$4f
    db $51,$52,$53,$54,$55,$56,$57,$59,$5a,$5b,$5c,$5d,$5f,$60,$61,$62
    db $64,$65,$66,$67,$69,$6a,$6b,$6c,$6e,$6f,$70,$72,$73,$74,$76,$77
    db $79,$7a,$7b,$7d,$7e,$7f,$81,$82,$84,$85,$87,$88,$8a,$8b,$8d,$8e
    db $90,$91,$93,$94,$96,$97,$99,$9a,$9c,$9d,$9f,$a0,$a2,$a4,$a5,$a7
    db $a9,$aa,$ac,$ad,$af,$b1,$b2,$b4,$b6,$b7,$b9,$bb,$bd,$be,$c0,$c2
    db $c4,$c5,$c7,$c9,$cb,$cc,$ce,$d0,$d2,$d4,$d5,$d7,$d9,$db,$dd,$df
    db $e1,$e2,$e4,$e6,$e8,$ea,$ec,$ee,$f0,$f2,$f4,$f6,$f8,$fa,$fc,$fe

mul8x8to16 proc   ;IN: A, B;  OUT: A (HIGH), E (LOW); 44 bytes; 136-172 cycles on CPC6128 (154 on average)
    local l1,l2

    cp b
    jr nc,l1


    ld e,a
    ld a,b
    ld b,e


l1  ld c,a
    sub b
    rra    ld d,a
    ld a,c
    add a,b
    rra
    ld l,a
    ld h,high(sqrlo)
    ld a,(hl)
    ld e,l
    ld l,d
    jr nc,l2

    sub (hl)   ;odd
    ld l,e
    ld e,a
    inc h       ;sets high(sqrhi)
    ld a,(hl)
    ld l,d
    sbc a,(hl)
    ld d,a

    ld a,e
    add a,b
    ld e,a
    ld a,d
    adc a,0
    ret

l2  sub (hl)   ;even
    ld l,e
    ld e,a
    inc h
    ld a,(hl)
    ld l,d
    sbc a,(hl)
    ret
    endp

The calculations takes 136-172 cycles on CPC6128 (154 on average). Its 6502 equivalent has 67.5 clocks on average. I'm not sure that my z80 code is perfect - I couldn't use 16-bit operations. Any suggestions, improvements will be pleased. :-)
BTW I couldn't find fast multiplication algorithms with tables for z80...

Title: Re: the fast multiplication
Post by: Axelay on 08:29, 18 October 15

Quote from: litwr on 07:07, 18 October 15
BTW I couldn't find fast multiplication algorithms with tables for z80...

Have you checked out this page (http://cpcwiki.eu/index.php/Programming:Integer_Multiplication)?

Title: Re: the fast multiplication
Post by: litwr on 11:26, 18 October 15

Thank you! I begin to edit this wiki. I've just added information about clocks and size to the several routines. My routine is similar to
Programming:Integer Multiplication - CPCWiki (http://cpcwiki.eu/index.php/Programming:Integer_Multiplication#Very_fast_8bit_.2A_8bit_Unsigned_with_only_1K_of_tables)
it is slower (160 vs 124 cycles), cannot naturally be converted to a macro, but it uses only 512 bytes for the tables.
Should I add my routine to the wiki page?

Title: Re: the fast multiplication
Post by: AMSDOS on 09:03, 19 October 15

So the book I have has a range of Multiplication routines depending on what one needs. It was written back in 1985/6, so they may have been improved upon.

For Simple 8-bit multiplication:

Code Select


; Unsigned 8 bit multiplication


.begin
          equ &7000
.top
          equ begin+&100
.number
          equ top
.mult
          equ top+1
.prod
          equ top+2
          org begin
          ld a,(number)
          ld e,a
          ld a,(mult)
          ld hl,0
          ld d,l
          ld b,8
.shift
          add hl,hl
          rla
          jr nc,over
          add hl,de
.over
          djnz shift
          ld (prod),hl
          ret

Entry : two 8bit numbers in &7100, &7101
Result in &7102

Code Select


; Unsigned 16bit multiplication


.begin
          equ &7000
.top
          equ begin+&100
.number
          equ top
.mult
          equ top+2
.prod
          equ top+4
          org begin
          sub a
          ld c,a
          ld e,a
          ld d,a
          ld b,16
          ld hl,(mult)
.shift
          sla e
          rl d
          rl c
          rla
          add hl,hl
          jr nc,over
          push hl
          ld hl,(number)
          add hl,de
          ex de,hl
          pop hl
          jr nc,over
          inc c
          jr nz,over
          inc a
.over
          djnz shift
          ld b,a
          ld (prod),de
          ld (prod+2),bc
          ret

Entry : two 16bit numbers in &7100, &7101 and &7102, &7103
Result in &7104, &7105

Code Select


;; Signed 16 bit multiplication


.begin
          equ &7000
.top
          equ begin+&100
.number
          equ top
.mult
          equ top+2
.prod
          equ top+4
.flag
          equ top+6
          org begin
          ld c,0
          ld hl,(number)
          bit 7,h
          jr z,second
          inc c
          ex de,hl
          ld hl,0
          and a
          sbc hl,de
          ld (number),hl
.second
          ld hl,(mult)
          bit 7,h
          jr z,start
          dec c
          ex de,hl
          ld hl,0
          and a
          sbc hl,de
.start
          ld de,0
          ld b,16
.shift
          sla e
          rl d
          bit 7,d
          jr nz,error
          add hl,hl
          jr nc,over
          ld a,(number)
          add a,e
          ld e,a
          ld a,(number+1)
          adc a,d
          ld d,a
          bit 7,d
          jr nz,error
.over
          djnz shift
          ld a,c
          and a
          jr z,plus
          ld hl,0
          sbc hl,de
          ex de,hl
.plus
          ld (prod),de
          sub a
.finish
          ld (flag),a
          ret
.error
          ld a,&ff
          jr finish

Entry : two 16bit numbers in &7100, &7101 and &7102, &7103
Result in &7104, &7105

Title: Re: the fast multiplication
Post by: litwr on 15:13, 19 October 15

Quote from: AMSDOS on 09:03, 19 October 15
So the book I have has a range of Multiplication routines depending on what one needs. It was written back in 1985/6, so they may have been improved upon.

Amstrad wiki pages already contain the improved versions at least for unsigned multiplication. ;) The mentioned pages contain perfect or close to perfect algorithms. I did not check the signed variants yet.

Title: Re: the fast multiplication
Post by: Alcoholics Anonymous on 20:05, 19 October 15

Quote from: litwr on 07:07, 18 October 15
I'm seeking the fastest way to make a unsigned byte to byte multiplication for z80. The amazing routine at
Z80 Assembly (http://sgate.emt.bme.hu/patai/publications/z80guide/part4.html)
takes 344-400 cycles (372 on average) and occupies 12 bytes. Its unrolled version takes only 216-272 cycles (244 on average) and occupies only 38 bytes! It was a pleasant surprize that z80 is so fast with arithmetic. BTW the best known similar 6502 routine
base:8bit_multiplication_16bit_product [Codebase 64 wiki] (http://codebase64.org/doku.php?id=base:8bit_multiplication_16bit_product)
takes 25-217 cycles (166.1 on average). It occupies 27 bytes. Its unrolled version takes about 148 clocks and occupies 144 bytes. However z80 has at least twice clocks. ;D

When you go to larger multiplies there are bigger gaps in performance.

You can see some of the small multiplication and division routines we use in z88dk here:
SourceForge.net Repository - [z88dk] Index of (http://z88dk.cvs.sourceforge.net/viewvc/z88dk/z88dk/libsrc/_DEVELOPMENT/math/integer/small/)

We go as high as 32 bit by 32 bit into a 64 bit result since sdcc (a c compiler) supports longlong types which are 64-bit integers. All these routines keep operands in registers which makes them fast and compact.

We also do a fast integer library which can be seen here:
SourceForge.net Repository - [z88dk] Index of (http://z88dk.cvs.sourceforge.net/viewvc/z88dk/z88dk/libsrc/_DEVELOPMENT/math/integer/fast/)

The code is necessarily messy so it can be hard to follow but it does several things which can be enabled via options:

* It tries to reduce multiplications and divisions. It does this by testing leading bytes for zero and jumps to a faster operation taking smaller arguments if it can.
* It tests for leading zero bits and runs a reduced loop if it finds them. If step 1 above passes, operands can have up to 7 leading zero bits
* It enters the loop with values set up as if one iteration has already occurred.
* Loops can be unrolled. But I wouldn't do this in typical code as it greatly increases code size.

Using this fast integer library we were able to run a program that computed pi almost three times faster than the same program using the small integer library (we didn't run with loop unrolling enabled as we don't regard that as a realistic option for most programs; incidentally the sdcc times are in combination with the z88dk library -- sdcc on its own is several times slower as most of its library is written in C, including 32-bit math).

libnew:examples:pi [z88dk] (http://www.z88dk.org/wiki/doku.php?id=libnew:examples:pi)

Quote
BTW I couldn't find fast multiplication algorithms with tables for z80...

The tables are too expensive. I can see your code is not optimal so I may have a go later if someone else doesn't sooner :)

Quote
So the book I have has a range of Multiplication routines depending on what one needs. It was written back in 1985/6, so they may have been improved upon.

If it's got ix in it for 16-bit math and it's using statics, it's not fast :) Also there's no difference between signed and unsigned multiplication unless you are checking for overflow.

The small 16x16 multiply used in z88dk is this one:
http://z88dk.cvs.sourceforge.net/viewvc/z88dk/z88dk/libsrc/_DEVELOPMENT/math/integer/small/l_small_mul_16_16x16.asm?revision=1.2&content-type=text%2Fplain (http://z88dk.cvs.sourceforge.net/viewvc/z88dk/z88dk/libsrc/_DEVELOPMENT/math/integer/small/l_small_mul_16_16x16.asm?revision=1.2&content-type=text%2Fplain)

The main difference is it's trying to reduce the multiplication to 16x8 and if it can this means the loop only has to be executed 8 times rather than 16. This speeds up common cases quite a bit. sdcc does something similar for 16x16 multiply.

The main loop:

Code Select


loop_0: 

   ; ac = 16-bit multiplicand 
   ; de = 16-bit multiplicand 
   ;  b = iterations 

   add hl,hl 
   rl c 
   rla 
   jr nc, loop_1 
   add hl,de 

loop_1: 

   djnz loop_0

is a little bit less optimal than the most common 16x16s which replace "rl c; rla" with a single cycle shorter 16-bit add but this loop allows the 16x8 optimization.

Title: Re: the fast multiplication
Post by: litwr on 21:57, 19 October 15

QuoteWe go as high as 32 bit by 32 bit into a 64 bit result

Sorry, I missed this. 64-bit arithmetic with z80 - It is really astonishing. :o The pi computation... :picard:

QuoteI can see your code is not optimal so I may have a go later if someone else doesn't sooner

Thank you in advance. :) However do you know the published code which is faster than mine and using only 512 bytes in the tables? ;)

Title: Re: the fast multiplication
Post by: Urusergi on 00:55, 22 October 15

Code Select


l1  srl a
    ld d,a
    ld a,c
    add a,b
    rr a
    ld l,a
    ld h,high(sqrlo)
    ld a,(hl)
    ld e,l
    ld l,d
    jr nc,l2

If I remember correctly "rra" is smaller and faster than "rr a" ;)

and in this case I think "srl a" can be replaced with "rra" safely

Title: Re: the fast multiplication
Post by: Executioner on 02:25, 22 October 15

Quote from: litwr on 11:26, 18 October 15
I've just added information about clocks and size to the several routines.

Those cycles don't appear to relate to the CPC timings for the routines. Most of the routines I've included there have microsecond timing in-line with the code. If you're going to put the T-States, please label it as T-States (non-CPC timing).

Title: Re: the fast multiplication
Post by: Fessor on 09:53, 22 October 15

Quote from: Urusergi on 00:55, 22 October 15
Code Select Expand
l1 srl a ld d,a ld a,c add a,b rr a ld l,a ld h,high(sqrlo) ld a,(hl) ld e,l ld l,d jr nc,l2

If I remember correctly "rra" is smaller and faster than "rr a" ;)

and in this case I think "srl a" can be replaced with "rra" safely

beware of the carry flag as its rotates into the 7th bit...

Title: Re: the fast multiplication
Post by: Urusergi on 15:31, 22 October 15

Quote from: Fessor on 09:53, 22 October 15
beware of the carry flag as its rotates into the 7th bit...

Yes, you are right! the instruction "neg" affects the carry flag (is "high" most of the time)

a possible solution could be:

Code Select


      CP B
      JR NC,L1

      LD E,A
      LD A,B
      LD B,E      ;SWAP A<->B

L1    LD C,A
      SUB B
      RRA
      LD D,A
      LD A,C
      ADD A,B
      RRA
      ....

[/code]

Title: Re: the fast multiplication
Post by: litwr on 21:19, 22 October 15

QuoteIf I remember correctly "rra" is smaller and faster than "rr a" ;)

Yes, it looks like my typo - it is fixed. So the routine takes one byte and 4 cycles less. ;D Thanks. :)

Quotea possible solution could be:

It is two bytes shorter and takes 4 cycles less for the case A<B and the same cycles if A>=B. So it gives 2 cycles less on average! Thanks again! :)
The routine occupies only 44 bytes and executes during 136-172 cycles (154 on average) now! :) I've just fixed the code in the beginning of this topic. Is it ready for CPCwiki? ;)

QuoteThose cycles don't appear to relate to the CPC timings for the routines. Most of the routines I've included there have microsecond timing in-line with the code. If you're going to put the T-States, please label it as T-States (non-CPC timing).

Sorry, maybe I do not understand something. :( IMHO CPC has 4MHz CPU and every instruction takes 4*n cycles = n microseconds. I didn't use microseconds because it is specific for CPC only. For example, 4 CPU cycles (T-states) of z80 corresponds to 4 CPU cycles of 8088, 6502, ... My point is to provide data which are easier to compare with other architectures. If you want you may change units. I only hope that with any units is better than total absence of data.

Title: Re: the fast multiplication
Post by: andycadley on 21:35, 22 October 15

Quote from: litwr on 21:19, 22 October 15
Sorry, maybe I do not understand something. :( IMHO CPC has 4MHz CPU and every instruction takes 4*n cycles = n microseconds. I didn't use microseconds because it is specific for CPC only. For example, 4 CPU cycles (T-states) of z80 corresponds to 4 CPU cycles of 8088, 6502, ... My point is to provide data which are easier to compare with other architectures. If you want you may change units. I only hope that with any units is better than total absence of data.

The problem is that T-states are all but useless as a measure of timing on the CPC, because of the way it's contention model stretches out certain instructions (so, for example, instructions that officially take 6 or 7 T-states may actually last 8us). This can alter the timing such that a routine that is "optimal" based on generic Z80 timings is sub-optimal on the CPC and vice versa.

Because all instructions end up as multiples of 4 T-states, it's more common on the CPC to measure timings directly in us (or NOP cycles as they're commonly known)

Title: Re: the fast multiplication
Post by: litwr on 08:55, 24 October 15

I agree T-states is too z80 internal so it is better to use the system cycles which is given by the system quartz. For CPC/PCW 4 cycles = 1 usec. I've added timing to multiplication wiki-page. I've also added my multiplication routine. ;)

Title: Re: the fast multiplication
Post by: TFM on 19:20, 27 October 15

On the CPC everything is a multiple of 1 us. So yes, let's just use microseconds and it's all good. :)

Title: Re: the fast multiplication
Post by: Ast on 22:01, 28 October 15

Quote from: litwr on 07:07, 18 October 15
I'm seeking the fastest way to make a unsigned byte to byte multiplication for z80. The amazing routine at
Z80 Assembly (http://sgate.emt.bme.hu/patai/publications/z80guide/part4.html)
takes 344-400 cycles (372 on average) and occupies 12 bytes. Its unrolled version takes only 216-272 cycles (244 on average) and occupies only 38 bytes! It was a pleasant surprize that z80 is so fast with arithmetic. BTW the best known similar 6502 routine
base:8bit_multiplication_16bit_product [Codebase 64 wiki] (http://codebase64.org/doku.php?id=base:8bit_multiplication_16bit_product)
takes 25-217 cycles (166.1 on average). It occupies 27 bytes. Its unrolled version takes about 148 clocks and occupies 144 bytes. However z80 has at least twice clocks. ;D
The precalculated tables may increase the speed. The best 6502's algorithm
base:seriously_fast_multiplication [Codebase 64 wiki] (http://codebase64.org/doku.php?id=base:seriously_fast_multiplication)
takes 47-51 cycles (49 on average) but it uses 2048 bytes in the tables! So I try to use only 512 bytes in the tables and I've made the next code. It uses formula
x*y = ((x+y)/2)^2 - ((x-y)/2)^2, if x+y is even
= ((x+y-1)/2)^2 - ((x-y-1)/2)^2 + y, if x+y is odd and x>=y
Code Select Expand
org $9c00 start ld a,101 ;for the tests ld b,153 call mul8x8to16 ld (rhi),a ld a,e ld (rlo),a ret rlo db 0 rhi db 0 org $9d00 sqrlo ;low(x*x) db 0,1,4,9,$10,$19,$24,$31,$40,$51,$64,$79,$90,$a9,$c4,$e1 db 0,$21,$44,$69,$90,$b9,$e4,$11,$40,$71,$a4,$d9,$10,$49,$84,$c1 db 0,$41,$84,$c9,$10,$59,$a4,$f1,$40,$91,$e4,$39,$90,$e9,$44,$a1 db 0,$61,$c4,$29,$90,$f9,$64,$d1,$40,$b1,$24,$99,$10,$89,4,$81 db 0,$81,4,$89,$10,$99,$24,$b1,$40,$d1,$64,$f9,$90,$29,$c4,$61 db 0,$a1,$44,$e9,$90,$39,$e4,$91,$40,$f1,$a4,$59,$10,$c9,$84,$41 db 0,$c1,$84,$49,$10,$d9,$a4,$71,$40,$11,$e4,$b9,$90,$69,$44,$21 db 0,$e1,$c4,$a9,$90,$79,$64,$51,$40,$31,$24,$19,$10,9,4,$1 db 0,1,4,9,$10,$19,$24,$31,$40,$51,$64,$79,$90,$a9,$c4,$e1 db 0,$21,$44,$69,$90,$b9,$e4,$11,$40,$71,$a4,$d9,$10,$49,$84,$c1 db 0,$41,$84,$c9,$10,$59,$a4,$f1,$40,$91,$e4,$39,$90,$e9,$44,$a1 db 0,$61,$c4,$29,$90,$f9,$64,$d1,$40,$b1,$24,$99,$10,$89,4,$81 db 0,$81,4,$89,$10,$99,$24,$b1,$40,$d1,$64,$f9,$90,$29,$c4,$61 db 0,$a1,$44,$e9,$90,$39,$e4,$91,$40,$f1,$a4,$59,$10,$c9,$84,$41 db 0,$c1,$84,$49,$10,$d9,$a4,$71,$40,$11,$e4,$b9,$90,$69,$44,$21 db 0,$e1,$c4,$a9,$90,$79,$64,$51,$40,$31,$24,$19,$10,9,4,$1 sqrhi ;high(x*x) db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 db 1,1,1,1,1,1,1,2,2,2,2,2,3,3,3,3 db 4,4,4,4,5,5,5,5,6,6,6,7,7,7,8,8 db 9,9,9,$a,$a,$a,$b,$b,$c,$c,$d,$d,$e,$e,$f,$f db $10,$10,$11,$11,$12,$12,$13,$13,$14,$14,$15,$15,$16,$17,$17,$18 db $19,$19,$1a,$1a,$1b,$1c,$1c,$1d,$1e,$1e,$1f,$20,$21,$21,$22,$23 db $24,$24,$25,$26,$27,$27,$28,$29,$2a,$2b,$2b,$2c,$2d,$2e,$2f,$30 db $31,$31,$32,$33,$34,$35,$36,$37,$38,$39,$3a,$3b,$3c,$3d,$3e,$3f db $40,$41,$42,$43,$44,$45,$46,$47,$48,$49,$4a,$4b,$4c,$4d,$4e,$4f db $51,$52,$53,$54,$55,$56,$57,$59,$5a,$5b,$5c,$5d,$5f,$60,$61,$62 db $64,$65,$66,$67,$69,$6a,$6b,$6c,$6e,$6f,$70,$72,$73,$74,$76,$77 db $79,$7a,$7b,$7d,$7e,$7f,$81,$82,$84,$85,$87,$88,$8a,$8b,$8d,$8e db $90,$91,$93,$94,$96,$97,$99,$9a,$9c,$9d,$9f,$a0,$a2,$a4,$a5,$a7 db $a9,$aa,$ac,$ad,$af,$b1,$b2,$b4,$b6,$b7,$b9,$bb,$bd,$be,$c0,$c2 db $c4,$c5,$c7,$c9,$cb,$cc,$ce,$d0,$d2,$d4,$d5,$d7,$d9,$db,$dd,$df db $e1,$e2,$e4,$e6,$e8,$ea,$ec,$ee,$f0,$f2,$f4,$f6,$f8,$fa,$fc,$fe mul8x8to16 proc ;IN: A, B; OUT: A (HIGH), E (LOW); 44 bytes; 136-172 cycles on CPC6128 (154 on average) local l1,l2 cp b jr nc,l1 ld e,a ld a,b ld b,e l1 ld c,a sub b rra ld d,a ld a,c add a,b rra ld l,a ld h,high(sqrlo) ld a,(hl) ld e,l ld l,d jr nc,l2 sub (hl) ;odd ld l,e ld e,a inc h ;sets high(sqrhi) ld a,(hl) ld l,d sbc a,(hl) ld d,a ld a,e add a,b ld e,a ld a,d adc a,0 ret l2 sub (hl) ;even ld l,e ld e,a inc h ld a,(hl) ld l,d sbc a,(hl) ret endp
The calculations takes 136-172 cycles on CPC6128 (154 on average). Its 6502 equivalent has 67.5 clocks on average. I'm not sure that my z80 code is perfect - I couldn't use 16-bit operations. Any suggestions, improvements will be pleased. :-)
BTW I couldn't find fast multiplication algorithms with tables for z80...

I like your work. Well done!

Title: Re: the fast multiplication
Post by: litwr on 19:20, 13 November 15

Thanks. :)
I am starting to seek a way to the fastest possible 16x16=32 multiplication. I am using Programming:Integer Multiplication - CPCWiki (http://cpcwiki.eu/index.php/Programming:Integer_Multiplication#Fastest.2C_accurate_8bit_.2A_8bit_Unsigned_with_16_KB_tables) as the base. So I can get

Code Select


product dw 0,0
mul16x16 proc    ;ix,iy  -> a,hl,product
         ld c,ixl
         ld a,iyl
         ld l,a
         mul8x8   ;l*c -> de, uses bc,hl,a
         ld (product),de

         ld c,ixh
         ld a,iyh
         ld l,a
         mul8x8
         ld (product+2),de

         ld c,ixh
         ld a,iyl
         ld l,a
         mul8x8
         ld hl,(product+1)
         add hl,de
         push hl
         ld a,(product+3)
         adc a,0
         ld ixh,a

         ld c,ixl
         ld a,iyh
         ld l,a
         mul8x8
         pop hl
         xor a
         add hl,de
         adc a,ixh
         ret
         endp

It takes 548 cycles or 137 usec plus RET. Could anybody try to help to find something better? ;) BTW CPCwiki missed 16x16->32 multiplication. :(

Title: Re: the fast multiplication
Post by: Repose on 06:41, 23 December 24

I admire your quest for speed! I was also down this rabbit hole on 6502, resulting in a world record. On the C64, I can do this in 183uS, only 33% slower than your Z80 at 3.5x faster clock rate. The routines are not comparable though, because I use a 2k table.
I don't understand why people are against large tables if the math is so slow.
You can take your pick however (8x8):

Speed	Size
45.5	1580
47.5	1061
67.5	574
107	69

16x16:

Speed	Size
187.1	2170
260	1210
386	279

I'd love to see how fast a Z80 version of the 2k tables algorithm would be.

Title: Re: the fast multiplication
Post by: zhulien on 18:36, 18 February 25

do you have a fast divide to go with the fast multiplication?

Title: Re: the fast multiplication
Post by: GUNHED on 19:48, 18 February 25

Having all that ROM space... using tables is a very nice thing! :)

Title: Re: the fast multiplication
Post by: arnolde on 21:05, 18 February 25

I really like this collection of routines:
https://github.com/Zeda/Z80-Optimized-Routines (https://github.com/Zeda/Z80-Optimized-Routines)
Although they are not optimized for the CPC, I often find very useful snippets and inspiration there.

Title: Re: the fast multiplication
Post by: zhulien on 12:18, 26 February 25

Quote from: GUNHED on 19:48, 18 February 25Having all that ROM space... using tables is a very nice thing! :)

Actually I wonder if making a cpc maths rom is a good idea?

Title: Re: the fast multiplication
Post by: GUNHED on 00:04, 03 March 25

Yes! You should do that! :) :) :)

Title: Re: the fast multiplication
Post by: calc84maniac on 22:23, 05 March 25

Hey, so someone brought up the wiki's 512-byte table routine in the context of Game Boy development, and I thought I'd take a crack at seeing if the speed and size could be made closer to the 1KB table routine. I actually had a good bit of success, by making the following observations:

The x >= y constraint isn't necessary if negative values for floor((x - y) / 2) have their square calculated properly, so no need to swap inputs.
Going from floor((x + y) / 2) to floor((x - y) / 2) simply requires a subtraction of y, so no need to divide by 2 twice or save the input value of x.
To square a negative number z, it's sufficient to calculate z ^ 2 = (z + 256) ^ 2 - z * 512 - 65536, meaning a conditional offset of -(z * 512) in the subtracted square, or z * 512 in the final result.
When conditionally adding y to the result when x + y is odd, the carry can be propagated into the low bit of the 8-bit register holding the upper half of z * 512.
Both of these conditional adds can be turned into branchless adds of either the value or zero.

This results in the following implementation (which I adjusted to take the same inputs and outputs as the 1KB routine for a fair comparison), which comes out to 27 bytes and a constant 31 usec:

Code Select

; Input: A = x, L = y
; Output: DE = x * y
    ; C = floor((x + y) / 2)
    add a,l          ; 1
    rra              ; 2
    ld c,a           ; 3
    ; E = y if x + y is odd, else 0
    sbc a,a          ; 4
    and l            ; 5
    ld e,a           ; 6
    ; L = floor((x - y) / 2)
    ld a,c           ; 7
    sub l            ; 8
    ld l,a           ; 9
    ; D = L if negative, else 0
    sbc a,a          ; 10
    and l            ; 11
    ld d,a           ; 12
    ; Calculate low half of product
    ld h,high(sqrlo) ; 14
    ld b,h           ; 15
    ld a,(bc)        ; 17
    add a,e          ; 18
    rl d             ; 20, multiplies D by 2 and propagates carry
    sub (hl)         ; 22
    ld e,a           ; 23
    ; Calculate high half of product
    inc h            ; 24, loads high(sqrhi)
    ld b,h           ; 25
    ld a,(bc)        ; 27
    sbc a,(hl)       ; 29
    add a,d          ; 30
    ld d,a           ; 31

Title: Re: the fast multiplication
Post by: Prodatron on 13:15, 06 March 25

Very impressive! :o Thanks a lot, @calc84maniac !

I just added it to the Wiki, I hope it's ok for you:

https://www.cpcwiki.eu/index.php/Programming:Integer_Multiplication#Fast_8bit_.2A_8bit_Unsigned_with_only_512_bytes_of_tables_.28improved_version.29

Greetings to the TI scene!

Title: Re: the fast multiplication
Post by: zhulien on 03:42, 12 March 25

if a couple of small optimisations were done at the start would those cases be faster?

i.e.

if D or E = 0, then the result is 0
if D or E = 1, then we know the answer also
if D or E = 2, then do a shift, possibly similar for other common factors
if D and E are equal, maybe do a fast square routine ( https://www.youtube.com/watch?v=HWF1y5FfGXo )
I'm sure there are some others?

Title: Re: the fast multiplication
Post by: zhulien on 20:33, 12 March 25

How about

First byte is an index to a jump table of 256 address, jump to the most specific code for the multiplication factor.

0 can return 0
1 can return the the factor
2 shift 1x
3 shift 1x add original
4 shift 2x
5 shift 2x add original
6 shift 2x add original 2x
7 shift 3x sub original
8 shift 3x

Etc

Should fit in roughly 2kb to 3kb

Title: Re: the fast multiplication
Post by: Prodatron on 22:01, 12 March 25

Don't be shy, just do it. I am looking forward to your new improved routines, which are faster than the previous ones.

Title: Re: the fast multiplication
Post by: eto on 23:23, 12 March 25

Quote from: zhulien on 03:42, 12 March 25I'm sure there are some others?

I have the gut feeling that checking D & E for shortcuts adds so much overhead to the code that overall you will end up with a higher average execution time.

Title: Re: the fast multiplication
Post by: calc84maniac on 09:44, 14 March 25

Quote from: zhulien on 20:33, 12 March 25How about

First byte is an index to a jump table of 256 address, jump to the most specific code for the multiplication factor.

0 can return 0
1 can return the the factor
2 shift 1x
3 shift 1x add original
4 shift 2x
5 shift 2x add original
6 shift 2x add original 2x
7 shift 3x sub original
8 shift 3x

Etc

Should fit in roughly 2kb to 3kb

For this, if the idea is that speed is of the utmost importance, I can share a technique I've used before for lightning-fast table dispatch. The trick is to allocate each routine at an address aligned with its LSB corresponding to the table index, which means only the MSB must be looked up in the table. The dispatch then looks like:

Code Select

; Routine to dispatch is in L
ld h,high(routine_msb_table)
ld h,(hl)
jp (hl)

For multiplication in particular, the use of HL for the dispatch is a bit unfortunate, but starting each routine with either an EX DE, HL (if the multiplicand is loaded into both DE and BC before the dispatch) or a load from the multiplicand register into HL wouldn't be too bad.

If generating the routines programmatically at runtime (or at assembly time with some kind of macro magic), it's possible to take a greedy approach: start with generating routine 0 at $xx00, then use the LSB of the address after the generated routine to decide the next one to be generated. If that routine has already been generated, increment the address until the next ungenerated routine. This potentially ends up with wasted space, but saving 256 bytes on the lookup table can somewhat make up for that. Though, in this case in particular, larger multipliers are skewed towards larger implementations, so there might end up being a lot of padding in the first half of the later 256-byte pages when using this approach.

Title: Re: the fast multiplication
Post by: zhulien on 14:11, 14 March 25

Quote from: calc84maniac on 09:44, 14 March 25
Quote from: zhulien on 20:33, 12 March 25How about

First byte is an index to a jump table of 256 address, jump to the most specific code for the multiplication factor.

0 can return 0
1 can return the the factor
2 shift 1x
3 shift 1x add original
4 shift 2x
5 shift 2x add original
6 shift 2x add original 2x
7 shift 3x sub original
8 shift 3x

Etc

Should fit in roughly 2kb to 3kb
For this, if the idea is that speed is of the utmost importance, I can share a technique I've used before for lightning-fast table dispatch. The trick is to allocate each routine at an address aligned with its LSB corresponding to the table index, which means only the MSB must be looked up in the table. The dispatch then looks like:

Code Select Expand
; Routine to dispatch is in L ld h,high(routine_msb_table) ld h,(hl) jp (hl)
For multiplication in particular, the use of HL for the dispatch is a bit unfortunate, but starting each routine with either an EX DE, HL (if the multiplicand is loaded into both DE and BC before the dispatch) or a load from the multiplicand register into HL wouldn't be too bad.

If generating the routines programmatically at runtime (or at assembly time with some kind of macro magic), it's possible to take a greedy approach: start with generating routine 0 at $xx00, then use the LSB of the address after the generated routine to decide the next one to be generated. If that routine has already been generated, increment the address until the next ungenerated routine. This potentially ends up with wasted space, but saving 256 bytes on the lookup table can somewhat make up for that. Though, in this case in particular, larger multipliers are skewed towards larger implementations, so there might end up being a lot of padding in the first half of the later 256-byte pages when using this approach.

that's an awesome idea, i was always wondering a practical use of ld h,(hl)

Title: Re: the fast multiplication
Post by: zhulien on 14:14, 14 March 25

without the jump table optimsation above, 54 tstates overhead. I am positive someone can code better specific factor routines than I can.

Code Select

;timings from: https://map.grauw.nl/resources/z80instr.php
;notes: overhead of jumps table
;		54+<factortiming>


;HL = C * E
multce:
										; e should be jump table factor
										; c the other factor
	ld d, 0								;7
	ld hl, jumptable					;10
	add hl, de							;11
	ld e, (hl)							;7
	inc hl								;4
	ld d, (hl)							;7
	ex de, hl							;4
	jp (hl)								;4, total: 54
	
jumptable:
	dw times0, times1, times2, times3, times4, times5, times6, times7, times8
	dw times9, times10, times11, times12, times13, times14, times15, times16

times0:

	ld h, 0								;7
	ret									;10, total: 17

times1:

	ld l, c								;4
	ld h, 0								;4
	ret									;10, total: 18

times2:

	ld l, c								;4
	ld h, 0								;7
	add hl, hl							;11 x2
	ret									;10, total: 32

times3:

	ld l, c								;4
	xor a 								;4
	ld h, a								;4
	ld b, a								;4
	add hl, hl							;11 x2
	add hl, bc							;11 x3
	ret									;10, total: 48

times4:

	ld l, c								;4
	ld h, 0								;7
	add hl, hl							;11 x2
	add hl, hl							;11 x4
	ret									;10, total: 43

times5:

	ld l, c								;4
	xor a 								;4
	ld h, a								;4
	ld b, a								;4
	add hl, hl							;11 x2
	add hl, hl							;11 x4
	add hl, bc							;11 x5
	ret									;10, total: 59

times6:

	ld l, c								;4
	xor a 								;4
	ld h, a								;4
	ld b, a								;4
	add hl, hl							;11	x2
	add hl, bc							;11 x3
	add hl, hl							;11 x6
	ret									;10, total: 59

times7:

	ld l, c								;4
	xor a 								;4
	ld h, a								;4
	ld b, a								;4
	add hl, hl							;11 x2
	add hl, bc							;11 x3
	add hl, hl							;11 x6
	add hl, bc							;11 x7
	ret									;10, total: 70

times8:

	ld l, c								;4
	xor a 								;4
	ld h, a								;4
	ld b, a								;4
	add hl, hl							;11	x2
	add hl, hl							;11 x4
	add hl, hl							;11 x8
	ret									;10, total: 59

times9:

	ld l, c								;4
	xor a 								;4
	ld h, a								;4
	ld b, a								;4
	add hl, hl							;11	x2
	add hl, hl							;11 x4
	add hl, hl							;11 x8
	add hl, bc							;11 x9
	ret									;10, total: 70

times10:

	ld l, c								;4
	xor a 								;4
	ld h, a								;4
	ld b, a								;4
	add hl, hl							;11	x2
	add hl, hl							;11 x4
	add hl, bc							;11 x5
	add hl, hl							;11 x10
	ret									;10, total: 70

times11:

	ld l, c								;4
	xor a 								;4
	ld h, a								;4
	ld b, a								;4
	add hl, hl							;11	x2
	add hl, hl							;11 x4
	add hl, bc							;11 x5
	add hl, hl							;11 x10
	ret									;10, total: 70

times12:

	ld l, c								;4
	xor a 								;4
	ld h, a								;4
	ld b, a								;4
	add hl, hl							;11	x2
	add hl, bc							;11 x3
	add hl, hl							;11 x6
	add hl, hl							;11 x12
	ret									;10, total: 70

times13:

	ld l, c								;4
	xor a 								;4
	ld h, a								;4
	ld b, a								;4
	add hl, hl							;11	x2
	add hl, bc							;11 x3
	add hl, hl							;11 x6
	add hl, hl							;11 x12
	add hl, bc							;11 x13
	ret									;10, total: 81

times14:

	ld l, c								;4
	xor a 								;4
	ld h, a								;4
	ld b, a								;4
	add hl, hl							;11 x2
	add hl, bc							;11 x3
	add hl, hl							;11 x6
	add hl, bc							;11 x7
	add hl, hl							;11 x14
	ret									;10, total: 81

times15:

	ld l, c								;4
	xor a 								;4
	ld h, a								;4
	ld b, a								;4
	add hl, hl							;11 x2
	add hl, bc							;11 x3
	add hl, hl							;11 x6
	add hl, bc							;11 x7
	add hl, hl							;11 x14
	add hl, bc							;11 x15
	ret									;10, total: 92

times16:

	ld l, c								;4
	xor a 								;4
	ld h, a								;4
	ld b, a								;4
	add hl, hl							;11	x2
	add hl, hl							;11 x4
	add hl, hl							;11 x8
	add hl, hl							;11 x8
	ret									;10, total: 70

I did attempt two other ideas before just jumping, that was to reverse the highest and lowest factors so the lowest one is always the one jumped, and the other was to return on a factor of 0. Both optimisations bring the overhead to about 113 tstates.

So then. If we have 54 tstates overhead, or we can lower that considerably by using the ld h, (hl) optimisation (18 tstates from 54?), then can we make every factor up to 256 optimal.

Title: Re: the fast multiplication
Post by: zhulien on 15:07, 15 March 25

Couldn't a not as optimal way than the ld h, (hl) method but faster than my original method just to ensure hl falls on a 256 byte boundary to start. Just
; l is already a factor
ld h, 0
add hl, hl
ld h,#40 ;high address byte of 256 byte table
call (hl)

30 tstates

Which I guess could lead to a separate topic of fast table lookups. Can be faster if don't need so many jumps, I.e. 128, by just

add a,a
ld h,#40
ld l, a
call (hl)

Screen scanline address lookups?

if we wanted a wide-screen game or game with a 128 line playfield

CPCWiki forum

General Category => Programming => Topic started by: litwr on 07:07, 18 October 15