Changes

Jump to: navigation, search

Programming:Integer Multiplication

10,405 bytes added, 17:31, 1 March 2018
/* 16bit * 16bit Unsigned -> 32 bits result */ forgotten label
'''Output:''' HL = ''Product''
'''CPC Cycles:''' 188-244 (216 on average) = 47-61 (54) usec
 
'''Size:''' 33 bytes
 
The additional clocks and bytes maybe required to set up D and L and for RET.
<pre>
sla h ; optimised 1st iteration
add hl,de ; ...
</pre>
 
== Classic 16bit * 8bit Unsigned ==
'''Output:''' A:HL = ''Product''
 
'''CPC Cycles:''' 212-300 (256 on average) = 53-75 (64) usec
 
'''Size:''' 47 bytes
<pre>
adc a,c ; ...
</pre>
 
== Fast 8bit * 8bit Unsigned (using log / antilog tables) ==
'''Output:''' DE = ''Product''
'''CPC Cycles:''' 124 = 31 usec
 
'''Size:''' 1024 bytes (768 bytes for the tables)
 
'''Warning:''' It is VERY approximate routine, e.g., 0*7=7, 1*28=27, 10*81=790, 100*100=9742, ...
<pre>
</pre>
== Fastest, accurate 8bit * 8bit Unsigned with 16 KB tables ==
'''Input:''' L =Multiplier, C = 16bit * 16bit Unsigned ==Multiplicand
'''InputOutput:''' BC = ''Multiplier'', DE = ''Multiplicand''Product
'''OutputCPC Cycles:''' A,HL 80 = 20 usec ''Product'Size:''' 14 bytes of code + 16 KB tables I'm currently working on a new routine, based on a routine by Kirk Meyer [http://www.ticalc.org/pub/86/asm/source/routines/vfmult.asm] which uses nibble multiplication tables. I have a working, tested version which uses 16K of tables and can perform multiplication in 20&mu;s. The code to produce the tables is below:
<pre>
.maketabsld hl,hltab1.makelp1ld a,lrlaand #1eadd restab / 256ld (hl),ainc ljr nz,makelp1inc h.makelp2ld a,lrra:rra:rraand #1ejr z,usezadd restab2 - restab / 256 - 2.usezadd restab / 256ld (hl),ainc ljr nz,makelp2inc h ; restab.makelp3ld a,(hl)inc hld d,(hl)inc hadd lld (hl),ainc hld a,0adc dld (hl),adec h:dec h:dec hinc ljr nz,makelp3inc hinc hld a,hcp restab2 / 256 - 2jr nz,makelp3ld b,hld c,linc binc bld h,restab / 256 + 2.makelp4ld e,(hl)inc hld d,(hl)dec hex de,hladd hl,hl:add hl,hl:add hl,hl:add hl,hlex de,hlld a,eld (bc),ainc bld a,dld (bc),adec binc linc cjr nz,makelp4inc hinc hinc binc bld a,hcp restab2 / 256jr nz,makelp4ret ds -$ and #ff.hltab1ds 256.hltab2ds 256.restabds 512 * 16.restab2ds 512 * 15</pre> The code to perform the multiplication (DE = L * C): <pre>ld h,hltab1 / 256 ; 2ld b,(hl) ; 4inc h ; 5ld h,(hl) ; 7ld l,c ; 8ld a,(bc) ; 10add (hl) ; 12ld e,a ; 13inc b ; 14inc h ; 15ld a,(bc) ; 17adc (hl) ; 19ld d,a ; 20</pre> == Fast, accurate 8bit * 8bit Unsigned with 8 KB tables == '''Input:''' L = Multiplier, C = Multiplicand '''Output:''' DE = Product '''CPC Cycles:''' 100 = 25 usec '''Size:''' 19 bytes of code + 8 KB tables 16K is a lot of memory to use for tables, but I'm working on a way to reduce this to 8K while maintaining similar performance (around 26&mu;s). The idea is rather than using tables for the low and high nibbles, use tables for alternate bits. So there would be 16 tables for the values #00, #01, #04, #05, #10, #11, #14, #15, #40, #41, #44, #45, #50, #51, #54, #55. The values can be shifted by 1 (rather than 4) to give values for the alternate bits using a simple ADD HL,HL. This routine should be easy to adapt to a 16bit * 8bit unsigned multiplication or even a 16bit * 16bit, using simple additions. [[User:Executioner|Executioner]] 03:58, 8 May 2007 (CEST) Ok, here is the new 8K routine, first the routine to build the tables: <pre>.maketabsld hl,hltab1.makelp5ld a,lld bc,0rla:rl b:rla:rl crla:rl b:rla:rl crla:rl b:rla:rl crla:rl b:rla:rl cld a,badd aadd restab / 256ld (hl),ainc hld a,cadd aadd restab / 256ld (hl),adec hinc ljr nz,makelp5ld h,restab / 256 + 2.makelp6ld (hl),linc ljr nz,makelp6inc h:inc hld b,2.makelp8push bcxor arr b:rra:rrcarr b:rra:rrcarr b:rra:rrcarr b:rra:rrca.makelp7push hlpush afcall mulLbyAex de,hlpop afpop hlld (hl),einc hld (hl),ddec hinc ljr nz,makelp7inc h:inc hpop bcinc bbit 4,bjr z,makelp8ret .mulLbyA ; MAX times ; MIN timesld e,l ; 1 ; 1ld d,0 ; 3 ; 3add a ; 4 ; 4ld h,a ; 5 ; 5jr c,ncad0 ; 8 ; 8ld l,d .ncad0add hl,hl ; 11 ; 11jr nc,ncad1 ; 13 ; 14add hl,de ; 16.ncad1add hl,hl ; 19 ; 17jr nc,ncad2 ; 21 ; 20add hl,de ; 24.ncad2add hl,hl ; 27 ; 23jr nc,ncad3 ; 29 ; 26add hl,de ; 32.ncad3add hl,hl ; 35 ; 29jr nc,ncad4 ; 37 ; 32add hl,de ; 40.ncad4add hl,hl ; 43 ; 35jr nc,ncad5 ; 45 ; 38add hl,de ; 48.ncad5add hl,hl ; 51 ; 41jr nc,ncad6 ; 53 ; 44add hl,de ; 56.ncad6add hl,hl ; 59 ; 47ld a,h ; 60 ; 48ret nc ; 61 ; 50?add hl,de ; 64ld a,h ; 65ret ds -$ and #ff .hltab1ds 256.hltab2ds 256.restabds 512 * 16</pre> And the routine to do the multiplication: <pre>.umultCL ; Using 8 bit only (25us - DE = L * C)ld h,hltab2 / 256 ; 2ld b,(hl) ; 4dec h ; 5ld h,(hl) ; 7ld l,c ; 8ld a,(hl) ; 10add a ; 11inc h ; 12ld d,(hl) ; 14rl d ; 16ld h,b ; 17add (hl) ; 19ld e,a ; 20inc h ; 21ld a,(hl) ; 23adc d ; 24ld d,a ; 25ret</pre> == Very fast 8bit * 8bit Unsigned with only 1K of tables == '''Input:''' A = ''Multiplier'', L = ''Multiplicand'' '''Output:''' DE = ''Product'' '''CPC Cycles''': 104-112 (108 on average) = 26-28 (27) usec '''Size''': 24 bytes of code and 1024 bytes for the tables Here's a new routine I've developed which uses the formula ab = ((a + b)<sup>2</sup> - (a - b)<sup>2</sup>) / 4. It's based on a routine for the 6502 by Stephen Judd in a C= Hacking article. Because of differences between the way the 6502 does register indexing it was quite difficult to actually get this working, but it's a great compromise between speed and space since it only uses 1K of tables (as opposed to the 16K or 8K table routines above), and can still manage to do the job in a maximum of 28 microseconds. <br>Firstly, once again, we need some code to generate the tables. These tables contain values for <br>x<sup>2</sup>/4 for 9 bit values of x, with the LSB when bit 8 is zero first followed by the MSB.<pre>.gen_sq4 xor a ld de,umul_tab + #1ff ld (de),a dec d ld (de),a ld h,d ld l,e inc e ld c,e ld b,2  .sq4_lp ld a,b cp 2 ld a,e rra add (hl) ld (de),a inc h inc d ld a,(hl) adc c ld (de),a dec d ld h,d inc l inc e jr nz,sq4_lp inc d inc d djnz sq4_lp ret align #100.umul_tab ds #400 </pre>Now for the actual multiply routine: <pre> ld h,umul_tab_lo / #100 ; 2 ld b,h ; 3 add l ; 4 ld c,a ; 5 jr nc,@noovf ; 7 inc b ; 8 inc b ; 9.@noovf sub l ; 10 sub l ; 11 jr nc,@noneg ; 13 neg ; 15.@noneg ld l,a ; 16 ld a,(bc) ; 18 sub (hl) ; 20 ld e,a ; 21 inc b ; 22 inc h ; 23 ld a,(bc) ; 25 sbc (hl) ; 27 ld d,a ; 28 </pre>This code could easily be converted to a macro as it's only 24 bytes. I've tried to optimize it further but with no luck! [[User:Executioner|Executioner]] 06:25, 4 April 2008 (CEST) == Fast 8bit * 8bit Unsigned with only 512 bytes of tables == '''Input''': A = Multiplier, B = Multiplicant '''Output''': A:E = Product '''CPC Cycles''': 136-172 (154 on average) = 34-43 (38.5) usec '''Size''': 44 bytes of code and 512 bytes for the tables Fast 8 bit unsigned multiplication with 16 bit result. It uses formula<br>x*y = ((x+y)/2)<sup>2</sup> - ((x-y)/2)<sup>2</sup>, if x+y is even<br> = ((x+y-1)/2)<sup>2</sup> - ((x-y-1)/2)<sup>2</sup> + y, if x+y is odd and x>=y <pre> cp b jr nc,l1  ld e,a ld a,b ld b,e l1 ld c,a sub b rra ld d,a ld a,c add a,b rra ld l,a ld h,high(sqrlo) ld a,(hl) ld e,l ld l,d jr nc,l2  sub (hl) ;odd ld l,e ld e,a inc h ;loads high(sqrhi) ld a,(hl) ld l,d sbc a,(hl) ld d,a  ld a,e add a,b ld e,a ld a,d adc a,0 ret l2 sub (hl) ;even ld l,e ld e,a inc h ld a,(hl) ld l,d sbc a,(hl) ret sqrlo ;low(x*x) should be at the page border db 0,1,4,9,$10,$19,$24,$31,$40,$51,$64,$79,$90,$a9,$c4,$e1 db 0,$21,$44,$69,$90,$b9,$e4,$11,$40,$71,$a4,$d9,$10,$49,$84,$c1 db 0,$41,$84,$c9,$10,$59,$a4,$f1,$40,$91,$e4,$39,$90,$e9,$44,$a1 db 0,$61,$c4,$29,$90,$f9,$64,$d1,$40,$b1,$24,$99,$10,$89,4,$81 db 0,$81,4,$89,$10,$99,$24,$b1,$40,$d1,$64,$f9,$90,$29,$c4,$61 db 0,$a1,$44,$e9,$90,$39,$e4,$91,$40,$f1,$a4,$59,$10,$c9,$84,$41 db 0,$c1,$84,$49,$10,$d9,$a4,$71,$40,$11,$e4,$b9,$90,$69,$44,$21 db 0,$e1,$c4,$a9,$90,$79,$64,$51,$40,$31,$24,$19,$10,9,4,$1 db 0,1,4,9,$10,$19,$24,$31,$40,$51,$64,$79,$90,$a9,$c4,$e1 db 0,$21,$44,$69,$90,$b9,$e4,$11,$40,$71,$a4,$d9,$10,$49,$84,$c1 db 0,$41,$84,$c9,$10,$59,$a4,$f1,$40,$91,$e4,$39,$90,$e9,$44,$a1 db 0,$61,$c4,$29,$90,$f9,$64,$d1,$40,$b1,$24,$99,$10,$89,4,$81 db 0,$81,4,$89,$10,$99,$24,$b1,$40,$d1,$64,$f9,$90,$29,$c4,$61 db 0,$a1,$44,$e9,$90,$39,$e4,$91,$40,$f1,$a4,$59,$10,$c9,$84,$41 db 0,$c1,$84,$49,$10,$d9,$a4,$71,$40,$11,$e4,$b9,$90,$69,$44,$21 db 0,$e1,$c4,$a9,$90,$79,$64,$51,$40,$31,$24,$19,$10,9,4,$1sqrhi ;high(x*x) db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 db 1,1,1,1,1,1,1,2,2,2,2,2,3,3,3,3 db 4,4,4,4,5,5,5,5,6,6,6,7,7,7,8,8 db 9,9,9,$a,$a,$a,$b,$b,$c,$c,$d,$d,$e,$e,$f,$f db $10,$10,$11,$11,$12,$12,$13,$13,$14,$14,$15,$15,$16,$17,$17,$18 db $19,$19,$1a,$1a,$1b,$1c,$1c,$1d,$1e,$1e,$1f,$20,$21,$21,$22,$23 db $24,$24,$25,$26,$27,$27,$28,$29,$2a,$2b,$2b,$2c,$2d,$2e,$2f,$30 db $31,$31,$32,$33,$34,$35,$36,$37,$38,$39,$3a,$3b,$3c,$3d,$3e,$3f db $40,$41,$42,$43,$44,$45,$46,$47,$48,$49,$4a,$4b,$4c,$4d,$4e,$4f db $51,$52,$53,$54,$55,$56,$57,$59,$5a,$5b,$5c,$5d,$5f,$60,$61,$62 db $64,$65,$66,$67,$69,$6a,$6b,$6c,$6e,$6f,$70,$72,$73,$74,$76,$77 db $79,$7a,$7b,$7d,$7e,$7f,$81,$82,$84,$85,$87,$88,$8a,$8b,$8d,$8e db $90,$91,$93,$94,$96,$97,$99,$9a,$9c,$9d,$9f,$a0,$a2,$a4,$a5,$a7 db $a9,$aa,$ac,$ad,$af,$b1,$b2,$b4,$b6,$b7,$b9,$bb,$bd,$be,$c0,$c2 db $c4,$c5,$c7,$c9,$cb,$cc,$ce,$d0,$d2,$d4,$d5,$d7,$d9,$db,$dd,$df db $e1,$e2,$e4,$e6,$e8,$ea,$ec,$ee,$f0,$f2,$f4,$f6,$f8,$fa,$fc,$fe</pre> [[User:Litwr|Litwr]] 10:25, 24 October 2015 (CEST) == 16bit * 16bit Unsigned -> 32 bits result == '''Input:''' BC = ''Multiplier'', DE = ''Multiplicand'' '''Output:''' DE,HL = ''Product''<pre>ld hl,0ld a,16muluw add hl,hl rl e rl d jr nc,muluw_cont add hl,bc jr nc,muluw_cont inc demuluw_cont dec a jr nz,muluw</pre> == 16bit * 16bit Unsigned -> 24 bits result == '''Input:''' BC = ''Multiplier'', DE = ''Multiplicand'' '''Output:''' A,HL = ''Product''<pre>ld ix,0
ld hl,0
.mul24b1:
db #dd:ld d,h
ex de,hl
ret</pre> == Web links == * [http://map.tni.nl/articles/mult_div_shifts.php Multiplications and divisions on the MSX Assembly Page]* [http://www.mccw.hetlab.tk/92/Multiplication/en.html "Multiplication on a Z80" ( MSX Computer & Club Webmagazine)]* [http://www.andreadrian.de/oldcpu/Z80_number_cruncher.html The Z80 number cruncher (Andre Adrian)] (16/32bit multiplication, addition, right shift) 
[[Category:Programming]]
36
edits