News:

Printed Amstrad Addict magazine announced, check it out here!

Main Menu

the fast multiplication

Started by litwr, 07:07, 18 October 15

Previous topic - Next topic

0 Members and 2 Guests are viewing this topic.

zhulien

#25
if a couple of small optimisations were done at the start would those cases be faster?

i.e.

if D or E = 0, then the result is 0
if D or E = 1, then we know the answer also
if D or E = 2, then do a shift, possibly similar for other common factors
if D and E are equal, maybe do a fast square routine ( )
I'm sure there are some others?

zhulien

How about

First byte is an index to a jump table of 256 address,  jump to the most specific code for the multiplication factor.

0 can return 0
1 can return the the factor
2 shift 1x
3 shift 1x add original
4 shift 2x
5 shift 2x add original
6 shift 2x add original 2x
7 shift 3x sub original
8 shift 3x

Etc

Should fit in roughly 2kb to 3kb

Prodatron

Don't be shy, just do it. I am looking forward to your new improved routines, which are faster than the previous ones.

GRAPHICAL Z80 MULTITASKING OPERATING SYSTEM

eto

Quote from: zhulien on 03:42, 12 March 25I'm sure there are some others?
I have the gut feeling that checking D & E for shortcuts adds so much overhead to the code that overall you will end up with a higher average execution time.

calc84maniac

Quote from: zhulien on 20:33, 12 March 25How about

First byte is an index to a jump table of 256 address,  jump to the most specific code for the multiplication factor.

0 can return 0
1 can return the the factor
2 shift 1x
3 shift 1x add original
4 shift 2x
5 shift 2x add original
6 shift 2x add original 2x
7 shift 3x sub original
8 shift 3x

Etc

Should fit in roughly 2kb to 3kb
For this, if the idea is that speed is of the utmost importance, I can share a technique I've used before for lightning-fast table dispatch. The trick is to allocate each routine at an address aligned with its LSB corresponding to the table index, which means only the MSB must be looked up in the table. The dispatch then looks like:

; Routine to dispatch is in L
ld h,high(routine_msb_table)
ld h,(hl)
jp (hl)

For multiplication in particular, the use of HL for the dispatch is a bit unfortunate, but starting each routine with either an EX DE, HL (if the multiplicand is loaded into both DE and BC before the dispatch) or a load from the multiplicand register into HL wouldn't be too bad.

If generating the routines programmatically at runtime (or at assembly time with some kind of macro magic), it's possible to take a greedy approach: start with generating routine 0 at $xx00, then use the LSB of the address after the generated routine to decide the next one to be generated. If that routine has already been generated, increment the address until the next ungenerated routine. This potentially ends up with wasted space, but saving 256 bytes on the lookup table can somewhat make up for that. Though, in this case in particular, larger multipliers are skewed towards larger implementations, so there might end up being a lot of padding in the first half of the later 256-byte pages when using this approach.

zhulien

Quote from: calc84maniac on 09:44, 14 March 25
Quote from: zhulien on 20:33, 12 March 25How about

First byte is an index to a jump table of 256 address,  jump to the most specific code for the multiplication factor.

0 can return 0
1 can return the the factor
2 shift 1x
3 shift 1x add original
4 shift 2x
5 shift 2x add original
6 shift 2x add original 2x
7 shift 3x sub original
8 shift 3x

Etc

Should fit in roughly 2kb to 3kb
For this, if the idea is that speed is of the utmost importance, I can share a technique I've used before for lightning-fast table dispatch. The trick is to allocate each routine at an address aligned with its LSB corresponding to the table index, which means only the MSB must be looked up in the table. The dispatch then looks like:

; Routine to dispatch is in L
ld h,high(routine_msb_table)
ld h,(hl)
jp (hl)

For multiplication in particular, the use of HL for the dispatch is a bit unfortunate, but starting each routine with either an EX DE, HL (if the multiplicand is loaded into both DE and BC before the dispatch) or a load from the multiplicand register into HL wouldn't be too bad.

If generating the routines programmatically at runtime (or at assembly time with some kind of macro magic), it's possible to take a greedy approach: start with generating routine 0 at $xx00, then use the LSB of the address after the generated routine to decide the next one to be generated. If that routine has already been generated, increment the address until the next ungenerated routine. This potentially ends up with wasted space, but saving 256 bytes on the lookup table can somewhat make up for that. Though, in this case in particular, larger multipliers are skewed towards larger implementations, so there might end up being a lot of padding in the first half of the later 256-byte pages when using this approach.
that's an awesome idea, i was always wondering a practical use of ld h,(hl)

zhulien

#31
without the jump table optimsation above, 54 tstates overhead.  I am positive someone can code better specific factor routines than I can.

;timings from: https://map.grauw.nl/resources/z80instr.php
;notes: overhead of jumps table
; 54+<factortiming>


;HL = C * E
multce:
; e should be jump table factor
; c the other factor
ld d, 0 ;7
ld hl, jumptable ;10
add hl, de ;11
ld e, (hl) ;7
inc hl ;4
ld d, (hl) ;7
ex de, hl ;4
jp (hl) ;4, total: 54

jumptable:
dw times0, times1, times2, times3, times4, times5, times6, times7, times8
dw times9, times10, times11, times12, times13, times14, times15, times16

times0:

ld h, 0 ;7
ret ;10, total: 17

times1:

ld l, c ;4
ld h, 0 ;4
ret ;10, total: 18

times2:

ld l, c ;4
ld h, 0 ;7
add hl, hl ;11 x2
ret ;10, total: 32

times3:

ld l, c ;4
xor a ;4
ld h, a ;4
ld b, a ;4
add hl, hl ;11 x2
add hl, bc ;11 x3
ret ;10, total: 48

times4:

ld l, c ;4
ld h, 0 ;7
add hl, hl ;11 x2
add hl, hl ;11 x4
ret ;10, total: 43

times5:

ld l, c ;4
xor a ;4
ld h, a ;4
ld b, a ;4
add hl, hl ;11 x2
add hl, hl ;11 x4
add hl, bc ;11 x5
ret ;10, total: 59

times6:

ld l, c ;4
xor a ;4
ld h, a ;4
ld b, a ;4
add hl, hl ;11 x2
add hl, bc ;11 x3
add hl, hl ;11 x6
ret ;10, total: 59

times7:

ld l, c ;4
xor a ;4
ld h, a ;4
ld b, a ;4
add hl, hl ;11 x2
add hl, bc ;11 x3
add hl, hl ;11 x6
add hl, bc ;11 x7
ret ;10, total: 70

times8:

ld l, c ;4
xor a ;4
ld h, a ;4
ld b, a ;4
add hl, hl ;11 x2
add hl, hl ;11 x4
add hl, hl ;11 x8
ret ;10, total: 59

times9:

ld l, c ;4
xor a ;4
ld h, a ;4
ld b, a ;4
add hl, hl ;11 x2
add hl, hl ;11 x4
add hl, hl ;11 x8
add hl, bc ;11 x9
ret ;10, total: 70

times10:

ld l, c ;4
xor a ;4
ld h, a ;4
ld b, a ;4
add hl, hl ;11 x2
add hl, hl ;11 x4
add hl, bc ;11 x5
add hl, hl ;11 x10
ret ;10, total: 70

times11:

ld l, c ;4
xor a ;4
ld h, a ;4
ld b, a ;4
add hl, hl ;11 x2
add hl, hl ;11 x4
add hl, bc ;11 x5
add hl, hl ;11 x10
ret ;10, total: 70

times12:

ld l, c ;4
xor a ;4
ld h, a ;4
ld b, a ;4
add hl, hl ;11 x2
add hl, bc ;11 x3
add hl, hl ;11 x6
add hl, hl ;11 x12
ret ;10, total: 70

times13:

ld l, c ;4
xor a ;4
ld h, a ;4
ld b, a ;4
add hl, hl ;11 x2
add hl, bc ;11 x3
add hl, hl ;11 x6
add hl, hl ;11 x12
add hl, bc ;11 x13
ret ;10, total: 81

times14:

ld l, c ;4
xor a ;4
ld h, a ;4
ld b, a ;4
add hl, hl ;11 x2
add hl, bc ;11 x3
add hl, hl ;11 x6
add hl, bc ;11 x7
add hl, hl ;11 x14
ret ;10, total: 81

times15:

ld l, c ;4
xor a ;4
ld h, a ;4
ld b, a ;4
add hl, hl ;11 x2
add hl, bc ;11 x3
add hl, hl ;11 x6
add hl, bc ;11 x7
add hl, hl ;11 x14
add hl, bc ;11 x15
ret ;10, total: 92

times16:

ld l, c ;4
xor a ;4
ld h, a ;4
ld b, a ;4
add hl, hl ;11 x2
add hl, hl ;11 x4
add hl, hl ;11 x8
add hl, hl ;11 x8
ret ;10, total: 70

I did attempt two other ideas before just jumping, that was to reverse the highest and lowest factors so the lowest one is always the one jumped, and the other was to return on a factor of 0.  Both optimisations bring the overhead to about 113 tstates.

So then.  If we have 54 tstates overhead, or we can lower that considerably by using the ld h, (hl) optimisation (18 tstates from 54?), then can we make every factor up to 256 optimal.

zhulien

#32
Couldn't a not as optimal way than the ld h, (hl) method but faster than my original method just to ensure hl falls on a 256 byte boundary to start. Just
                 ; l is already a factor
ld h, 0
add hl, hl
ld h,#40   ;high address byte of 256 byte table
call (hl)

30 tstates

Which I guess could lead to a separate topic of fast table lookups. Can be faster if don't need so many jumps, I.e. 128, by just

add a,a
ld h,#40
ld l, a
call (hl)

Screen scanline address lookups?

if we wanted a wide-screen game or game with a 128 line playfield

Powered by SMFPacks Menu Editor Mod