Fast decrement of 16-bit value via index register?

Started by Cwiiis, 20:20, 24 October 21

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Cwiiis

So I have a structure in memory that, at least for compact, readable code, it makes sense to use an index register to access. Within it are a couple of 16-bit values that depending on the contents of a, get incremented or decremented.

Increment isn't a big deal;

    inc (ix+Struct1)
    jr nz,NextThing
    inc (ix+Struct1+1)
    jr NextThing


But the problem comes with decrement... Given it doesn't affect the carry flag, I don't see a way of easily doing the decrement without also affecting the a register, which I'd like to preserve... Currently I load it into hl, decrement hl and load back, but that's going to be twice as slow as the above.

    ld l,(ix+Struct1)
    ld h,(ix+Struct1+1)
    dec hl
    ld (ix+Struct1),l
    ld (ix+Struct1+1),h


I guess I could do something like this instead:

    ld h,a
    ld a,&ff
    add (ix+Struct1)
    ld (ix+Struct1),a
    ld a,h
    jr c,NextThing
    dec (ix+Struct1+1)
    jr NextThing


But that just seems so arcane... Am I missing something even simpler/better?

Cwiiis

Heh, of course as soon as I type it out, I realise I actually don't need to preserve a and so that second one reads fine without the ld h,a / ld a,h... Still interested to hear if anyone has anything faster though :)

fgbrain


why use IX in the first place?
its much faster to use HL instead..


Quote

ld hl, ix + Struct1

dec (hl)   ;  OR inc (hl)



and you're done.
_____

6128 (UK keyboard, Crtc type 0/2), 6128+ (UK keyboard), 3.5" and 5.25" drives, Reset switch and Digiblaster (selfmade), Inicron Romram box, Bryce Megaflash, SVideo & PS/2 mouse, , Magnum Lightgun, X-MEM, X4 Board, C4CPC, Multiface2 X4, RTC X4 and Gotek USB Floppy emulator.

Cwiiis

Quote from: fgbrain on 20:55, 24 October 21

why use IX in the first place?
its much faster to use HL instead..





and you're done.

Indeed, this would be quicker if it wasn't part of a larger structure with multiple values - I did do this without IX first, but the amount of address manipulation required made the code pretty unreadable and though likely quicker, wasn't so much quicker to be worth the cost in readability (at least for now). The examples I gave above are dummy examples, not the actual code.

Cwiiis

Quote from: fgbrain on 20:55, 24 October 21

why use IX in the first place?
its much faster to use HL instead..





and you're done.

Oh, I actually misread this - you're not done there because that's an 8-bit inc/dec, but it is faster in the case of over/underflow as it only has the one IX access in that case instead of the current two. Nice :)

Prodatron

#5
Quote from: fgbrain on 20:55, 24 October 21ld hl, ix + Struct1
Well, such a Z80 opcode doesn't exist :) but I guess you mean something like this:
Instead of...


LD IX,datarecord
...
do some stuff with (IX+struct1)...


...you would do...


LD HL,datarecord+struct1
...
do some stuff with (HL)


But as soon as you have multiple data records with the same structure, you would have to patch your code or use e.g. LD BC,struct1:ADD HL,BC to achieve the same what you can do with IX.


But this could still be faster, if you don't need to manipulate more things inside the data record.


Example:


ld ix,datarecord      ;4
[...]
ld l,(ix+Struct1)     ;5
ld h,(ix+Struct1+1)   ;5
dec hl                ;2
ld (ix+Struct1),l     ;5
ld (ix+Struct1+1),h   ;5 -> 26 NOPs


Vs.:


ld hl,datarecord      ;3
[...]
ld bc,Struct1         ;3
add hl,bc             ;3
ld c,(hl)             ;2
inc hl                ;2
ld b,(hl)             ;2
dec bc                ;2

ld (hl),b             ;2
dec hl                ;2
ld (hl),c             ;2 -> 23 NOPs


If you have several more things to do with your data structure, especially if the affected bytes are not very close to each other, the IX-usage could be faster again. That can be different from case to case.

GRAPHICAL Z80 MULTITASKING OPERATING SYSTEM

Prodatron

Summary: if you can move through your data structure in a sequential way, the HL-methode is faster. If you have to access it in very a random way, the IX-methode ist faster.

GRAPHICAL Z80 MULTITASKING OPERATING SYSTEM

m_dr_m

Quote from: Cwiiis on 20:20, 24 October 21But the problem comes with decrement...

Then store the opposite of the value and increment it!


Also: consider structure of arrays rather than array of structures.

andycadley

If you'll never have more than 256 records you can page align things so you only need Inc h to move to the next field of a record and Inc l to move between records. You can also jump to specific points in either case by directly loading h or l with the relevant offsets.

fgbrain


Quote
Oh, I actually misread this - you're not done there because that's an 8-bit inc/dec, but it is faster in the case of over/underflow as it only has the one IX access in that case instead of the current two. Nice
my bad, you are right.


Back to original question,carry flag is not affected with the INC as well with DEC.


perhaps then some other flag is suitable for your task ??
_____

6128 (UK keyboard, Crtc type 0/2), 6128+ (UK keyboard), 3.5" and 5.25" drives, Reset switch and Digiblaster (selfmade), Inicron Romram box, Bryce Megaflash, SVideo & PS/2 mouse, , Magnum Lightgun, X-MEM, X4 Board, C4CPC, Multiface2 X4, RTC X4 and Gotek USB Floppy emulator.

eto

Quote from: fgbrain on 00:29, 25 October 21

perhaps then some other flag is suitable for your task ??


DEC affects the P/V flag, but only for 80h.


As A doesn't need to be preserved, might this work?



dec (ix+Struct1)
ld a,(ix+Struct1)
cp a,&ff
jr nz,nextThing
dec (ix+Struct1+1)
jr nextThing

MaV


Not tested, but should be a viable solution.


    xor a
    cp (ix+datastruct)
    jr nz, dec_low_byte_only
    dec (ix+datastruct+1)
dec_low_byte_only:
    dec (ix+datastruct)
    jr somewhereelse


NOPs when not jumping:
1
5
2
6
6
3
--
23

NOPs when jumping:
1
5
3
6
3
--
18
Black Mesa Transit Announcement System:
"Work safe, work smart. Your future depends on it."

MaV

However, your solution does not look arcane to me, @Cwiiis , if you'd just change the &ff to -1 and let the assembler do its work (I deleted the two unnecessary instructions, since you stated that A does not need to be preserved).
"Add -1 to the first byte, (save it), and if it underflows decrement the second byte" reads just fine.




    ld a, -1
    add (ix+Struct1)
    ld (ix+Struct1), a
    jr c,NextThing
    dec (ix+Struct1+1)
    jr NextThing



exit early:
2
5
5
3
--
15


decrement high byte as well:
2
5
5
2
6
3
--
23

Black Mesa Transit Announcement System:
"Work safe, work smart. Your future depends on it."

m_dr_m

#13
If you are fine with 15 bits values encoded bizarrely, you can test Sign flag after the first dec.
But that has no avantage over the -val encoding.

Cwiiis

Does seem like my last solution was fine really, perhaps I wasn't missing much in the end - in this particular case, struct of arrays doesn't really make sense and I think this is probably good enough. Could maybe save some cycles not using IX, at the expense of readability but at this point, it's definitely premature optimisation.

Thanks for all the discussion, has definitely helped :) I think I'm only a few routines away from being able to demo something, though limited time makes progress pretty slow...

Urusergi


    ld a, -1
    add (ix+Struct1)
    ld (ix+Struct1), a
    jr c,NextThing
    dec (ix+Struct1+1)
    jr NextThing


I really liked the routine, but at the same speed I think 14 is better than 15 bytes, isn't it?  8)


    xor a
    cp (ix+Struct1)
    dec (ix+Struct1)
    jr c,NextThing
    dec (ix+Struct1+1)
    jr NextThing



Cheers.

Prodatron

That looks promising, unfortunately the NOP you win with XOR A compared to LD A,-1 you loose with

dec (ix+Struct1)

instead of

ld (ix+Struct1), a

as second is just a write (5 Nops), while first one (6 Nops) is a read/write.

GRAPHICAL Z80 MULTITASKING OPERATING SYSTEM

Urusergi

Quote from: Prodatron on 00:32, 02 November 21
That looks promising, unfortunately the NOP you win with XOR A compared to LD A,-1 you loose with

dec (ix+Struct1)

instead of

ld (ix+Struct1), a

as second is just a write (5 Nops), while first one (6 Nops) is a read/write.


Yes, I'm aware, that's why I said -at the same speed-  ;D  but in this case we gain one byte  8)

Prodatron

Ops you are right!
So you win probably? your code is the most optimized!
Always cool to see that there is still some more optimization possible with Z80 code

GRAPHICAL Z80 MULTITASKING OPERATING SYSTEM

m_dr_m

If you don't mind more variation in time taken and greater max time:



dec (ix+struct1)
jp p,.next
ld a,(ix+struct1)
inc a
jr nz,.next
dec (ix+struct1+1)
.next



Early exit 1: 9 (half of the time)
Early exit 2: 18 (~half of the time)
Early exit average: 13.5


Still slower than my other suggestions, though!

Cwiiis

Quote from: m_dr_m on 11:31, 02 November 21
If you don't mind more variation in time taken and greater max time:



dec (ix+struct1)
jp p,.next
ld a,(ix+struct1)
inc a
jr nz,.next
dec (ix+struct1+1)
.next



Early exit 1: 9 (half of the time)
Early exit 2: 18 (~half of the time)
Early exit average: 13.5


Still slower than my other suggestions, though!


Ah, this is nice and I hadn't thought of this :) I think I need to prepare for the worst branch to be taken, so this may or may not work, but there may be things I can do to mitigate that, in which case this could be the base for the best solution... Very nice, thanks!

Urusergi

Quote from: Prodatron on 02:24, 02 November 21Ops you are right!So you win probably? your code is the most optimized!Always cool to see that there is still some more optimization possible with Z80 code

Thanks, but it's the most optimized in size, speed isn't the best I would like. @m_dr_m is the winner  8)

Quote from: m_dr_m on 11:31, 02 November 21
dec (ix+struct1)
jp p,.next
ld a,(ix+struct1)
inc a
jr nz,.next
dec (ix+struct1+1)
.next



Early exit 1: 9 (half of the time)
Early exit 2: 18 (~half of the time)
Early exit average: 13.5

Very intelligent! I like it so much. I think your code is the fastest possible, according to the requirements of @Cwiiis

m_dr_m

#22

Edit: That's basically @andycadley 's suggestion!
Edit2: With hl, the add -1 trick is shorter and very slightly faster.

Indexed addressing is so slow that you might consider doing your own indexing.
If you have few structs, you can align them every &100.



; H = Struct MSB
ld l,struct1
ld a,(hl):add -1;ld (hl),a
jr c,.next
[...]



If you have few fields, and no more than 128 structs (or to be more precise, 256/n where n=size of biggest field):



; L = Struct LSB
ld h,struct1
[...]


Dropping the average time to ~11 nops.

Powered by SMFPacks Menu Editor Mod