measure time of assembler code

HAL6128 · 15:05, 18 November 13

Hi.
It's just a question from an assembler newbie (me).
I have following code and want to know which one is faster?? (The code only fills a blank screen with zeros.)

Code Select

 
org &8000

     ld b,80
     ld c,25
l1: ld a,&30
    call &bb5a
    dec c
    jp nz,l1
    ld c,25
    dec b
    jp nz,l1
    ret
 

org &8100
    ld bc,&07d0
l2: ld a,&30
    call &bb5a
    dec bc
    ld a,b
    or c
    jp nz,l2
    ret

and with that BASIC code I try to measure which one is faster:

Code Select

 
10 t1=TIME:CALL &8000:t2=TIME
20 PRINT (t2-t1)/300
30 CALL &BB18
40 MODE 2:t1=TIME:CALL &8100:t2=TIME
50 PRINT (t2-t1)/300
60 CALL &BB18

The first code snippet (from &8000) is 18 bytes long and the second (from &8100) is 15 bytes long. But it seems that the second is slower or at least both are equal. (I assume that the firmware call (&bb5a) is the bottle neck??)
It's just a principle question. Is it wrong to measure code with such a method (from BASIC) or do you count each command by sum-up all cycles times together to say (to feel) if you create fast code?

MaV · 15:51, 18 November 13

I'm going to dismiss the loading operations at the beginning of each program and the RET.

first program:
-----------------

Code Select


ld a, &30  = 2µs
call &bb5a = 5µs
dec c      = 1µs
jp nz, l1  = 3µs

ld c,25   = 2µs
dec b     = 1µs
jp nz, l1 = 3µs

The first block is an inner loop, the second part belong to the outer loop, thus : (((2+5+1+3 * 25) + 2+1+3) *80 -> ((11*25) + 6) * 80 = 22480

Code Select


ld a, &30  = 2µs
call &bb5a = 5µs
dec bc     = 2µs
ld a,b     = 1µs
or c       = 1µs
jp nz, l2  = 3µs

There's only one loop which is repeated 2000 times (= &7d0):

(2+5+2+1+1+3) * 2000 -> 14 * 2000 = 28000

The first program (22480) will be faster than the second one (28000).

Careful! The values are in no way the actual execution times, since the routine at BB5A will take a lot of cycles. But for the above consideration it does not matter since it will be called an equal number of times in both routines.

Urusergi · 15:52, 18 November 13

The first one is better, and can be improved:

Code Select


org &8000

    ld bc,&5019
    ld a,&30
l1: call &bb5a
    dec c
    jr nz,l1
    ld c,25
    djnz l1
    ret

EDIT: Don MaV, are you spanish?

even better:

Code Select

org &8000

    ld a,&30
    ld c,&08
l2: ld b,&FA
l1: call &bb5a
    djnz l1
    dec c
    jr nz,l2
    ret

HAL6128 · 17:06, 18 November 13

Ahh, yes!

@MaV: Thanks for giving me the right perspective (reflecting the inner & outer loops). That helps understandig ...counting hard facts (cycle times) with help of brackets.

@Urusergi: DJNZ! Of course, thanks to you!

"Again what learned..."

Bruce Abbott · 19:41, 18 November 13

Quote from: HAL 6128 on 15:05, 18 November 13
Is it wrong to measure code with such a method (from BASIC) or do you count each command by sum-up all cycles times together to say (to feel) if you create fast code?

Measuring the execution time from BASIC is not wrong, but you have to realize its limitations.

1. The timer is only accurate to 1/300th of a second, so you need to repeat a code snippet many times to get good accuracy.

2. System interrupts will add about 10% to the total time. You can account for system overhead by timing a bare routine without your test code (eg. just a RET), then subtracting it from the results.

Summing all cycles is more accurate, but is only practical for timing small snippets of your own code.

An easier way to get cycle accurate timing is to run your code in WinAPE and use its 't' counter. Set breakpoints at the start and end of the code section you want to time, reset the counter when the first breakpoint is reached, then run to the second breakpoint and read the timer. You should also disable interrupts (either in your code or by unticking 'ints' in WinAPE). Note that some ROM routines re-enable interrupts when they are called.

QuoteI assume that the firmware call (&bb5a) is the bottle neck??

Yes, it is a huge bottle neck, and furthermore it doesn't always take the same amount of time to execute. In your test program the second code snippet is being penalized because the screen has already been filled with '0's (scrolling is slower than filling an empty screen). Add a MODE command to reset the screen before each call, and you will then find that both routines have almost identical timing!

AMSDOS · 04:57, 19 November 13

Quote from: MaV on 15:51, 18 November 13

Code Select Expand
call &bb5a = 5µs

Are all Firmware Calls the same speed or does that relate to the speed of the TXT OUTPUT?

I know with that particular routine it's preserving registers, so using &BB5D (TXT WR CHAR) might improve results, cause you only need to preserve the registers your using (unless your using all of them).

HAL6128 · 10:17, 19 November 13

Quote from: Bruce Abbott on 19:41, 18 November 13
An easier way to get cycle accurate timing is to run your code in WinAPE and use its 't' counter. Set breakpoints at the start and end of the code section you want to time, reset the counter when the first breakpoint is reached, then run to the second breakpoint and read the timer. You should also disable interrupts (either in your code or by unticking 'ints' in WinAPE). Note that some ROM routines re-enable interrupts when they are called.
Yes, it is a huge bottle neck, and furthermore it doesn't always take the same amount of time to execute. In your test program the second code snippet is being penalized because the screen has already been filled with '0's (scrolling is slower than filling an empty screen). Add a MODE command to reset the screen before each call, and you will then find that both routines have almost identical timing!

Thanks for the hint with WinAPE. It's a great help and much more accurate than measuring by BASIC!
I tried to disable the interrupts. But no effect. Maybe a firmware call which re-enable them?
And yes, I forgot to add a MODE command for the first snippet. Originally I wanted to fill a blank screen. Just forgot it.

Quote from: AMSDOS on 04:57, 19 November 13
Are all Firmware Calls the same speed or does that relate to the speed of the TXT OUTPUT?

I know with that particular routine it's preserving registers, so using &BB5D (TXT WR CHAR) might improve results, cause you only need to preserve the registers your using (unless your using all of them).

I also tried the first routine with the firmware call &BB5D (1382 't' cycles). It takes less 't' cycles than with &BB5A (1441), allthough I preserved the register with PUSH/POP.
What MaV tried to explain is that it doesn't matter how much time the firmware routine takes behind as long as I made a comparison between equal calls. So, "5µs" are only for the opcode. Am I right?

arnoldemu · 10:47, 19 November 13

Timing like this is not wrong.

In another thread I did a similar thing but in assembler, avoiding the overheads of the firmware to give a more accurate measurement.

It is more suited to algorithms that take longer than a frame to execute, but which you want to make faster.

If it's within a frame, I use the border change method because you can visually see the time it takes, and when you make it faster the thickness of the bar in the border becomes thinner. Success is making it as thin as possible.

Cycle counting is good too, but I find it is only really worth counting cycles exactly if the code is short - split it into manageable sizes and do it that way. if the code is complex, you'll get worried over single microseconds here and there, when you should be concentrating on the bigger picture.

Often more speed can be achieved by changing the algorithm to something that is faster, then optimise.

e.g. a word sorting routine.

You need to find the best sort method *then* find out how to optimise it.

In your example, the txt output takes a lot of time. Part of the overhead is calling into the rom (it takes a few 100 cycles to page in the rom and get to the function, then a few 100 after the function and exit back to your code). Then the function itself can execute control codes AND it reads the charset from the rom and converts it in real time into mode 1 letters colouring as it does them.

A faster routine avoids going into the firmware and avoids having to reconvert the data to be drawn to the screen.

arnoldemu · 10:48, 19 November 13

TXT WR CHAR is faster, because it's the function that TXT OUTPUT calls when it actually wants to write a character to the screen. You're avoiding the other tests it does, but you still have the overhead of calling into the rom and it's method for drawing characters.

Bruce Abbott · 11:37, 19 November 13

Quote from: AMSDOS on 04:57, 19 November 13
Are all Firmware Calls the same speed or does that relate to the speed of the TXT OUTPUT?

That 5us just for the CALL and RET instructions. Of course the function being called will take much longer.

QuoteI know with that particular routine it's preserving registers, so using &BB5D (TXT WR CHAR) might improve results, cause you only need to preserve the registers your using (unless your using all of them).

Saving and restoring registers doesn't take much time compared to all the other stuff that has to be done. However a lot of that stuff may not be necessary. If you don't need control code parsing or a cursor, and can keep track of row and column numbers yourself, then TXT WRITE CHAR is nearly twice as fast. But to get really fast printing you have to bypass the ROM routines and render directly to the screen.

Quote from: arnoldemuOften more speed can be achieved by changing the algorithm to something that is faster, then optimise.

Yes, and often you can speed things up by only doing exactly what you have to in your particular situation. The ROM text routines are general purpose and handle many different situations, but that makes them much slower.

MaV · 13:14, 19 November 13

Quote from: AMSDOS on 04:57, 19 November 13
Are all Firmware Calls the same speed or does that relate to the speed of the TXT OUTPUT?

The 5µs is the time for the call instruction. The routine at BB5A can easily take a few hundred µs. arnoldemu and the others in the thread explained this quite well.
I sort of "abstracted it away" because the routine is called an equal numbers of times (2000) with the same parameter in both of HAL's versions, thus it takes (almost) constant time to execute at a blank screen. And since it's a ROM routine you can't optimise it anyway, though as others have pointed out there are better calls.

Urusergi did an even better version. The main point here is to decrease the number of cycles for the inner loops.

Cycle counting lends itself well to small code snippets. Anything more complex should be optimised differently (better algorithms).

@Urusergi: I'm not Spanish. But my avatar certainly is, and he quite looks like a "Don".

HAL6128 · 15:54, 19 November 13

Quote from: arnoldemu on 10:47, 19 November 13
A faster routine avoids going into the firmware and avoids having to reconvert the data to be drawn to the screen.

Let's have a go... . I have a different (faster) approach for filling a blank screen with zero...

Hopefull I understand you right by measuring asm-code.

The inner loop (printing a character) takes: (5+2+3+3+3)*8*2000 = 256000 (t or us?)
The outer loop (calculating new screen address) takes: (4+3+4+2+2+1+2+3)*2000 = 42000 (t or us?)
The result is 0,314 seconds (~3 frame per second)

If I measure the same source-code in BASIC t1=TIME:CALL &8000:t2=TIME:PRINT (T2-T1)/300) I come to the result 0,1333 (7,5 frames per second)

I obviously made a mistake somewhere?

@arnoldemu: Did you mean your example on your homepage ("Border colours indicating the interrupt position and how the vsync relates to the position") ? (Unofficial Amstrad WWW Resource). How can I do something like that with the code above? (it's definitly longer than a frame).

Code Select

 
  org &8000
 ld a,&02
 call &bc0e            ;mode 2
 
 di

;start measuring here...

 ld bc,&7D0            ;(3t / 4us) counter> 2000 x "0"
 ld hl,&c000           ;(3t / 4us) start screen address
 ld de,&800            ;(3t / 4us)offset
 ld ix,zero            ;(4t / 6us)character adress of "0"
 
l1:                    ;outer loop
 push hl               ;(4t / 3us)

l2:                    ;inner loop
  ld a,(ix+&00)        ;(5t / 5us)
  ld (hl),a            ;(2t / 3us)
  inc ix               ;(3t / 3us)
  add hl,de            ;(3t / 3us)
  jr nc,l2             ;(3t / 3us) inner loop end
 
 pop hl                ;(3t / 3us)
 inc hl                ;(2t / 2us)
 ld ix,zero            ;(4t / 6us)
 dec bc                ;(2t / 2us)
 ld a,b                ;(1t / 1us)
 or c                  ;(2t / 1us)
 jp nz,l1              ;(3t / 3us) outer loop end

;end measuring here...

 ei
 ret
zero:
 defb &38,&44,&4c,&54,&64,&44,&38,&00

Bruce Abbott · 18:19, 19 November 13

Quote from: HAL 6128 on 15:54, 19 November 13The inner loop (printing a character) takes: (5+2+3+3+3)*8*2000 = 256000 (t or us?)
The outer loop (calculating new screen address) takes: (4+3+4+2+2+1+2+3)*2000 = 42000 (t or us?)
The result is 0,314 seconds (~3 frame per second)

If I measure the same source-code in BASIC t1=TIME:CALL &8000:t2=TIME:PRINT (T2-T1)/300) I come to the result 0,1333 (7,5 frames per second)

You are disabling interrupts, but they are needed to operate the 300Hz timer. With interrupts enabled the execution time is reported as 0.4533 seconds. Removing the loop code gives a time of 0.13 seconds for setup overhead, so the loop code should be taking about 0.323 seconds.

Interrupt overhead must account for some of that time. According to WinAPE the loop code takes ~321590us with ints on, but only 294000us with ints off. Therefore the interrupt overhead is just over 9%.

Unfortunately SCR SET MODE enables interrupts, so I removed it and added a MODE command in BASIC before calling the machine code. Now the execution time with ints enabled is reported as 0.3233 seconds, with a setup overhead of just 0.003 seconds.

Bruce Abbott · 18:56, 19 November 13

Your code can easily be improved.

The most obvious speed up is to make better use of IX. I unrolled the inner loop like this:-

Code Select

 ld a,(ix+0)
 ld (hl),a
 add hl,de
  
 ld a,(ix+1)
 ld (hl),a
 add hl,de

 ld a,(ix+2)
 ld (hl),a
 add hl,de
 ...

 ld a,(ix+7)
 ld (hl),a

This also eliminates the need to reload IX in the outer loop. Execution time dropped to 0.207 seconds.

Next I tried eliminating IX altogether and hard coding the character matrix, like so:-

Code Select

 ld a,&38
 ld (hl),a
 add hl,de
  
 ld a,&44
 ld (hl),a
 add hl,de
... 

 ld a,&00
 ld (hl),a

Now it's taking a mere 0.153 seconds to fill the screen.

Urusergi · 23:00, 19 November 13

Another (tricky) way, that I found out this morning:

Code Select

org &8000
    ld a,2
    call &bc0e
    ld a,&30
    call &bb5d
    ld hl,&c000
    ld d,h
    ld e,l
l1: inc de
    ld bc,&07FF
    ldir
    inc hl
    ld a,h
    or l
    jr nz,l1
    ret

@MaV: Yes! it looks like Don Luis de Góngora was a very sobersided person

AMSDOS · 08:40, 20 November 13

Quote from: Bruce Abbott on 11:37, 19 November 13But to get really fast printing you have to bypass the ROM routines and render directly to the screen.

The Article on the Wiki regarding Fast Sprites has some seriously fast code in there, unfortunately I haven't been able to follow it.

Earlier on I was using a Sprite Routine successfully which uses LDI to draw to screen (using minimal Firmware), it could be possible to generate an graphical alphabet if desirable.

News:

measure time of assembler code