Optimizing my Compiler

Trebmint · 20:34, 06 November 20

As some of you might know I've been writing Quigs which is a IDE for Symbos. I'm now at the stage where I'm coding the optimizer for the assembler output. I'm writing a small app and then taking the output and slowly going through trying to find code that can be made quicker.

The last one has me a little baffled as I've modified code the benchmark Sieve of Eratosthenes into Quigs... However there's some speed I'm missing somewhere as the Benchmarks for the C version are giving around .9 to 1.3 seconds on a Z80 and my code about 1.9secs. These times come from https://github.com/z88dk/z88dk/tree/master/libsrc/_DEVELOPMENT/EXAMPLES/benchmarks/sieve

Code Select


***Quigs code*** 

Var prime(8000):Byte
Var i:Int
Var i_sq:Int
Var k:Int
Var h:Int
Var Size:Int=7999
Var E:String
// End declares
Function Button1.Event:Void()
   i_sq=4
   For i=2 To Size
      If prime(i)=0 Then
         i_sq = i+i
         For k=i_sq To Size Step 0
            prime(k)=1
            k=k+i
         Next
      EndIf
   Next
   Button1.Text.RSet "Done"   
EndF

And this the Z80 it spits out. Can anyone see some speedups?

Code Select


.QFunc_BUTTON1_EVENT
.QLn_17      ;//i_sq=4
      ld      hl,4      ;//hl=number '4' [Qc054]
      ld      (var_wrd_i_sq),hl      ;//Int variable 'var_wrd_i_sq'=expression result [Qc219]
.QLn_18      ;//For i=2 To Size
      ld      hl,(var_wrd_Size)    ;//hl=value of 'var_wrd_Size' [Qc053]
      ld       (QFr788_Top+1),hl       ;/Copy Result into static'var_wrd_i' [Qc303]
      ld      hl,2      ;//hl=number '2' [Qc054]
      .QFr788_Loop:ld       (var_wrd_i),hl ;/Loop Repeat Point [Qc301]
      .QFr788_Top:ld      de,0       ;//Compare value into DE  [Qc300]
      xor      a         ;// [Qc305]
      sbc      hl,de      ;//[Qc306]
      jp      c,QFr788_Start   ;// [Qc307]
      jp      nz,QFr788_Exit   ;// [Qc308]
.QFr788_Start:
.QLn_19      ;//If prime(i)=0 Then
      ld      hl,(var_wrd_i)    ;//hl=value of 'var_wrd_i' [Qc053]
      ld      bc,ar1_byt_prime+2      ;//Byte Array 'prime(x)' data start address [Qc125]
      add      hl,bc      ;//Add Offset + Address [Qc126]
      Ld       a,(hl)            ;// Optimize -13- [Qc1090]
      cp       0         ;//  [Qc1091]
      jp       nz,QIf812_Exit         ;//  [Qc1092]
.QLn_20      ;//i_sq = i+i
      Ld       hl,(var_wrd_i)      ;// Optimize -14- [Qc1093]
      add      hl,hl            ;//  [Qc1094]
      ld      (var_wrd_i_sq),hl      ;//Int variable 'var_wrd_i_sq'=expression result [Qc219]
.QLn_21      ;//For k=i_sq To Size Step 0
      ld      hl,(var_wrd_Size)    ;//hl=value of 'var_wrd_Size' [Qc053]
      ld       (QFr822_Top+1),hl       ;/Copy Result into static'var_wrd_k' [Qc303]
      ld      hl,(var_wrd_i_sq)    ;//hl=value of 'var_wrd_i_sq' [Qc053]
      .QFr822_Loop:ld       (var_wrd_k),hl ;/Loop Repeat Point [Qc301]
      .QFr822_Top:ld      de,0       ;//Compare value into DE  [Qc300]
      xor      a         ;// [Qc305]
      sbc      hl,de      ;//[Qc306]
      jp      c,QFr822_Start   ;// [Qc307]
      jp      nz,QFr822_Exit   ;// [Qc308]
.QFr822_Start:
.QLn_22      ;//prime(k)=1
      ld      hl,(var_wrd_k)    ;//hl=value of 'var_wrd_k' [Qc053]
      ld      bc,ar1_byt_prime+2      ;//Byte Array 'prime(x)' data start address [Qc125]
      add      hl,bc         ;// Optimize -12- [Qc1084]
      Ld       a,1      ;//  [Qc1085]
      Ld       (hl),a         ;//  [Qc1086]
.QLn_23      ;//k=k+i
      ld      hl,(var_wrd_k)    ;//hl=value of 'var_wrd_k' [Qc053]
      ex      de,hl       ;// [Qc269]
      ld      hl,(var_wrd_i)    ;//hl=value of 'var_wrd_i' [Qc053]
      add      hl,de      ;//Add Integer HL & DE Values [Qc058]
      ld      (var_wrd_k),hl      ;//Int variable 'var_wrd_k'=expression result [Qc219]
.QLn_24      ;//Next
      ld      hl,(var_wrd_k) ;// [Qc330]
      jp      QFr822_Loop      ;// [Qc340]
.QFr822_Exit:
.QLn_25      ;//EndIf
.QIf812_Exit:
.QLn_26      ;//Next
      ld      hl,(var_wrd_i) ;// [Qc330]
      inc      hl      ;// [Qc331]
      jp      QFr788_Loop      ;// [Qc340]
.QFr788_Exit:
.QLn_27      ;//Button1.Text.RSet "Done"
      ld      hl,Button1      ;//hl=address of 'Button1' [Qc223]
      push      hl      ;//Store result [Qc005]
      ld      hl,(Button1_Text)      ;//hl=value of 'Button1_Text' [Qc222]
      ex      de,hl       ;// [Qc232]
      ld      hl,directstr_1+2      ;//hl points static string [Qc055]
      call   Quig_LetString_Refresh            ;//Call 'Button1.Text.RSet' [Qc234]
.QLn_28      ;//EndF
      ret      ;// [Qc262]

AMSDOS · 21:52, 06 November 20

Some of those 'jp's are right on the doorstep of where they need to get to, so I guess they could be optimized to 'jr'?
There's also a 'cp 0' which could be replaced with 'and a' or 'or a' I guess?

Trebmint · 22:01, 06 November 20

Quote from: AMSDOS on 21:52, 06 November 20
Some of those 'jp's are right on the doorstep of where they need to get to, so I guess they could be optimized to 'jr'?
There's also a 'cp 0' which could be replaced with 'and a' or 'or a' I guess?

But a Jr is slower 12/7 as compared to 10

AMSDOS · 23:49, 06 November 20

Quote from: Trebmint on 22:01, 06 November 20
But a Jr is slower 12/7 as compared to 10

I don't know because I don't have the means to test that.

I have a book that states JP uses 10 clock cycles and that a JR uses 7 if condition isn't met otherwise it's 12 cycles.
What I'm unsure about is someone here argued that those clock cycles don't work like that on the CPC, though I could only find the discussion relating to between LDIR and LDI instructions with LDIR using 24 cycles instead of 21 and LDI being 20 instead of 16 and then I guess I planted the idea that instructions worked in cycles of 4, though that wouldn't be true with LDI since 16 goes into 4, 4 times.

Sadly I've worked real hard on my own language to the point if I can use an JR <condition>, JR <condition> is used instead of JP <condition> solely on the word of someone suggesting clock cycles work in multiples of 4, but I guess if JP became 12 cycles, would a True JR become 16?

Urusergi · 23:55, 06 November 20

Quote from: Trebmint on 22:01, 06 November 20
But a Jr is slower 12/7 as compared to 10

That is true on a spectrum, but on a cpc:
JR: 12/8 (3/2us)
JP: 12/12 (3/3us)

Trebmint · 23:58, 06 November 20

Quote from: Urusergi on 23:55, 06 November 20
That is true on a spectrum, but on a cpc:
JR: 12/8 (3/2us)
JP: 12/12 (3/3us)

Cool JR wins then. Dont tell Bobby Ewing

shaymanjohn · 08:13, 07 November 20

Quote from: Urusergi on 23:55, 06 November 20
That is true on a spectrum, but on a cpc:
JR: 12/8 (3/2us)
JP: 12/12 (3/3us)

Is this true? Did not know it was different on a CPC. Is there a table of Z80 timings for the CPC somewhere? Thanks!

AMSDOS · 08:51, 07 November 20

The "Master Machine Code on your Amstrad CPC464 & 664" have a Table of Instructions and their timings Appendix 1 of the book, though they made a note 'the clock cycles are given for comparision purposes only as the clock speed is affected by the video generating process.'

trocoloco · 09:05, 07 November 20

For CPC timings this can help

https://wiki.octoate.de/lib/exe/fetch.php/amstradcpc:z80_cpc_timings_cheat_sheet.20131019.pdf

SRS · 22:02, 07 November 20

Is it native on Z80 ? If not, you could maybe convert to C first and use existing crosscompilers which optimize a lot

- or have a look at their optimizers

for the piece of code MDL says:
e:\cpc\experimentefuerwiki>java -jar mdl.jar org_quig_sieve.asm -po -cpu z80cpc
INFO: Pattern-based optimization in org_quig_sieve.asm#21: Replace cp 0 with or a (1 bytes, 1 nops saved)
INFO: PatternBasedOptimizer: 1 patterns applied, 1 bytes, 1 nops saved.
And maybe:

Code Select


;Instead of ld de,767 or a       ;reset carry so sbc works as a sub sbc hl,de ;try this ld de,-767 ;negation of de add hl,de ; -> 2 bytes and 8 T-states !More links ici : http://www.cpcwiki.eu/index.php/A_little_optimization_for_Z80_Assembler

Trebmint · 23:22, 07 November 20

Quote from: SRS on 22:02, 07 November 20
Is it native on Z80 ? If not, you could maybe convert to C first and use existing crosscompilers which optimize a lot - or have a look at their optimizers

for the piece of code MDL says:
e:\cpc\experimentefuerwiki>java -jar mdl.jar org_quig_sieve.asm -po -cpu z80cpc
INFO: Pattern-based optimization in org_quig_sieve.asm#21: Replace cp 0 with or a (1 bytes, 1 nops saved)
INFO: PatternBasedOptimizer: 1 patterns applied, 1 bytes, 1 nops saved.
And maybe:

Code Select Expand
ld de,0 ;//Compare value into DE [Qc300] xor a ;// [Qc305] sbc hl,de ;//[Qc306] to ld de,-0 ;negation of de add hl,de
More links ici : http://www.cpcwiki.eu/index.php/A_little_optimization_for_Z80_Assembler

Of course ld de,- obvious when you think about it. Cheers

SRS · 16:37, 08 November 20

Sorry, my fault - copypaste at night. I just corrected the source: https://wikiti.brandonw.net/index.php?title=Z80_Optimization#Optimize_size_and_speed

;Instead of ld de,0 or a       ;reset carry so sbc works as a sub sbc hl,de ;try this ld de,-0 ;negation of de (if not 0  ) add hl,de ; -> 2 bytes and 8 T-states !

zhulien · 15:05, 13 November 20

I also like writing compilers, but what i can say, is I hate the code that my compilers write when compared to what I can code in native assembler, and I am not even a good assembler coder in comparison to a lot of other talent around here.

Having said this, some of my thoughts which can help (not related to your specific tight-looped sieve example though).

Provide an #asm and #endasm directive so you can put inline assembler as you like. After all, are we optimising for speed or for memory usage? I think for most part focus on speed, if someone wants memory usage - don't use a high level language - or buy more RAM (symbos supports 2mb already).

Provide an optimization for the following 3 loop constructs: counting forward and backward as well x number of times - since you are coding your own language, maybe an easy way is to this is special construct keywords or variants of a standard.

Provide the option of passing function parameters in direct nominated registers, e.g. perhaps you could call them reg_a, reg_b, reg_c, reg_d, reg_e, reg_f, reg_bc, reg_de, reg_hl, reg_ix, reg_iy as well as a return in a nominated register - and lose recursion ability at the same time - give the programmer the choice. This would allow a mixture of high-level code with fairly low level code in the one understandable application.

Provide expression handling code that is based on your usual infix to reverse polish notation (for complex expressions) but also provide a direct infix translation option for expressions which of course is not truely mathematically correct but ends up with register-based expression calculations rather than the generic stack approach. You could also have a couple of optimisations for myvar++, myvar--, myvar *= 2, myvar /= 2, as well as some common multiples such as 4, 8, 16, 32, 64, 128, 256.

Provide a mechanism for switch statements in an optimised way - now this to be optimised has different methods that suit different situations, eg:

switch (myexpression)
{
case 'a': { blah; break; }
case myexpression2: { blah; break; }
}

if you have 50 cases, you don't really want the 50th one to be reached 50 times slower than the first in all cases, so the switch needs a mechanism to indicate... are we jumping based on sets of expressions? (sometimes you need to evaluate all one at a time), are we jumping based on a set of literals (can be a calculated table jumpblock in which the 1st and 1000th jump takes the same timeframe. Some dialects let you treat the switch bit as an expression so you are comparing 2 expression results where-as some dialects only let you put a literal or single variable in the switch.

If you have classes, allow for classes to be single instance - instantiated once for code readability mainly, rather than dynamically allocated on the heap.

For memory allocations, are you allowing for far addresses (in any part of the symbos 2mb space) or only within the application's local 64kb block? If far, I am working on this issue currently to make it perform fast for both allocations as well as freeing - using a RAM allocation table (RAT) instead of a linked list of memory blocks. There are pros and cons of both, but we are talking about a slow z80 with megabytes of RAM (potentially), not a fast z80 and not a tiny RAM (potentially). So, byte aligning some allocations using a RAM allocation table (similar to a disc file allocation table) quickly allows you to allocate/deallocate blocks of memory without having to traverse 2mb of linked lists... for a program that explicitly allocates/deallocates RAM, you can tell the programmer, do that intelligently so you don't waste too much CPU - but... if the compiler is doing this often because of object creation/destruction, then the RAT method is lightening fast compared to a linked list.

For memory copies which are going to be necessary on a CPC with lots of RAM, sometimes you are copying data from one bank into another - if you put a 256 byte buffer in the RAM that is common to the banks (i.e. bank 7?) then even if you are doing a block copy of 16kb via the 256 byte buffer 256 bytes at a time, you will be immensely fast compared to switching banks twice for every byte to copy (which is crazy). A 256 byte buffer would only mean around 128 bank switches max (64 x 2) to copy 16kb vs 32768 bank switches if bank switching twice for every byte. This is more algorithm related than code as a result of the coder, but... if you are doing block copies hopefully your compiler libraries are doing things somewhat intelligently.

News:

Optimizing my Compiler