Making C compiler aware of CPC Firmware calling convention ?

cpcitor · 12:18, 09 April 14

Thanks to @Alcoholics Anonymous for the insightful comments on thread linked below.

Quote from: Alcoholics Anonymous on 07:25, 09 April 14
z88dk has the traditional small C linkage but also introduces two more: FASTCALL which allows a single parameter to be passed by register and CALLEE which allows the called function to clear up the stack. When using a standard C linkage (whether R->L or L->R parameter order), those extra bytes required after the call to the function to clear up the stack can add up to hundreds of bytes in a large program. These extra alternatives can shave a lot of bytes off code size, particularly for z88dk as best practice involves making lots of calls to library routines.

Regarding other calling conventions, I had noticed z88dk's FASTCALL and CALLEE and yes it would be nice to have them on SDCC. See for example http://www.z88dk.org/wiki/doku.php?id=optimization .

Yet another calling convention is used by the CPC firmware : just use registers as you want, ensure (or not) that some have their value unchanged on return and make that part of the contract.

That would be complicated for the compiler to support (implementing it requires knowing the compiler internals), and it could not be used with pointers to functions: a single calling convention must be used for all calls through function pointers, as observed in http://www.z88dk.org/wiki/doku.php?id=optimization (z88dk gets away with it by hand-writing several versions of the library functions). Yet it would allow much tighter optimization of generated code.

In practice instead of writing e.g. wrappers to firmware routines like this:

.module fw_gra_line_absolute

_fw_gra_line_absolute::
        ld      hl,#2
        add     hl,sp
        ld      a,(hl)
        ld      e,a

        ld      hl,#3
        add     hl,sp
        ld      a,(hl)
        ld      d,a

        ld      hl,#4
        add     hl,sp
        ld      a,(hl)
        ld      c,a

        ld      hl,#5
        add     hl,sp
        ld      a,(hl)
        ld      b,a

        ld      h,b
        ld      l,c
        call    0xBBF6  ; GRA LINE ABSOLUTE
        ret

with user code including a header file that contains this:

void fw_gra_line_absolute(int x, int y);

we would not need a wrapper at all and just have user code including a header file that contains this:

void __MODS_AF__ __MODS_BC__ __MODS_DE__ __MODS_HL__ fw_gra_line_absolute(__ARG_DE__ int x, __ARG_HL__ int y);

... and compiler would figure out the rest. Instead of generating that pesky long code that can't be optimized properly with peephole rules, it could maybe just optimize called code as it optimizes local (or inlined) code, that is: "The compiler has a table stating for each CPU instruction what are its input, output and other changes. Now this function call is like a sort-of big instruction that happens to have inputs, outputs and other changes. Let's apply standard register allocation algorithms to that."
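To make the "call as a big instruction" idea concrete, here is a minimal sketch in C of the bookkeeping it implies. It is not taken from any real compiler: the `call_contract` type, the `must_save` helper and the `hypothetical_keeps_bc` entry are made-up names for illustration. Only the `gra_line_absolute` entry mirrors the header line proposed above (x in DE, y in HL, AF/BC/DE/HL modified).

```c
#include <stdint.h>

/* One bit per 8-bit Z80 register; pairs are the OR of their halves. */
enum {
    REG_A = 1 << 0, REG_F = 1 << 1,
    REG_B = 1 << 2, REG_C = 1 << 3,
    REG_D = 1 << 4, REG_E = 1 << 5,
    REG_H = 1 << 6, REG_L = 1 << 7,
    REG_AF = REG_A | REG_F, REG_BC = REG_B | REG_C,
    REG_DE = REG_D | REG_E, REG_HL = REG_H | REG_L,
    REG_ALL = REG_AF | REG_BC | REG_DE | REG_HL
};

/* Hypothetical contract record: what a call reads and what it destroys. */
typedef struct {
    uint8_t inputs;   /* registers that must hold arguments on entry */
    uint8_t clobbers; /* registers whose value is lost on return */
} call_contract;

/* Contract from the proposed header line: x in DE, y in HL,
   AF, BC, DE and HL all modified. */
static const call_contract gra_line_absolute = { REG_DE | REG_HL, REG_ALL };

/* Made-up example of a routine that leaves BC alone (assumption, not a
   statement about any real firmware entry). */
static const call_contract hypothetical_keeps_bc =
    { REG_A, REG_AF | REG_DE | REG_HL };

/* Registers that are live after the call AND destroyed by it: only these
   need a PUSH/POP (or a spill) around the call. */
static uint8_t must_save(const call_contract *c, uint8_t live_after)
{
    return (uint8_t)(c->clobbers & live_after);
}
```

With a loop counter live in BC, `must_save(&gra_line_absolute, REG_BC)` reports BC, while the same query against `hypothetical_keeps_bc` reports nothing to save. That difference is the information the annotations would hand to the register allocator.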

AMSDOS · 06:40, 10 April 14

On my website I made a little Turbo Pascal program which takes some machine code from a constant array and pokes it into memory. It only does one particular thing (change the screen mode), but I thought the approach was direct enough to generalise: values could be fed in through variables, poked into memory, and then executed with a firmware call.

The setback is probably the setup. In Hisoft Pascal 4T it's possible to use the firmware without writing any assembly. What Hisoft did in that case is make it possible to write into the registers (either as pairs or as single registers), and then the firmware address can be called with the appropriate values in the registers. If the firmware routine returns a value, that can also be read back from the register the routine uses. It's quite clever that Hisoft did things this way, because they wanted to give the programmer enough space to write their program instead of filling it up with firmware wrappers which might not even have been used.

Alcoholics Anonymous · 08:08, 11 April 14

Quote from: cpcitor on 12:18, 09 April 14
In practice instead of writing e.g. wrappers to firmware routines like this:

_fw_gra_line_absolute:
        ld      hl,#2
        add     hl,sp
        ld      a,(hl)
        ld      e,a

        ld      hl,#3
        add     hl,sp
        ld      a,(hl)
        ld      d,a

        ld      hl,#4
        add     hl,sp
        ld      a,(hl)
        ld      c,a

        ld      hl,#5
        add     hl,sp
        ld      a,(hl)
        ld      b,a

        ld      h,b
        ld      l,c
        call    0xBBF6  ; GRA LINE ABSOLUTE
        ret

You can do this another way:


_fw_gra_line_absolute:

pop af
pop de   ; de = x
pop hl    ; hl = y

push hl
push de
push af

jp 0xbbf6

Quote
we would not need a wrapper at all and just have user code including a header file that contains this:

void __MODS_AF__ __MODS_BC__ __MODS_DE__ __MODS_HL__ fw_gra_line_absolute(__ARG_DE__ int x, __ARG_HL__ int y);

... and compiler would figure out the rest. Instead of generating that pesky long code that can't be optimized properly with peephole rules, it could maybe just optimize called code as it optimizes local (or inlined) code, that is: "The compiler has a table stating for each CPU instruction what are its input, output and other changes. Now this function call is like a sort-of big instruction that happens to have inputs, outputs and other changes. Let's apply standard register allocation algorithms to that."

The problem with that is you are placing a lot of register pressure on the compiler. If it must have x in de, when it puts x there, it can no longer use de to get y into hl. Getting y may involve a computation which cannot be done without the help of de, so de gets pushed on the stack and recovered after hl is loaded with y. With too many constraints, it may have been better just to push the parameters on the stack as they are figured out and let the target routine put them in the right registers. That's the idea behind callee:


_fw_gra_line_absolute:

pop hl       ; hl = return address
pop de       ; de = x
ex (sp),hl   ; hl = y

jp 0xbbf6

The parameters are collected off the stack and the caller doesn't have to remove them. That's the minimum amount of fuss when nothing is passed by registers.
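The stack effect of that entry sequence can be checked with a small model written in C. This is only a sketch: `z80_model` and its helpers are invented for illustration and model nothing beyond a word-addressed stack plus the HL and DE pairs. It assumes the caller pushed y, then x, before the CALL pushed the return address, which is the order implied by the pops above.

```c
#include <stddef.h>
#include <stdint.h>

/* Tiny model of the Z80 stack: 16-bit words, sp indexes the top word. */
typedef struct {
    uint16_t mem[16];
    size_t sp;        /* grows downwards, like the real SP */
    uint16_t hl, de;
} z80_model;

static void model_init(z80_model *m) { m->sp = 16; m->hl = 0; m->de = 0; }
static void push(z80_model *m, uint16_t v) { m->mem[--m->sp] = v; }
static uint16_t pop(z80_model *m) { return m->mem[m->sp++]; }

/* ex (sp),hl : swap HL with the word on top of the stack. */
static void ex_sp_hl(z80_model *m)
{
    uint16_t t = m->mem[m->sp];
    m->mem[m->sp] = m->hl;
    m->hl = t;
}

/* Caller side: push y, push x, then CALL pushes the return address. */
static void caller_pushes(z80_model *m, uint16_t x, uint16_t y, uint16_t ret)
{
    push(m, y);
    push(m, x);
    push(m, ret);
}

/* Callee entry sequence from the post. */
static void callee_entry(z80_model *m)
{
    m->hl = pop(m);   /* pop hl      ; hl = return address */
    m->de = pop(m);   /* pop de      ; de = x */
    ex_sp_hl(m);      /* ex (sp),hl  ; hl = y, return address on stack */
}
```

After `callee_entry` the only word left on the stack is the return address, so the firmware's own RET goes straight back to the caller with nothing left to clean up.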

The next step up is passing one parameter in register.

Your method is specifying all parameters in registers, but that pressure on the compiler thing may make things worse rather than better.

Communication of altered registers back to the compiler may be something in the long run.

Alcoholics Anonymous · 08:10, 11 April 14

Quote from: AMSDOS on 06:40, 10 April 14
On my website I made a little Turbo Pascal program which takes some machine code from a constant array and pokes it into memory. It only does one particular thing (change the screen mode), but I thought the approach was direct enough to generalise: values could be fed in through variables, poked into memory, and then executed with a firmware call.

That's another way but reading/writing to a static block of memory like that is also slower than pushing/popping via the stack. So unless the firmware reads from a block of data, it is better to push/pop via the stack to load registers.

AMSDOS · 10:34, 11 April 14

Quote from: Alcoholics Anonymous on 08:10, 11 April 14
That's another way but reading/writing to a static block of memory like that is also slower than pushing/popping via the stack. So unless the firmware reads from a block of data, it is better to push/pop via the stack to load registers.

That's where I thought there may be some lag in order to set it up.

In the Small-C cpciolib.c library there's this routine which starts off a bit like your example:


oscall(adr,regpack)
int adr;
int *regpack; /* af,hl,de,bc */
{
#asm
 pop bc  ; ret
 pop de  ; regs
 pop hl  ; adr
 push hl
 push de
 push bc

 ld (031h),hl  ; rst 30h: user
 ld a,0c3h
 ld (030h),a

 push de
 ex (sp),ix

 ld l,(ix+0)
 ld h,(ix+1)
 push hl
 pop af

 ld l,(ix+2)
 ld h,(ix+3)
 ld e,(ix+4)
 ld d,(ix+5)
 ld c,(ix+6)
 ld b,(ix+7)

 ex (sp),ix

 rst 30h      ; execute os-call

 ex (sp),ix

 ld (ix+2),l
 ld (ix+3),h
 push af
 pop  hl
 ld (ix+0),l
 ld (ix+1),h
 ld (ix+4),e
 ld (ix+5),d
 ld (ix+6),c
 ld (ix+7),b

 pop de
 JP  CCSXT   ;move A to HL & sign extend

#endasm
}

and then the firmware routines can be setup after that:


mode(n) int n;
{
 regs[0]=n << 8;
 oscall(SET_MODE_SCR,regs);
}

SET_MODE_SCR is defined globally in cpciolib.h as the decimal value of &BC0E (that particular Small-C doesn't support hexadecimal, or perhaps it was the macro-assembler it used that didn't accept hexadecimal numbers).

Anyway in that situation a Draw Routine would look like this:


draw(x,y) int x,y;
{
 regs[1]=y;
 regs[2]=x;
 oscall(LINE_ABS_GRA,regs);
}
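The register pack layout used by these routines can be spelled out in portable C. This is a sketch for reading on a host machine, not part of cpciolib: the `REGPACK_*` names and the helper functions are invented here, and `uint16_t` stands in for Small-C's 16-bit `int`. The slot order af, hl, de, bc comes from the comment in oscall, and A is the high byte of the AF slot because oscall loads the slot into HL and then does `push hl` / `pop af`.

```c
#include <stdint.h>

/* Slot order taken from the oscall comment: af, hl, de, bc. */
enum { REGPACK_AF, REGPACK_HL, REGPACK_DE, REGPACK_BC, REGPACK_SIZE };

/* A is the high byte of the AF slot, which is why mode() writes n << 8. */
static void regpack_set_a(uint16_t *regs, uint8_t a)
{
    regs[REGPACK_AF] =
        (uint16_t)(((unsigned)a << 8) | (regs[REGPACK_AF] & 0x00FFu));
}

static uint8_t regpack_get_a(const uint16_t *regs)
{
    return (uint8_t)(regs[REGPACK_AF] >> 8);
}

/* Same assignments as draw(): x goes to DE (slot 2), y to HL (slot 1). */
static void regpack_line_absolute(uint16_t *regs, uint16_t x, uint16_t y)
{
    regs[REGPACK_HL] = y;
    regs[REGPACK_DE] = x;
}
```

Naming the slots makes it harder to swap x and y by accident, at no run-time cost.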

cpcitor · 11:13, 11 April 14

Thank you again @Alcoholics Anonymous, your writings are always very interesting.
My remarks below.

Quote from: Alcoholics Anonymous on 08:08, 11 April 14
You can do this another way:

_fw_gra_line_absolute:

pop af
pop de   ; de = x
pop hl    ; hl = y

push hl
push de
push af

jp 0xbbf6

Oh thanks, you're right.
I had some intuition at the time that something better could be done but I've just taken and expanded blindly that pattern from Arnoldemu's http://www.cpctech.org.uk/download/contiki12.zip e.g. contiki-cpc/arch/conio.s:

Quote
_gotox::
      ld      hl,#2
      add      hl,sp
      ld      a,(hl)
      inc      a
      call   0xBB6F   ; TXT SET COLUMN
      ret

But that was for one 8-bit argument. For two 16-bit arguments, your version is much more elegant than mine. I might update cpc-dev-tool-chain on GitHub (especially cpclib/cfwi/src); pull requests are welcome any time.

Does offering more choice to the compiler increase register pressure ?

Quote from: Alcoholics Anonymous on 08:08, 11 April 14
The problem with that is you are placing a lot of register pressure on the compiler. If it must have x in de, when it puts x there, it can no longer use de to get y into hl. Getting y may involve a computation which cannot be done without the help of de, so de gets pushed on the stack and recovered after hl is loaded with y.

(snipped interesting example of the different refinement steps in linkage convention -- I like a lot the CALLEE where you just pop the arguments and don't restore the stack)

Well, doesn't that register pressure you mention come from the firmware style of passing parameters via registers ? It's not new.

My proposition is like promoting an ASM-level practice to something that the compiler deals with.

It is true that this requires the compiler to deal with that (pre-existing) pressure. The hope is that a compiler running on a fast machine will handle it better than a human writing pressure-relieving but slow boilerplate code for it.

Can offering more choice to the compiler result in less good generated code ?

Quote from: Alcoholics Anonymous on 08:08, 11 April 14
Your method is specifying all parameters in registers, but that pressure on the compiler thing may make things worse rather than better.

The compiler can still write such boilerplate wrapper code automatically if it can't find anything better, falling back to some more regular linkage.

In other words, it has a proven, even if not optimal, way to relieve that (again, pre-existing) pressure.

So, is the argument "that puts more pressure on the compiler" fundamentally relevant ? (In practice it is probably relevant if compilers are not ready yet.)

Promoting an ASM-level practice to a new C-level linkage

Let's consider when the compiler compiles a C function.
Let's call "contract" the input,output,side effect on registers and stack.

"FASTCALL for a 1-argument function", and for any n, "small-C for an n-argument function", "CALLEE for an n-argument function" and the like are all contracts.

z88dk shows that it is possible for the compiler to deal with different contracts (cf. optimization [z88dk]), with some caveats.

Now the idea would be, to have more possible contracts for the compiler. That's a new linkage, let's call it CUSTOM.

The compiler could figure out from the context a particular, better suited, contract. That would get most of the benefits of inlining, as SDCC does, and most of the benefits of z88dk's different linkages, but automatically tuned (in the spirit of Interprocedural optimization).

Concrete benefits

For example, for a small short function called from many places or inside loops taking not too many parameters, the function would be compiled "bare" taking all registers, no stack.

For a big function, called seldom, the smallC (or callee) linkage would automatically be chosen.
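A rough sketch in C of how such an automatic choice might look follows. Everything in it is hypothetical: the `func_profile` fields, the three thresholds and the rule order are invented for illustration and have not been measured on SDCC or z88dk. The one rule grounded in the discussion is that a function whose address is taken falls back to a single regular linkage, since all calls through pointers must share one convention.

```c
/* Hypothetical linkage choices for the proposed CUSTOM scheme. */
typedef enum { LINK_REGISTERS, LINK_CALLEE, LINK_STACK } linkage;

/* Hypothetical facts a compiler could gather about one function. */
typedef struct {
    unsigned code_bytes;   /* estimated size of the function body */
    unsigned call_sites;   /* number of places it is called from */
    unsigned param_bytes;  /* total size of its parameters */
    int called_in_loop;    /* nonzero if some call sits inside a loop */
    int address_taken;     /* nonzero if used via pointer-to-function */
} func_profile;

/* Invented thresholds, for illustration only. */
enum { SMALL_BODY_BYTES = 64, HOT_CALL_SITES = 4, REG_PARAM_BYTES = 6 };

static linkage choose_linkage(const func_profile *f)
{
    int hot = f->called_in_loop || f->call_sites >= HOT_CALL_SITES;

    /* Calls through pointers need one shared convention. */
    if (f->address_taken)
        return LINK_STACK;

    /* Small, hot, few parameters: compile "bare", all in registers. */
    if (hot && f->code_bytes <= SMALL_BODY_BYTES
            && f->param_bytes <= REG_PARAM_BYTES)
        return LINK_REGISTERS;

    /* Everything else: let the callee collect and clean up the stack. */
    return LINK_CALLEE;
}
```

A real implementation would also need whole-program visibility, because every caller of a function has to agree on the contract that was picked for it.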

Comparison with existing

Can we think of that like a different strategy besides SDCC (lots of efforts including register allocation, inlining functions for speed) and z88dk (simpler dumber compiler backed with much hand-written library code) ?

That new strategy, modeled after the "passing by registers" CPC firmware style appears to me like it could potentially do better than the other two.

That's ambitious

This has to come with a price, no wonder that it is harder to implement (if it was easy, that would mean that other compilers have been wasting effort for long ;-).

Quote from: Alcoholics Anonymous on 08:08, 11 April 14
Communication of altered registers back to the compiler may be something in the long run.

Do you mean no compiler does it now ?
Do you mean that no compiler fits the scheme I've written ? "The compiler has a table stating for each CPU instruction what are its input, output and other changes. Now this function call is like a sort-of big instruction that happens to have inputs, outputs and other changes. Let's apply standard register allocation algorithms to that."

Question: to me, the CPC firmware style of making specific contracts (input, output, side effects) for each routine was standard practice for any ASM project on 8- and 16-bit CPUs. Do you confirm that ?

Wow that was long (my posts are often long, they come after much thinking). Thank you for your attention.

Alcoholics Anonymous · 23:40, 13 April 14

Quote from: AMSDOS on 10:34, 11 April 14

In the Small-C cpciolib.c library there's this routine which starts off a bit like your example:

oscall(adr,regpack)
int adr;
int *regpack; /* af,hl,de,bc */
{
#asm
 pop bc  ; ret
 pop de  ; regs
 pop hl  ; adr
 push hl
 push de
 push bc

 ld (031h),hl  ; rst 30h: user
 ld a,0c3h
 ld (030h),a

 push de
 ex (sp),ix

 ld l,(ix+0)
 ld h,(ix+1)
 push hl
 pop af

 ld l,(ix+2)
 ld h,(ix+3)
 ld e,(ix+4)
 ld d,(ix+5)
 ld c,(ix+6)
 ld b,(ix+7)

 ex (sp),ix

 rst 30h      ; execute os-call

 ex (sp),ix

 ld (ix+2),l
 ld (ix+3),h
 push af
 pop  hl
 ld (ix+0),l
 ld (ix+1),h
 ld (ix+4),e
 ld (ix+5),d
 ld (ix+6),c
 ld (ix+7),b

 pop de
 JP  CCSXT   ;move A to HL & sign extend

#endasm
}

That's not a bad thing to do, especially if you are concerned about code size. One function can serve any OS call, rather than having to supply one function per OS call. The overhead is high though and will show up if the OS function is lightweight. Eg, something like changing pen colour might be a few lines of asm but you're incurring these slow loads and stores using ix as overhead.
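To put rough numbers on that overhead, the sketch below adds up nominal Z80 T-states for moving register pairs either through `(ix+d)` accesses or through the stack. The per-instruction figures are the standard Z80 ones (POP 10, PUSH 11, `ld r,(ix+d)` and `ld (ix+d),r` 19 each); the function names are invented for illustration. On a real CPC instruction timings are stretched, so treat the results as relative, not exact.

```c
/* Nominal Z80 T-states per instruction (not CPC-adjusted). */
enum { T_POP = 10, T_PUSH = 11, T_LD_R_IXD = 19, T_LD_IXD_R = 19 };

/* Loading N register pairs from a block via ix: two 8-bit loads each. */
static unsigned cost_load_via_ix(unsigned pairs)
{
    return pairs * 2u * T_LD_R_IXD;
}

/* Storing N register pairs back to the block via ix. */
static unsigned cost_store_via_ix(unsigned pairs)
{
    return pairs * 2u * T_LD_IXD_R;
}

/* Passing N register pairs through the stack: one push, one pop each. */
static unsigned cost_via_stack(unsigned pairs)
{
    return pairs * (unsigned)(T_PUSH + T_POP);
}
```

For the four pairs handled by oscall, the ix loads and stores alone come to 304 T-states, against 84 for a push and a pop per pair, before counting the rest of the routine.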

I can show you one thing we are doing, which is to create a structure on the stack and then pass the address of the top of the stack to another function for consumption. Instead of writes to static memory and the associated "ld r,(ix+d)" type instructions to collect into registers, we use push in the caller and (maybe) pop in the called function to load registers. The Amstrad firmware is fixed, so you could not do this with it now, but for information's sake:


   ; create a fake FILE structure on the stack
   
   ld hl,0
   push hl
   ld hl,$4000 + (vsprintf_outchar / 256)
   push hl
   ld hl,195 + ((vsprintf_outchar % 256) * 256)
   push hl
   
   ld ix,0
   add ix,sp                   ; ix = vsprintf_file *

That is an excerpt from sprintf where a FILE structure is created on the stack before a call to vfprintf is made. The FILE structure ensures vfprintf output passes through sprintf's output routine. ix is used in stdio as FILE*.
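The three pushed words are easier to read as the bytes they leave in memory. The C sketch below recomputes them on a host; `fake_file_bytes` is an invented helper and the field meanings are not documented in this thread, so only the arithmetic is claimed here. The first three bytes come out as 0xC3 followed by the low and high bytes of the output routine's address, that is, a `jp vsprintf_outchar` instruction. The meaning of the 0x40 byte and of the trailing zero word is not stated in the post.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes found at ix+0 .. ix+5 after the three pushes in the excerpt. */
static void fake_file_bytes(uint16_t outchar_addr, uint8_t out[6])
{
    uint16_t words[3];
    size_t i;

    /* The last word pushed ends up at the lowest address, so the words
       are listed here in reverse push order. */
    words[0] = (uint16_t)(195 + (outchar_addr % 256) * 256);
    words[1] = (uint16_t)(0x4000 + outchar_addr / 256);
    words[2] = 0;

    for (i = 0; i < 3; i++) {
        out[2 * i]     = (uint8_t)(words[i] & 0xFF); /* low byte first */
        out[2 * i + 1] = (uint8_t)(words[i] >> 8);
    }
}
```

For an output routine at 0x1234 the bytes are C3 34 12 40 00 00.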

This sort of thing happens in many locations in the library where static globals are replaced by temporary objects on the stack. This can also be used as a way to communicate register values without passing through a static structure -- push values on the stack, call function, called function pops off the value and places return value on stack in right place. This is a kind of dynamic callee interface.


oscall(address)

oscall:

   ; ix = address

   pop hl   ; hl = return address
   pop af   ; af = oscall af
   pop bc   ; bc = oscall bc
   pop de    ; de = oscall de
   ex (sp),hl   ; hl = oscall hl, return address on stack

   jp (ix)    ; make call to OS address

;; an example of using this

push OSCALL_HL
push OSCALL_DE
push OSCALL_BC
push OSCALL_AF
ld ix,OSCALL_FUNCTION
call oscall

; that's it, registers set according to output of oscall

The other reason we do this in z88dk is that we want to eliminate statics completely from the library so that it is multi-threading safe and it turns out this also tends to lead to better code.

Alcoholics Anonymous · 00:55, 14 April 14

Quote from: cpcitor on 11:13, 11 April 14
Promoting an ASM-level practice to a new C-level linkage

Let's consider when the compiler compiles a C function.
Let's call "contract" the input,output,side effect on registers and stack.

"FASTCALL for a 1-argument function", and for any n, "small-C for an n-argument function", "CALLEE for an n-argument function" and the like are all contracts.

z88dk shows that it is possible for the compiler to deal with different contracts (cf. optimization [z88dk]), with some caveats.

Now the idea would be, to have more possible contracts for the compiler. That's a new linkage, let's call it CUSTOM.

The compiler could figure out from the context a particular, better suited, contract. That would get most of the benefits of inlining, as SDCC does, and most of the benefits of z88dk's different linkages, but automatically tuned (in the spirit of Interprocedural optimization).

FASTCALL and CALLEE are simple to introduce because they are compatible with normal C-linkage where you compute a param value, push it, compute another param value, push it, etc. FASTCALL is compute param value but don't push it.

The register contract in asm is the normal interface for most asm routines, I agree with you. But unless you are passing constants or static variables to the function, your code may in fact be worse than using a callee interface simply because computing non-trivial parameters may mean saving registers to the stack which is one half of the callee convention, with the other half occurring when you pop those registers back. I think looking at a lot of asm code using functions may reveal what benefit a register only interface has -- I think most asm code is using statics and constants and perhaps one or two computed values so the register only interface works. It may not be as often useful as one might hope for in the context of a C program, however I do think like you -- what humans do should be an indication of the best way to do things for a C compiler.

Currently there is no way to communicate to the z80 C compilers what is happening inside an external function so it must assume everything gets modified. sccz80 just assumes everything is toasted and sdcc assumes the same except it also imposes on the called function a condition that ix must not be modified. That communication is what your proposal is about.

In order for such a thing to lead to better code, the compiler has to be able to take advantage of it and that means the compiler must be able to allocate important variables to registers that stay live over several statements. sccz80 cannot do this without changes to the compiler's structure. sdcc might be able to take better advantage of this but the few registers the z80 has available (and non-orthogonal at that) fights the long-lived allocation of variables into registers. It will probably have to push and pop some frequently used variables near the top of the stack (humans do too) and again half the callee contract appears. It's also worth noting that neither sdcc nor sccz80 make use of the EXX set so they are confined to using af,bc,de,hl. In my own asm code, it's actually been rare that using the exx set has helped locally except when doing two parallel tasks so that communication of values between the exx sets is limited.

And yet this is how things are done in asm as you say

And my answer is if that's how it's done by hand, then the compiler should be able to generate code that way, even if it often can't. It's a thing that's much easier said than done and would involve a re-write of sdcc's z80 code generator, sdcc's register allocator, sdcc's AST to include decorations to indicate what registers are unchanged in function calls and probably more things. Just realize the moment you sneak up to the AST level, you are affecting the C compiler for all sdcc's targets and suddenly you have objections from the 8051, pic, stm8, etc people who maybe cannot see any benefit at all in doing this for their architectures. The people that would have to do the work are going to be wondering if there would be much improvement in the code generated (see my comments about half the callee contract inadvertently showing up).

I am always motivated by what do expert asm programmers do, so I would be for doing something like that but it's something that entails a lot of work and might face some resistance from people who would have to do the work.

cpcitor · 12:19, 27 December 17

Quote from: cpcitor on 12:18, 09 April 14
Yet another calling convention is used by the CPC firmware : just use registers as you want, ensure (or not) that some have their value unchanged on return and make that part of the contract.

(...)

we would not need a wrapper at all and just have user code including a header file that contains this:

void __MODS_AF__ __MODS_BC__ __MODS_DE__ __MODS_HL__ fw_gra_line_absolute(__ARG_DE__ int x, __ARG_HL__ int y);

... and compiler would figure out the rest. Instead of generating that pesky long code that can't be optimized properly with peephole rules, it could maybe just optimize called code as it optimizes local (or inlined) code, that is: "The compiler has a table stating for each CPU instruction what are its input, output and other changes. Now this function call is like a sort-of big instruction that happens to have inputs, outputs and other changes. Let's apply standard register allocation algorithms to that."

What SDCC supports is not exactly that, but it's interesting nonetheless: since firmware already takes care to preserve some registers, sdcc can avoid saving them for nothing.

Quote
3.5.6 Preserved register specification

SDCC allows to specify preserved registers in function declarations, to enable further optimizations on calls to
functions implemented in assembler. Example for the Z80 architecture specifying that a function will preserve
register pairs bc and iy:
void f(void) __preserves_regs(b, c, iyl, iyh);

This is actually orthogonal to the question of CALLEE, FASTCALL which only relate to how information is passed.

It somehow relieves the compiler of saving/restoring registers that from now on are known to be preserved anyway.
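As a sketch of how such a declaration could sit in a header that also compiles on a host machine, the block below hides the annotation behind a macro. `PRESERVES_REGS`, `demo_fw_wrapper` and `demo_print` are invented names, and the register list is simply copied from the manual's example: it is not a claim about any particular firmware routine, whose preserved registers have to be looked up in the firmware documentation. The sketch assumes SDCC's predefined `__SDCC` macro to detect the compiler.

```c
/* On SDCC the macro expands to the real annotation; elsewhere it expands
   to nothing, so the same header compiles on a host for testing. */
#if defined(__SDCC)
#define PRESERVES_REGS(...) __preserves_regs(__VA_ARGS__)
#else
#define PRESERVES_REGS(...)
#endif

/* Hypothetical asm wrapper. The register list is the manual's example,
   NOT a statement about a real firmware entry point. */
void demo_fw_wrapper(unsigned char ch) PRESERVES_REGS(b, c, iyl, iyh);

/* A caller with loop state that the compiler may now be able to keep in
   a preserved register across the call instead of saving it. */
unsigned demo_print(const char *s)
{
    unsigned n = 0;
    while (*s) {
        demo_fw_wrapper((unsigned char)*s++);
        n++;
    }
    return n;
}

#if !defined(__SDCC)
/* Host-only stand-in so the sketch links without the asm wrapper. */
unsigned demo_calls;
void demo_fw_wrapper(unsigned char ch) { (void)ch; demo_calls++; }
#endif
```

Under SDCC the wrapper body would live in an assembler file, and getting the register list wrong would silently break callers, so each annotation is worth checking against the routine's documented exit conditions.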

cpcitor · 21:20, 27 December 17

Quote from: cpcitor on 12:19, 27 December 17
What SDCC supports is not exactly that, but it's interesting nonetheless: since firmware already takes care to preserve some registers, sdcc can avoid saving them for nothing.

This is actually orthogonal to the question of CALLEE, FASTCALL which only relate to how information is passed.

It somehow relieves the compiler of saving/restoring registers that from now on are known to be preserved anyway.

It works!

I added the __preserves_regs type annotation on a number of C prototypes in cpc-dev-tool-chain/cpclib/cfwi/include/cfwi at master · cpcitor/cpc-dev-tool-chain and recompiled https://github.com/cpcitor/color-flood-for-amstrad-cpc .

I did only a quick test (without a high max-alloc-per-node).
Not a big deal, but it *did* reduce code size somewhat. Looking at the ASM code, I saw an instance where a useless PUSH/POP pair disappeared.
When several firmware calls are chained, the optimization can be defeated by one specific firmware call that needs the PUSH/POP pair again, which shows that SDCC still does its job.
It also looks like it causes no-overall-effect changes in where SDCC stores some registers via IX.

Anyway, I include that in https://github.com/cpcitor/cpc-dev-tool-chain/tree/master/cpclib/cfwi/include/cfwi
