Interesting article on efficient C code

cpcitor · 21:36, 07 April 14

Hi everyone,

Just to share that article I stumbled upon, which explains how to get the best of both worlds when writing C code (wow that's high level) for our beloved Z80 CPU: Efficient C Code for 8-bit Microcontrollers | Embedded Systems Experts.

It starts from the basics and explains good practices step by step, very simply. It doesn't talk about the Z80 or SDCC (most examples are for an 8051), but the advice applies very well. Plus, I did a little check and SDCC seems to behave very efficiently.

We have no more excuses, either for using int everywhere or for making every variable global (which is not actually the most efficient, and hurts code readability).

The article mentions some handy tips to make debugging easy, yet automatically switch to the most efficient code through a preprocessor toggle such as DEBUG or NDEBUG.
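As a rough sketch of that toggle idea (the CHECK macro, the failed_checks counter and the 80-column limit are made up for illustration and are not taken from the article), a check can report problems in debug builds and vanish completely when NDEBUG is defined:

```c
#include <stdio.h>

/* Hypothetical counter so the effect of the toggle can be observed. */
unsigned char failed_checks = 0;

#ifndef NDEBUG
/* Debug build: report the failing expression and count it. */
#define CHECK(expr) \
    do { \
        if (!(expr)) { \
            failed_checks++; \
            printf("CHECK failed: %s (%s:%d)\n", #expr, __FILE__, __LINE__); \
        } \
    } while (0)
#else
/* Release build: the check disappears, costing no code and no cycles. */
#define CHECK(expr) ((void)0)
#endif

/* Clamp a column number to an assumed 80-column screen. */
unsigned char clamp_column(unsigned char x)
{
    CHECK(x < 80);
    return (x < 80) ? x : 79;
}
```

The standard assert() from assert.h follows the same NDEBUG convention; a custom macro like this one is only useful if you want different reporting.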

Do use sane datatypes (8-bit unsigned whenever possible, TODO check how SDCC handles computation with both signed and unsigned),
do structure your code in separate modules (files),
do use static functions and variables,
benefit from volatile variables,
benefit from const variables (TODO check if SDCC handles them as well as a #define), const function parameters,
benefit from an assert() macro.
Notice that recursion and variable-length argument lists are mostly okay on a CPC with SDCC (because the firmware sets up a stack for us), but prefer printf_tiny to printf.
Also SDCC provides dynamic memory allocation (1K by default) that you can tune (well, use it only if you actually need it).
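To make a few of the tips above concrete, here is a minimal sketch of a small module that uses 8-bit unsigned types, a const table, and a static helper. The function and table names are invented for the example, and where the const table ends up in memory depends on the compiler and target.

```c
#include <stdint.h>

/* const table: typically placed with the code rather than copied into RAM data. */
static const uint8_t bit_mask[8] = { 1, 2, 4, 8, 16, 32, 64, 128 };

/* static: visible only in this file, so the compiler is free to inline it. */
static uint8_t count_bits(uint8_t value)
{
    uint8_t i;
    uint8_t n = 0;
    for (i = 0; i < 8; i++) {
        if (value & bit_mask[i]) {
            n++;
        }
    }
    return n;
}

/* The only function exported from this module.
   The caller is assumed to keep len <= 31 so the total fits in 8 bits. */
uint8_t count_bits_in_buffer(const uint8_t *buf, uint8_t len)
{
    uint8_t total = 0;
    while (len--) {
        total += count_bits(*buf++);
    }
    return total;
}
```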

Cheers!

Octoate · 22:31, 07 April 14

In the SDCC Wiki, you will also find a small chapter for writing efficient code when using the Z80 port of SDCC:

Quote
Writing efficient code
===============

memcpy() [Does not apply to gbz80, which doesn't have ldir or equivalent]

- Code generation for memcpy() is very efficient. Don't hesitate to use it. E.g. copying a structure (with at least two member variables) will be much more efficient using memcpy() than by assigning the members.

bool
- The port has complete support for the _Bool/bool data type when compiling in c99 or sdcc99 mode. Use bool for condition variables, since sdcc will generate more efficient code than e.g. when using unsigned char.

signed vs. unsigned
- The Z80 lacks an efficient signed comparison. Using unsigned variables will result in slightly smaller and faster code. On the GBZ80 the situation is even worse, resulting in a more substantial difference in code size and speed.
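A minimal sketch of the memcpy() and bool advice from the quoted wiki text. The sprite structure and its instances are hypothetical; whether the copy is actually inlined as a block move depends on the SDCC version and options.

```c
#include <stdbool.h>
#include <string.h>

struct sprite {              /* hypothetical structure */
    unsigned char x;
    unsigned char y;
    unsigned char frame;
    bool visible;            /* bool rather than unsigned char for a flag */
};

struct sprite player = { 10, 20, 3, true };
struct sprite backup;

/* Copy the whole structure in one go instead of member by member. */
void sprite_copy(struct sprite *dst, const struct sprite *src)
{
    memcpy(dst, src, sizeof *dst);
}
```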

AMSDOS · 23:12, 07 April 14

I suppose if you substitute Fortran with C from "Real Programmers Don't Use Pascal" you could say "If you can't do it in C, do it in assembly language. If you can't do it in assembly language, it isn't worth doing."

[ot]I have to wonder: if Quiche eaters use Pascal, do BASIC users eat Spaghetti?

[/ot]

cpcitor · 08:24, 08 April 14

Quote from: Octoate on 22:31, 07 April 14
In the SDCC Wiki, you will also find a small chapter for writing efficient code when using the Z80 port of SDCC:

Thanks for sharing. I had already seen it (but had since forgotten). Here's the URL: Z80 port - SDCC wiki

Optimus · 08:40, 08 April 14

Hmm, strange. Use of globals has been promoted by some CPCists I talked to about C coding on the CPC, for better performance. I guess this was as opposed to using function arguments; I can understand how this would be faster. But anyway, I never liked putting everything I have in globals, it does indeed make the code dirty, so I continued using arguments (and structs if there are too many arguments to pass to a function).

cpcitor · 09:26, 08 April 14

Global is good... to a point, as it fights with the compiler.

Quote from: Optimus on 08:40, 08 April 14
Hmm, strange. Use of globals has been promoted by some CPCists I talked to about C coding on the CPC, for better performance. I guess this was as opposed to using function arguments; I can understand how this would be faster.

You're right. Making things global gets some (well, actually most) of the benefits, as all locations are decided at compile time. But it deprives the compiler of the information it needs to make smart choices automatically.

As explained in the article, static variables and functions have limited scope, so the compiler can place them in memory so that they are accessible with shorter instructions.

SDCC is smart, hint it and trust it

If the compiler is dumb, do everything by hand, but then you might just as well write everything in hand-written assembly. On the contrary, if the compiler is smart, just keep your source code readable for you and for it. Every time I checked, assembly generated by z88dk was like stuttering, and assembly generated by SDCC was very good, short, clear and efficient. I think SDCC is smart enough to make good use of a clean source.

Static is an interesting middle ground

Quote from: Optimus on 08:40, 08 April 14
But anyway, I never liked putting everything I have in globals, it does indeed make the code dirty, so I continued using arguments (and structs if there are too many arguments to pass to a function).

Yes, globals pollute a global namespace. On the contrary, you can have static variables limited to the inside of a function, which don't collide with other similar static variables with same name. You get speed and don't sacrifice readability/maintenance.
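A small sketch of function-local statics: two functions each own a variable called calls, and the names do not collide. The function names are invented for the example.

```c
/* Each function has its own 'calls' variable; the names do not clash. */
unsigned char next_player_id(void)
{
    /* Fixed address like a global, but only visible inside this function. */
    static unsigned char calls = 0;
    return ++calls;
}

unsigned char next_enemy_id(void)
{
    static unsigned char calls = 100;
    return ++calls;
}
```

Both variables keep their value between calls, so they behave like globals for storage purposes while staying out of the global namespace.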

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" (Knuth)

Anyway, in virtually all programs, only a handful of tight loops actually consume the processing power.
It's much better to code nearly everything with just simplicity and source readability in mind.

Once the program works, I would find the loops taking most time, look at the generated ASM code for these small parts only, adjust C source to get better generated ASM code, and only in special cases (like the innermost loop of the main program/game function) start from that to rewrite a hand-custom assembly loop (if I can do better than SDCC at all, which is not that obvious).

Have I said that SDCC is smart ?

There are generic hints on Wikipedia Program optimization. Have a look at "strength reduction". Now be aware that SDCC - Small Device C Compiler mentions:

Quote
a host of standard optimizations such as global sub expression elimination, loop optimizations (loop invariant, strength reduction of induction variables and loop reversing), constant folding and propagation, copy propagation, dead code elimination and jump tables for 'switch' statements.

I have personally seen sdcc perform dead code optimization, make smart register allocation changes based on, e.g. the number of parameters of a function, perform at compile time whatever computation that was written in source code with just plain variables (that I had not even marked const). Sdcc is smart.
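For readers who have not met the term, this is what strength reduction of an induction variable looks like when done by hand. It is only an illustration of the transformation, not SDCC output, and ROW_BYTES is an assumed value chosen for the example.

```c
#include <stdint.h>

#define ROW_BYTES 80u   /* assumed bytes per screen row, for illustration only */

/* Naive version: one multiplication per iteration. */
uint16_t sum_row_offsets_mul(uint8_t rows)
{
    uint16_t sum = 0;
    uint8_t y;
    for (y = 0; y < rows; y++) {
        sum += (uint16_t)(y * ROW_BYTES);
    }
    return sum;
}

/* Strength-reduced version: the multiplication becomes a running addition. */
uint16_t sum_row_offsets_add(uint8_t rows)
{
    uint16_t sum = 0;
    uint16_t offset = 0;
    uint8_t y;
    for (y = 0; y < rows; y++) {
        sum += offset;
        offset += ROW_BYTES;
    }
    return sum;
}
```

If the compiler performs this optimization itself, the first form can stay in the source, which is the readable one.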

So, I think you're totally right and SDCC is a really smart compiler (at least on Z80 target).

AMSDOS · 10:54, 08 April 14

The only issue I've heard of with regard to global versus local variables relates to debugging. Defining all your variables Globally would make it harder to debug if you had the kind of bug which doesn't trigger an error as such, so passing values between routines. I guess the only thing worth having as a global would be an array which might need to be accessed by several routines. If it were a large one cooped up in some routine which alone had access to it, that would make it a pain to deal with for the rest of the program.

cpcitor · 13:37, 08 April 14

Quote from: AMSDOS on 10:54, 08 April 14
Defining all your variables Globally would make it harder to debug if you had the kind of bug which doesn't trigger an error as such, so passing values between routines.

I don't understand this part. Can you give an example ?

redbox · 16:34, 08 April 14

Reminds me of a conversation I overheard once discussing a student's Java program...

Lecturer: So how could you make it faster?
Student: Write it in C
Lecturer: Yes, that's one way, but could you think of another?
Student: Write it in assembler

Priceless.

Alcoholics Anonymous · 16:57, 08 April 14

Quote from: cpcitor on 21:36, 07 April 14
Just to share that article I stumbled upon, which explains how to get the best of both worlds when writing C code (wow that's high level) for our beloved Z80 CPU: Efficient C Code for 8-bit Microcontrollers | Embedded Systems Experts.

It's good in that it brings up some of the issues in programming small micros in C, but it is a little too general. There is a lot of variation in 8-bit architectures and these affect what parts of the C language can be efficiently translated. C was originally designed to map almost directly to the underlying PDP-11 assembly language and as a result it can translate to very efficient assembly language on any machine that can support a certain minimum ISA. That minimum, of course, is covered by any 32-bit processor but that is not the case for the 8-bits and the 8-bits have different deficiencies.

The article touches on some issues that apply to certain architectures and then supplies general rules like "don't use recursion or variable argument functions" that do not apply to all architectures. If you were to find all the problems for every 8-bit architecture and then decide you will never use those features of the C language for any 8-bit target, you're writing code for a very narrow subset of C that will not run efficiently on any 8-bit.

Quote
We have no more excuse neither for using int everywhere, nor for making every variable global (not actually the most efficient, plus it hurts code readability).

Well int is an efficient data type on the z80. We push and pop ints on the stack; pushing a char as a char is actually worse in terms of performance. We load and store ints to memory where we can use "ld (nn),rp" whereas there is only one 8-bit store "ld (nn),a". Integer math can be done directly with "add hl,rp" and "sbc hl,rp".

When the article speaks of preferring chars, it is thinking of 8-bit architectures that cannot do these things. Having said that, char operations can be more efficient in some circumstances. sdcc, specifically, can benefit from preferring 8-bit types where appropriate as it attempts to keep live variables in registers as long as possible. This is important because the decision to use IX as frame pointer is a very expensive one, and the chances of being able to keep the several variables the program is typically simultaneously operating on in registers is increased with 8-bit types. There is also the Z80 cp instruction which will not change the accumulator and equivalent is not available in 16-bits. And sdcc has expended some effort in making sure use of chars is optimized.

On the question of globals -- there is no doubt on the z80 architecture that it will lead to faster and smaller code. This is because the z80 does not have a stack-relative addressing mode. Accessing local variables can be very slow and can lead to large code. sdcc uses IX as frame pointer so reading and writing variables declared locally involves "ld r,(ix+d)" type instructions, even done twice for int types. It's important that sdcc keeps locals in registers to avoid frequent access to memory via ix. z88dk instead pushes and pops items from the stack to access locals like a human would but it is not able to look ahead as sdcc can in determining what values should best be kept in registers. I would say one of the main reasons translated C compares poorly against hand-written code is this problem of accessing local variables. sdcc comes out of the gate crippled by a design decision to use ix as frame pointer and z88dk does the right thing but is not smart enough to always lead to better code.

Using globals allows direct storage of ints and chars with fast and compact "ld (nn),rp" and "ld (nn),a" type instructions. That includes indexing into arrays or structs (think what has to be done to access an array at a specific index or accessing struct member when these things are declared on the stack -- additions and sometimes multiplications have to be done at runtime). However it makes programs less understandable, which can lead to bugs. You can at least solve the namespace issue of using globals by preferring to use local statics in functions.

Quote
Do use sane datatypes (8-bit unsigned whenever possible, TODO check how SDCC handles computation with both signed and unsigned),

That's right -- do not use a type by default. Use the correct type for the job.

That example from the article of mixed signed and unsigned addition leading to unexpected result is a contrived one. It's only surprising to people who haven't thought about what they are doing. Adding a signed number to an unsigned one is actually a programming error if done by accident. In other languages you would get warnings or errors when trying to do such a thing but C is very permissive because it allows you to do whatever the underlying architecture allows you to do. When writing asm, you need to know you are adding a signed and an unsigned number and when the result comes you look at different flags (overflow or carry?) because you are aware of what you just did. The same applies in C -- if you are going to add a signed number to an unsigned one, it is a deliberate action with results that may not be the same as adding two unsigned or two signed numbers.
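A short sketch of the kind of surprise being discussed (the function names are made up, and this is not the article's exact example): when an unsigned int and an int are added, the int is converted to unsigned first, so a result that is "negative" on paper wraps around to a large positive value.

```c
/* Returns 1 if the C comparison (u + s) > 0 holds, 0 otherwise. */
int mixed_sum_is_positive(unsigned int u, int s)
{
    /* s is converted to unsigned before the addition, so a "negative"
       sum wraps around to a large positive value. */
    return (u + s) > 0;
}

/* Same test with both operands deliberately widened to a signed type first. */
int signed_sum_is_positive(unsigned int u, int s)
{
    return ((long)u + (long)s) > 0;
}
```

With u = 10 and s = -20 the first function reports a positive sum and the second does not, which is the point: mixing the two should be a deliberate choice.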

You should always prefer an unsigned type unless you actually need a signed type. There are reasons not mentioned in that article -- if you are loose with mixing signed and unsigned arithmetic, the compiler may be required to insert sign extension code that may be unintended by you. But worse -- it has consequences for how multiplication, division and shifting happen. Division of signed numbers is slower than unsigned because the sign of the quotient and remainder have to be explicitly adjusted afterward. A fast multiplication algorithm (such a thing exists in z88dk's dev build) will look at the magnitude of the two multiplicands to immediately determine the answer is too large to represent and will try to reduce the size of the multiplication by looking at leading zero bits. Similarly a fast division algorithm will try to reduce the size of the division by looking at leading zero bits and will return immediate results if the divisor > dividend. These fast algorithms only work on unsigned numbers -- any signed multiplication or division has to take the absolute value of the numbers first and adjust signs afterward.
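The extra work for signed division can be sketched as follows. This is a plain shift-and-subtract divider written for illustration, not the z88dk or sdcc library routine; it only returns the quotient, and the function names are assumptions.

```c
#include <stdint.h>

/* Unsigned 16-bit division by shift-and-subtract; den must be non-zero. */
uint16_t udiv16(uint16_t num, uint16_t den)
{
    uint16_t quot = 0;
    uint32_t rem = 0;   /* wider than 16 bits so the shift below cannot overflow */
    uint8_t i;

    if (den > num) {
        return 0;       /* immediate result when the divisor exceeds the dividend */
    }
    for (i = 0; i < 16; i++) {
        rem = (rem << 1) | (uint32_t)(num >> 15);   /* bring down the next bit */
        num = (uint16_t)(num << 1);
        quot = (uint16_t)(quot << 1);
        if (rem >= den) {
            rem -= den;
            quot |= 1u;
        }
    }
    return quot;
}

/* Signed division layered on top: strip the signs, divide, restore the sign.
   The corner case -32768 / -1 is not handled in this sketch. */
int16_t sdiv16(int16_t num, int16_t den)
{
    uint16_t un = (num < 0) ? (uint16_t)(0u - (uint16_t)num) : (uint16_t)num;
    uint16_t ud = (den < 0) ? (uint16_t)(0u - (uint16_t)den) : (uint16_t)den;
    uint16_t uq = udiv16(un, ud);

    if ((num < 0) != (den < 0)) {
        return (int16_t)(-(int32_t)uq);   /* operands had different signs */
    }
    return (int16_t)uq;
}
```

Everything in sdiv16 outside the call to udiv16 is overhead that unsigned code never pays.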

I didn't check how old that blog entry was, but the author also seems to be a few standards behind. The C standard introduced "stdint.h" years ago specifically to define types that work best on the architecture the compiler is targeting. For example there is "uint8_t", which is an unsigned type exactly 8 bits wide, and "uint_fast8_t", which is the fastest type with at least 8 bits to use for 8-bit arithmetic. They may not be the same thing -- the former might map to char and the latter might map to int. For accessing hardware, there is also an embedded systems extension he does not mention.
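A minimal sketch of how the two kinds of stdint.h types might be combined: exact-width types for stored data, a "fast" type for the working counter. The function names are invented, and which underlying type uint_fast8_t maps to is entirely up to the compiler.

```c
#include <stdint.h>
#include <stddef.h>

/* uint8_t for storage (exactly 8 bits), uint_fast8_t for the loop counter
   (at least 8 bits, whatever width the compiler considers fastest). */
uint16_t sum_bytes(const uint8_t *data, uint8_t len)
{
    uint16_t sum = 0;
    uint_fast8_t i;
    for (i = 0; i < len; i++) {
        sum += data[i];
    }
    return sum;
}

/* Reports how wide the "fast" type is on the current compiler. */
size_t fast8_size(void)
{
    return sizeof(uint_fast8_t);
}
```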

However, even though I am aware of these things I too prefer to define a uchar and uint type -- I am just too used to them -- and I know what they are turned into by the compiler I'm using. However, if I wanted the source to work equally well on any target, I would have to use those stdint.h types.

Quote
benefit from volatile variables

This will rule out many optimizations by the compiler. I don't think you should use volatile unless you need to make sure the compiler is not optimizing away seemingly superfluous reads and writes. Eg, you will most likely see volatile used to read memory mapped hardware registers. Some unrolled code that reads data from a hardware register might look like this:

Code:


data[i++] = *hw_data;
data[i++] = *hw_data;
...
data[i++] = *hw_data;

An optimizer might look at that and decide it's going to read hw_data once and store the same value at several locations in the data[] array. That would be incorrect code though and the person programming this would have to declare hw_data volatile to get correct code generation.
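A self-contained sketch of the corrected declaration. A real program would point hw_data at a fixed hardware address; here an ordinary volatile variable stands in for the register so the example can run anywhere, and all names are hypothetical.

```c
/* Stand-in for a memory-mapped register so the sketch runs anywhere;
   on real hardware this would be a fixed address. */
volatile unsigned char fake_hw_register = 0x5A;
unsigned char burst_buffer[4];

/* Because the pointed-to object is volatile, every *hw_data below must be
   performed as a real read; the compiler may not merge them into one. */
unsigned char read_burst(volatile const unsigned char *hw_data,
                         unsigned char *dest, unsigned char count)
{
    unsigned char i = 0;
    while (i < count) {
        dest[i++] = *hw_data;
    }
    return i;
}
```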

Quote
Also SDCC provides dynamic memory allocation (1K by default) that you can tune (well, use it only if you actually need it).

Most programmers blindly use dynamic memory without being made aware of the problems with dynamic memory allocation. Modern machines have lots of physical and virtual memory so these problems very rarely manifest themselves. In fact, an entire language (Java) was designed thinking there was enough memory to never worry about those problems.

The sdcc implementation is fine, with an efficient realloc (the only problem is it's written in C), so the only issues with it are inherent in any small memory machine. Those issues have to do with fragmentation; if you have a long running program with many varied size allocations and deallocations, you can actually end up in a situation where the program can no longer allocate memory when there is in fact much memory available, if all the small fragments are added together. In embedded systems in general, where programs must run for long periods, usually the heap is frequently reset or a different allocation method is used. z88dk has more allocation options for this reason.
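One common "different allocation method" is a pool of fixed-size blocks, which cannot fragment because every block is interchangeable. The following is a generic sketch under assumed sizes and names; it is not the allocator shipped with sdcc or z88dk.

```c
#include <stddef.h>

#define POOL_BLOCK_SIZE 16u   /* assumed block size */
#define POOL_BLOCKS      8u   /* assumed number of blocks */

static unsigned char pool_storage[POOL_BLOCKS][POOL_BLOCK_SIZE];
static unsigned char pool_used[POOL_BLOCKS];

/* Hand out the first free block, or NULL when the pool is exhausted. */
void *pool_alloc(void)
{
    unsigned char i;
    for (i = 0; i < POOL_BLOCKS; i++) {
        if (!pool_used[i]) {
            pool_used[i] = 1;
            return pool_storage[i];
        }
    }
    return NULL;
}

/* Return a block to the pool; pointers outside the pool are ignored. */
void pool_free(void *p)
{
    size_t index;
    if (p == NULL) {
        return;
    }
    index = (size_t)((unsigned char *)p - &pool_storage[0][0]) / POOL_BLOCK_SIZE;
    if (index < POOL_BLOCKS) {
        pool_used[index] = 0;
    }
}

/* Number of blocks still available. */
unsigned char pool_free_count(void)
{
    unsigned char i;
    unsigned char n = 0;
    for (i = 0; i < POOL_BLOCKS; i++) {
        if (!pool_used[i]) {
            n++;
        }
    }
    return n;
}
```

The price is that every allocation costs a full block, so this suits programs whose objects are all of similar size.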

AMSDOS · 01:28, 09 April 14

Quote from: cpcitor on 13:37, 08 April 14
I don't understand this part. Can you give an example ?

It was more or less something I recall hearing when I was programming C years ago, however it was when I was programming on PCs with 32-bit processors using Borland C. Alcoholics Anonymous makes it sound like a whole different ballgame when dealing with an 8-bit computer, which I could understand due to the different structures.

Personally though I've only used Small-C on the Amstrad, which I think handles Internal Variables differently from Global Variables when compared to sdcc, though I haven't used sdcc, so it may be best to question that.

But with regard to Debugging, Internal variables can make it easier to Debug a program: rather than having a Global Variable which is accessible to the whole program, internal variables limit where a fault can be to a particular section of code, which makes it easier to track down where a problem may be occurring.

Unfortunately I don't have any good examples, Internal Variables are usually something defined inside the main() code or after declaring a function (or whatever it's called - to me it looks like a procedure). Something defined as Global would be at the beginning of the program after a #include <stdio.h> and the whole program can recognise that variable.

Alcoholics Anonymous · 07:25, 09 April 14

Quote from: AMSDOS on 01:28, 09 April 14
Personally though I've only used Small-C on the Amstrad, which I think handles Internal Variables differently from Global Variables when compared to sdcc, though I haven't used sdcc, so it may be best to question that.

The C name for "internal variables" is "automatic variables" or "local variables." When I was in school, Microsoft used to like to try to stump interviewees by asking what an "auto" variable is in C as a conversation opener. There's this rarely used and obscure keyword in C "auto" whose only purpose is to declare that a particular variable is stored on the stack, which is what the default is when you declare a variable in a function and is why you never see it used. Its counterparts "static" and "register" do have meaning though and are well-understood by everyone. Anyway, "auto" is short for "automatic."

sdcc pushes function parameters on the stack in right-to-left order, then calls the function (places ret address on the stack), then saves ix, then loads ix with current sp value to act as new frame pointer, then creates space on the stack for locals in declaration order either with pushes or direct manipulation of sp, then runs the function, then "ld sp,ix" to clear locals, then pops ix to restore caller's frame pointer, then returns, then executes code to remove the function parameters from the stack (pops or direct manipulation of sp). This is a standard C linkage and can be costly depending on the size of the called function. One person quoted from the manual that "memcpy was very efficiently implemented" -- the reason is that a call to a memcpy routine costs more than inlining the code, so sdcc inlines instances of memcpy. It does the same with strcpy I think. It's faster and smaller on sdcc but the tendency of sdcc to inline code for a lot of things can lead to larger code sizes for real projects in comparison to other compilers that have better linking options. Ah yes, all parameters on the stack are multiples of two bytes in size (two for char/int and four for long).

All small C derivatives have the same linkage as sdcc except function parameters are pushed in left to right order and they do not use an index register as frame pointer.

z88dk has the traditional small C linkage but also introduces two more: FASTCALL which allows a single parameter to be passed by register and CALLEE which allows the called function to clear up the stack. When using a standard C linkage (whether R->L or L->R parameter order), those extra bytes required after the call to the function to clear up the stack can add up to hundreds of bytes in a large program. These extra alternatives can shave a lot of bytes off code size, particularly for z88dk as best practice involves making lots of calls to library routines.

Quote
But with regard to Debugging, Internal variables can make it easier to Debug a program because a problem maybe occurring somewhere in the program, rather than having a Global Variable which is accessible to the whole program, internal variables could make it easier to track where a problem maybe occurring by limiting where the fault is to a particular section of code.

I agree very much -- using globals leads to hard-to-debug code, hard-to-maintain code and hard-to-understand code. This is why it's drilled into everyone to avoid using globals -- it's bad in the same sense using goto is bad. But on big machines, there is little or no performance gained by using globals (in fact using globals may affect performance negatively because it can screw with the cache), so you are only introducing problems for yourself by using them haphazardly.

Gryzor · 07:50, 09 April 14

This is off-topic, but:

@Alcoholics Anonymous: Please note that your account was not activated because it was flagged as a spammer... please accept my apologies, and feel free to create it again, this time I will activate it.

Thanks,
Gryzor

arnoldemu · 09:29, 09 April 14

SDCC's code generation is not that great. I thought it was going to be amazing, but it's not as good as I thought

I liked z88dk's fastcall functions and the way I could encourage it to generate better code by re-writing my c slightly.

cpcitor · 12:11, 09 April 14

Quote from: arnoldemu on 09:29, 09 April 14
SDCC's code generation is not that great. I thought it was going to be amazing, but it's not as good as I thought

It would be very interesting if you could post specific examples : C source, generated ASM (from SDCC's *.rst file or z88dk's *.opt files), file sizes.

Hum, both struggle somehow with simple constructs. Code is correct but suboptimal.

Here is an example:

C source

Code:

void compute_char() {
		static unsigned char myfirstnumber=12;
		static unsigned char mysecondnumber=30;
		static unsigned char myresult;
 		myresult = myfirstnumber + mysecondnumber;
}

void compute_int() {
		static unsigned int myfirstnumber=12;
		static unsigned int mysecondnumber=30;
		static unsigned int myresult;
 		myresult = myfirstnumber + mysecondnumber;
}

int
main (int argc, char **argv)
{
	compute_char();
	compute_int();
}

Hand-written asm code would be :

Code:


; compute_char() 
ld a,(myfirstnumber)
add a,(mysecondnumber)
ld (myresult),a
ret

; compute_char() -- pasted from a selection of z88dk's output!
	ld	de,(_st_compute_int_myfirstnumber)
	ld	hl,(_st_compute_int_mysecondnumber)
	add	hl,de
	ld	(_st_compute_int_myresult),hl
ret

Very simple test, I would rate C for SDCC and C+ for z88dk ;-)

For all source code, including room for static variables, static init code, main, caller/callee conventions (well, not much to do in this case !), all the stuff, here are the sizes:
sdcc generated code: 0x83 bytes (no AMSDOS header) with command line sdcc -mz80 -c arithtest.c
z88dk generated code: 0x89 bytes (not counting AMSDOS header) with command line zcc +cpc -lndos -a arithtest.c -o arithtest.opt ; zcc +cpc -lndos -create-app arithtest.c -o arithtest.bin

z88dk results

z88dk gets local 16bit unsigned int computation right (but why make all function symbols go to a single JP to another label ?)
z88dk gets 8bit unsigned char computation complicated by fetching 16 bits then zeroing unneeded part. It could have simply fetched 8 bits.

Conclusion : on z88dk using int is faster than char ? Case for adding a real stdint.h content with "uint_fast8_t" being 16-bit on z88dk!

sdcc 3.3 results

sdcc 3.3 appears to make an effort to always preserve DE (it ignores --all-callee-saves --callee-saves), which forces it to take longer routes.

Alcoholics Anonymous is probably right to say:

Quote
sdcc comes out of the gate crippled by a design decision to use ix as frame pointer and z88dk does the right thing but is not smart enough to always lead to better code.

It might be interesting to discuss that on their mailing-lists.

sdcc-generated asm below

Code:

;--------------------------------------------------------
; File Created by SDCC : free open source ANSI-C Compiler
; Version 3.3.0 #8604 (Oct  8 2013) (Linux)
; This file was generated Wed Apr  9 12:18:31 2014
;--------------------------------------------------------
	.module hello
	.optsdcc -mz80
	
;--------------------------------------------------------
; Public variables in this module
;--------------------------------------------------------
	.globl _main
	.globl _compute_int
	.globl _compute_char
;--------------------------------------------------------
; special function registers
;--------------------------------------------------------
;--------------------------------------------------------
; ram data
;--------------------------------------------------------
	.area _DATA
_compute_char_myfirstnumber_1_1:
	.ds 1
_compute_char_mysecondnumber_1_1:
	.ds 1
_compute_char_myresult_1_1:
	.ds 1
_compute_int_myfirstnumber_1_2:
	.ds 2
_compute_int_mysecondnumber_1_2:
	.ds 2
_compute_int_myresult_1_2:
	.ds 2
;--------------------------------------------------------
; ram data
;--------------------------------------------------------
	.area _INITIALIZED
;--------------------------------------------------------
; absolute external ram data
;--------------------------------------------------------
	.area _DABS (ABS)
;--------------------------------------------------------
; global & static initialisations
;--------------------------------------------------------
	.area _HOME
	.area _GSINIT
	.area _GSFINAL
	.area _GSINIT
;hello.c:2: static unsigned char myfirstnumber=12;
	ld	iy,#_compute_char_myfirstnumber_1_1
	ld	0 (iy),#0x0C
;hello.c:3: static unsigned char mysecondnumber=30;
	ld	iy,#_compute_char_mysecondnumber_1_1
	ld	0 (iy),#0x1E
;hello.c:9: static unsigned int myfirstnumber=12;
	ld	iy,#_compute_int_myfirstnumber_1_2
	ld	0 (iy),#0x0C
	ld	iy,#_compute_int_myfirstnumber_1_2
	ld	1 (iy),#0x00
;hello.c:10: static unsigned int mysecondnumber=30;
	ld	iy,#_compute_int_mysecondnumber_1_2
	ld	0 (iy),#0x1E
	ld	iy,#_compute_int_mysecondnumber_1_2
	ld	1 (iy),#0x00
;--------------------------------------------------------
; Home
;--------------------------------------------------------
	.area _HOME
	.area _HOME
;--------------------------------------------------------
; code
;--------------------------------------------------------
	.area _CODE
;hello.c:1: void compute_char() {
;	---------------------------------
; Function compute_char
; ---------------------------------
_compute_char_start::
_compute_char:
;hello.c:5: myresult = myfirstnumber + mysecondnumber;
	ld	hl,#_compute_char_mysecondnumber_1_1
	push	de
	ld	iy,#_compute_char_myresult_1_1
	push	iy
	pop	de
	ld	a,(#_compute_char_myfirstnumber_1_1 + 0)
	add	a, (hl)
	ld	(de),a
	pop	de
	ret
_compute_char_end::
;hello.c:8: void compute_int() {
;	---------------------------------
; Function compute_int
; ---------------------------------
_compute_int_start::
_compute_int:
;hello.c:12: myresult = myfirstnumber + mysecondnumber;
	ld	hl,#_compute_int_mysecondnumber_1_2
	push	de
	ld	iy,#_compute_int_myresult_1_2
	push	iy
	pop	de
	ld	a,(#_compute_int_myfirstnumber_1_2 + 0)
	add	a, (hl)
	ld	(de),a
	ld	a,(#_compute_int_myfirstnumber_1_2 + 1)
	inc	hl
	adc	a, (hl)
	inc	de
	ld	(de),a
	pop	de
	ret
_compute_int_end::
;hello.c:16: main (int argc, char **argv)
;	---------------------------------
; Function main
; ---------------------------------
_main_start::
_main:
;hello.c:18: compute_char();
	call	_compute_char
;hello.c:19: compute_int();
	jp	_compute_int
_main_end::
	.area _CODE
	.area _INITIALIZER
	.area _CABS (ABS)

z88dk generated ASM below

Code:

;* * * * *  Small-C/Plus z88dk * * * * *
;  Version: 20100416.1
;
;	Reconstructed for z80 Module Assembler
;
;	Module compile time: Wed Apr  9 12:16:03 2014



	MODULE	arithtest.c


	INCLUDE "z80_crt0.hdr"


;	SECTION	code


._compute_char
	jp	i_3
;	SECTION	text

._st_compute_char_myfirstnumber
	defm	""
	defb	12

;	SECTION	code


;	SECTION	text

._st_compute_char_mysecondnumber
	defm	""
	defb	30

;	SECTION	code


.i_3
	ld	hl,(_st_compute_char_myfirstnumber)
	ld	h,0
	ex	de,hl
	ld	hl,(_st_compute_char_mysecondnumber)
	ld	h,0
	add	hl,de
	ld	h,0
	ld	a,l
	ld	(_st_compute_char_myresult),a
	ret



._compute_int
	jp	i_6
;	SECTION	text

._st_compute_int_myfirstnumber
	defw	12
;	SECTION	code


;	SECTION	text

._st_compute_int_mysecondnumber
	defw	30
;	SECTION	code


.i_6
	ld	de,(_st_compute_int_myfirstnumber)
	ld	hl,(_st_compute_int_mysecondnumber)
	add	hl,de
	ld	(_st_compute_int_myresult),hl
	ret



._main
	call	_compute_char
	call	_compute_int
	ret




; --- Start of Static Variables ---

;	SECTION	bss

._st_compute_char_myresult	defs	1
._st_compute_int_myresult	defs	2
;	SECTION	code



; --- Start of Scope Defns ---

	XDEF	_compute_char
	XDEF	_main
	XDEF	_compute_int


; --- End of Scope Defns ---


; --- End of Compilation ---

Overall conclusion

This very local (toy) test appears to be in favor of z88dk.

Before that I compared z88dk-generated and sdcc-generated code on real (though simple) code (not using static/global variables by the way). It was clearly in favor of sdcc.

We can't conclude from this alone.

So far I'm still for SDCC because of its superior design making it able to progress more than z88dk (the option in z88dk to use sdcc as a compiler may be a hint ?).

Anyone wanting to discuss this test and the result on SDCC mailing list ?

Quote from: arnoldemu on 09:29, 09 April 14
I liked z88dk's fastcall functions and the way I could encourage it to generate better code by re-writing my c slightly.

I'll post another thread about calling conventions.
How would you rewrite your C slightly to help z88dk ? Can you give an example ?

Alcoholics Anonymous · 19:12, 09 April 14

Quote from: cpcitor on 12:11, 09 April 14

Hand-written asm code would be :

Code:
; compute_char()
ld a,(myfirstnumber)
add a,(mysecondnumber)
ld (myresult),a
ret

Whoops, there's no "add a,(nn)" instruction

You can do one of two ways, using either a mixed 16-bit / 8-bit load or a pure 8-bit load:

Code:


;;; pure 8-bit

ld a,(myfirstnumber)
ld e,a
ld a,(mysecondnumber)
add a,e
ld (myresult),a
ret

57 cycles
12 bytes

;; mixed 16-bit

ld hl,(myfirstnumber)
ld a,(mysecondnumber)
add a,l
ld (myresult),a
ret

56 cycles
11 bytes

It's faster and smaller to use the mixed method, which underlines the semi-16-bit characteristics the z80 has. These are z80 cycle counts not adjusted for any cycle stretching caused by hardware, and they are what a compiler will worry about.
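The totals above can be re-derived by summing per-instruction costs. Here is a minimal sketch in C, assuming the standard documented Z80 T-state counts and instruction sizes; the struct, table, and function names are made up for illustration and come from neither compiler.

```c
#include <assert.h>
#include <stddef.h>

/* One Z80 instruction with its documented cost (illustrative type). */
struct insn {
    const char *text;
    unsigned cycles;   /* T-states, no hardware cycle stretching */
    unsigned bytes;    /* encoded size */
};

#define COUNT(a) (sizeof (a) / sizeof (a)[0])

/* The "pure 8-bit" hand-written sequence. */
const struct insn pure8[] = {
    { "ld a,(nn)", 13, 3 },
    { "ld e,a",     4, 1 },
    { "ld a,(nn)", 13, 3 },
    { "add a,e",    4, 1 },
    { "ld (nn),a", 13, 3 },
    { "ret",       10, 1 },
};

/* The "mixed 16-bit" hand-written sequence. */
const struct insn mixed16[] = {
    { "ld hl,(nn)", 16, 3 },
    { "ld a,(nn)",  13, 3 },
    { "add a,l",     4, 1 },
    { "ld (nn),a",  13, 3 },
    { "ret",        10, 1 },
};

/* Sum the T-states of a sequence of n instructions. */
unsigned total_cycles(const struct insn *seq, size_t n)
{
    unsigned sum = 0;
    assert(seq != NULL);
    for (size_t i = 0; i < n; i++)
        sum += seq[i].cycles;
    return sum;
}

/* Sum the encoded sizes of a sequence of n instructions. */
unsigned total_bytes(const struct insn *seq, size_t n)
{
    unsigned sum = 0;
    assert(seq != NULL);
    for (size_t i = 0; i < n; i++)
        sum += seq[i].bytes;
    return sum;
}
```

Calling `total_cycles(mixed16, COUNT(mixed16))` and `total_bytes(mixed16, COUNT(mixed16))` gives the 56 cycles and 11 bytes quoted above; the same tables could be extended with the compiler-generated sequences.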

Code Select


;;; you had the int one copied instead of char, so corrected here
;;; z88dk

	ld	hl,(_st_compute_char_myfirstnumber)
	ld	h,0
	ex	de,hl
	ld	hl,(_st_compute_char_mysecondnumber)
	ld	h,0
	add	hl,de
	ld	h,0
	ld	a,l
	ld	(_st_compute_char_myresult),a
	ret


95 cycles
19 bytes


this method by hand:

   ld de,(myfirstnumber)
   ld hl,(mysecondnumber)
   add hl,de
   ld a,l
   ld (myresult),a
   ret

74 cycles
13 bytes

z88dk is tangled up by not realizing it doesn't have to immediately perform an int->char conversion after the loads. The "ld hl,(nn)" instruction (16 cycles, 3 bytes) is faster and smaller than "ld de,(nn)" (20 cycles, 4 bytes, because of its ED prefix), so the first load from myfirstnumber, "ld hl,(nn); ...; ex de,hl", costs the same 20 cycles and 4 bytes as the more direct "ld de,(nn)" once the "ex de,hl" is counted. The extra time and size in the z88dk output come from the three "ld h,0" conversions.

Internally z88dk (rather sccz80) is thinking in terms of "grab this first parameter using primary register (operations with primary register are supposed to be faster -- and it is in this case), we need to save it so place it in secondary register ("ex de,hl"), go get the next parameter in primary register", etc. There is no consideration of whether using the secondary register for the first load would have been faster or smaller (here it would be neither).
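The reason the int->char conversions can be postponed is that the low byte of a sum depends only on the low bytes of the operands. A small sketch in portable C, assuming unsigned char operands (as the zero-extending "ld h,0" suggests); the function names are illustrative, and the "junk" high byte stands for whatever happens to sit next to the variable in memory when it is loaded as a 16-bit word.

```c
#include <stdint.h>

/* What the C source asks for: promote both chars to int, add,
   then truncate the result back to 8 bits. */
uint8_t add_char(uint8_t first, uint8_t second)
{
    return (uint8_t)(first + second);
}

/* Model of a 16-bit load of an 8-bit variable: the low byte is the
   variable, the high byte is unrelated neighbouring data. */
uint16_t word_with_junk(uint8_t low, uint8_t junk_high)
{
    return (uint16_t)(((uint16_t)junk_high << 8) | low);
}

/* Model of "add hl,de" followed by "ld a,l": a 16-bit add where only
   the low byte of the result is kept. */
uint8_t add_low_byte_of_words(uint16_t first_word, uint16_t second_word)
{
    uint16_t sum = (uint16_t)(first_word + second_word);
    return (uint8_t)(sum & 0xFFu);
}
```

Whatever the junk bytes are, `add_low_byte_of_words(word_with_junk(a, x), word_with_junk(b, y))` equals `add_char(a, b)`, so zeroing H before the add is unnecessary work.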

Code Select


;;; sdcc char

	ld	hl,#_compute_char_mysecondnumber_1_1
	push	de
	ld	iy,#_compute_char_myresult_1_1
	push	iy
	pop	de
	ld	a,(#_compute_char_myfirstnumber_1_1 + 0)
	add	a, (hl)
	ld	(de),a
	pop	de
	ret


107 cycles
18 bytes

This looks like a bug in sdcc to me, where proper cost of using iy is not being applied. Using an index register to load a constant and then pass it to de via the stack is something sdcc is perfectly capable of avoiding. There is an option in sdcc to turn off use of iy, maybe you could try again with that enabled to see how much difference that makes. They've also committed some fixes to the z80 code generator and they have a new release coming in about a week so it may be worthwhile to try again after that happens.

Quote
It might be interesting to discuss that on their mailing-lists.
So far I'm still for SDCC because of its superior design making it able to progress more than z88dk (the option in z88dk to use sdcc as a compiler may be a hint ?).

I'm actually one of the z88dk devs, so I can say we do know about all these issues.

sdcc's compiler is more modern and has more potential moving forward.

z88dk's (called sccz80) has a very long lineage, going back 30 years to the original small C by Ron Cain. The small C derivatives that are around, including the one for Amstrad, are derived from that work as well but z88dk's version has seen a lot of development so that only the structure of the compiler remains the same. Unfortunately it is that structure that limits the kind of optimizations the compiler is able to perform. For ten years+, the topic of rewriting the compiler from scratch has come up occasionally but it's a major undertaking no one has wanted to commit to. And then sdcc came along.

sdcc is capable of making better optimization decisions than sccz80, there is no question of that, but whether that leads to a better result is not guaranteed.

For example all z80 compilers, as they are now, generate code that is from three to five times slower and bigger than hand-written code. What this means is, by first approximation, a compiler with access to a lot of hand-written library routines will always win in terms of code size and speed, if the C program is written to use library routines. That has been the path z88dk has been following -- it has more than 100k lines of asm library routines, and C has been treated as a kind of glue logic for calling asm library routines. So whereas in sdcc a call to qsort would be a call to a compiled C implementation of qsort, a call to qsort in z88dk goes to a hand-made routine that will be 3-5x smaller and faster.**

** Actually, for qsort, the existing implementation has been passed through the compiler and then hand optimized. The new clib contains a version written from scratch.
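To illustrate the "C as glue logic" style, here is a minimal sketch that leaves the sorting to the standard `qsort` from `<stdlib.h>`. The names `sort_scores` and `compare_bytes` are hypothetical; which `qsort` implementation ends up linked (compiled C or hand-written asm) depends on the toolchain, as described above.

```c
#include <stdlib.h>

/* Comparator for qsort: orders unsigned 8-bit values ascending.
   Returns negative, zero, or positive as qsort requires. */
static int compare_bytes(const void *a, const void *b)
{
    unsigned char x = *(const unsigned char *)a;
    unsigned char y = *(const unsigned char *)b;
    return (x > y) - (x < y);
}

/* Glue code only: the actual work happens inside the library routine. */
void sort_scores(unsigned char *scores, size_t count)
{
    qsort(scores, count, sizeof scores[0], compare_bytes);
}
```

The compiled part is just the comparator and one call, so the quality of the library routine matters more than the quality of the compiler here.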

So the first issue for sdcc is that it doesn't have much in the way of libraries and what it does have is mainly written in C. The first issue for z88dk is its compiler is weaker than sdcc's but it has all these libraries. It's like Reese's peanut butter cups all over again, and the two should be put together, which is what has been happening lately. sdcc added a few things to smooth issues with the z80 targets in z88dk and z88dk has been doing things to allow optional compilation using sdcc. z88dk, in fact, has a new clib that allows using either sccz80 or sdcc to generate code with it and there is a working target using that now.

But even with the library issue aside, the strategy that sccz80 is supposed to take is to reduce code size. It does this by making calls to subroutines to do all sorts of little tasks. Need to compare two 16-bit values? There's a function for that. Need to increment a long? There's a function for that too. sdcc instead wants to inline code. This is faster and on the surface looks better, but it is also larger. If I compare two ints 10 times in my code, the sccz80 method of using the slow subroutine to do it may result in smaller code than sdcc's faster inline comparison. With sccz80, z88dk has been willing to pay that cost because the idea is that more runtime is spent in the asm library routines than in the glue code.
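The size trade-off can be put into numbers. Below is a rough sketch in C; the byte counts used in the usage note are hypothetical, not measured from sccz80 or sdcc, and the model ignores any extra instructions needed to set up arguments before a call.

```c
/* Total bytes when the operation is expanded inline at every site. */
unsigned inline_total(unsigned sites, unsigned inline_bytes)
{
    return sites * inline_bytes;
}

/* Total bytes when every site is a call to one shared routine. */
unsigned call_total(unsigned sites, unsigned call_bytes,
                    unsigned routine_bytes)
{
    return routine_bytes + sites * call_bytes;
}

/* Smallest number of sites at which the shared routine is strictly
   smaller than inlining; returns 0 if calling never wins. */
unsigned break_even_sites(unsigned inline_bytes, unsigned call_bytes,
                          unsigned routine_bytes)
{
    if (inline_bytes <= call_bytes)
        return 0;
    return routine_bytes / (inline_bytes - call_bytes) + 1;
}
```

Assuming, purely for illustration, an 8-byte inline comparison, a 3-byte "call nn", and a 20-byte shared routine, ten comparisons cost 80 bytes inline against 50 bytes with calls, and the subroutine approach starts winning from the fifth use.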

Then with the ix versus push/pop for locals, whether one method leads to larger code than another depends on data access patterns, the number of local variables (this includes parameters passed to the function), etc.

To sum up, it's not clear, to me at least, whether sdcc can manage to generate smaller code than the less optimal sccz80. We're going to find out though, because the new clib in z88dk means sdcc has access to the same library as sccz80.

Don't get me wrong here -- I am personally excited about using sdcc as a compiler option but I am just not yet convinced as things are now that it will lead to smaller code, which is currently the main struggle for us on a 64k system. Even if it doesn't right now, sdcc will only get better with time.

Alcoholics Anonymous · 08:20, 11 April 14

Quote
For all source code, including room for static variables, static init code, main, caller/callee conventions (well, not much to do in this case !), all the stuff, here are the sizes:
sdcc generated code: 0x83 bytes (no AMSDOS header) with command line sdcc -mz80 -c arithtest.c
z88dk generated code: 0x89 bytes (not counting AMSDOS header) with command line zcc +cpc -lndos -a arithtest.c -o arithtest.opt ; zcc +cpc -lndos -create-app arithtest.c -o arithtest.bin

When comparing binary sizes, especially for small programs, you also have to be careful to subtract out the crt size (that's the c runtime startup stub that sets up the c environment prior to calling main). sdcc's is very minimal as it doesn't have any target-specific code nor does it have things like FILEs (stdin,stdout,stderr), it's missing half of stdio (the scanf half), etc. The z88dk cpc target will provide STDOUT, STDIN, STDERR, some kind of basic keyboard input, some basic character output, static variables for the graphics library, etc, and that stuff might also cause more code to attach. If you want an easy way to compare code size you really have to provide the same crt for both compilers to work from.
