CPCWiki forum

General Category => Programming => Topic started by: ervin on 17:11, 01 August 13

Title: 32-character-width screen mode
Post by: ervin on 17:11, 01 August 13
Hi everyone.

I'm almost certain I've seen code before that puts the CPC's screen into "spectrum mode".
I'm not talking about 32x32, but about 32x24 (but 32x25 would be better).

Can anyone help me out with some code to enable this mode?

Also (possible dumb question alert!!!) does changing to such a mode free up some memory?
Or are we still dealing with 16K worth of screen to update? (i.e. is that narrower screen still scattered through RAM from &c000 onwards?)

Thanks for any pointers.
Title: Re: 32-character-width screen mode
Post by: ralferoo on 18:18, 01 August 13
I'm almost certain I've seen code before that puts the CPC's screen into "spectrum mode".
I'm not talking about 32x32, but about 32x24 (but 32x25 would be better).
Code: [Select]
out &bc01,32    : rem 32 characters wide
out &bc02,&2a  : rem horizontal position

out &bc06,24    : rem 24 characters tall
Title: Re: 32-character-width screen mode
Post by: Sykobee (Briggsy) on 18:23, 01 August 13
You'll want to read up on the CRTC: CRTC - CPCWiki (http://www.cpcwiki.eu/index.php/CRTC) and how in the CPC it generates memory addresses (handy table on The 6845 Cathode Ray Tube Controller (CRTC) (http://tinyvga.com/6845) that should be on the wiki page)


In particular, register 1 (displayed width) and register 6 (displayed height) to set the size, and then register 2 and register 7 to adjust where it appears on the screen.
(See ralferoo's post above for the values to use.)

It does free up memory (as a 32x24 display uses 12KB of screen memory exactly, rather than the 16000 bytes the normal CPC screen uses), but not in a single contiguous block due to how the CPC arranges its screen memory.

A standard CPC screen has 384 bytes of the 16KB memory free, but they're free in 8 different 48 byte locations in that memory due to the screen layout (see the spare start/end here: http://www.cpcmania.com/Docs/Programming/Painting_pixels_introduction_to_video_memory.htm). You can see from this that reducing the screen height to 24 characters frees up a bit more memory (an extra 80 bytes) in each of those 8 locations.

This new mode similarly scatters the display memory throughout the 16KB block, but you only need to update 12KB of it to update what is shown on the screen. Reducing the screen width to 32 characters (64 bytes) means that each MemoryRow* in the screen display is now only 64*24 bytes (1536 bytes) rather than 80*24 bytes (1920 bytes). This gives you 8 areas of 512 bytes that you can use for whatever you want - for example graphics storage, lookup tables, etc.


Unless you start scrolling the screen using the CRTC.  I'll leave that for someone who knows more about that aspect than me.


* erm, a MemoryRow is actually 24 character rows of 32 characters, one pixel line of each, due to screen memory layout / CRTC/ULA cleverness - I need a better name! Sorry!
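
The byte counts above are easy to check with a few lines of Python. This is only a sketch: the function name is made up, and it assumes the standard layout of eight 2KB blocks (one per pixel line of a character row) with 2 bytes per CRTC character.

```python
BLOCK_SIZE = 0x800   # each of the 8 pixel-line blocks is 2KB
BLOCKS = 8           # one block per pixel line of a character row

def screen_usage(width_chars, height_chars, bytes_per_char=2):
    """Return (bytes displayed, spare bytes per 2KB block)."""
    used_per_block = width_chars * bytes_per_char * height_chars
    return used_per_block * BLOCKS, BLOCK_SIZE - used_per_block

for w, h in [(40, 25), (40, 24), (32, 24), (32, 32)]:
    total, spare = screen_usage(w, h)
    print(f"{w}x{h}: {total} bytes shown, {spare} spare in each of {BLOCKS} blocks")
```

The 32x24 case gives the 12KB displayed and the 8 spare areas of 512 bytes mentioned above, while 32x32 leaves nothing spare.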
Title: Re: 32-character-width screen mode
Post by: TFM on 20:52, 01 August 13
Code: [Select]
out &bc01,32    : rem 32 characters wide
out &bc02,&2a  : rem horizontal position

out &bc06,24    : rem 24 characters tall
That's wrong!
Try this:
Code: [Select]

10 MODE 1
20 OUT &BC01,&01
30 OUT &BD20,&20:REM 32 chars
40 OUT &BC02,&02
50 OUT &BD2A,&2A:REM H-Pos.
60 OUT &BC06,&06
70 OUT &BD19,&19:REM 25 lines

That works ;-) Have fun!
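
For anyone wondering why the first version failed: each CRTC register needs two writes, one to the select port (&BCxx) to choose the register and one to the data port (&BDxx) to set its value. Here is a rough Python sketch that prints the BASIC lines for a set of register values. The helper name is made up, and it uses &BC00/&BD00 on the assumption that only the high byte of the port address is decoded, which would also explain why the &BC01/&BD20 forms above work.

```python
CRTC_SELECT = 0xBC00  # write the register number here
CRTC_DATA = 0xBD00    # then write the register's new value here

def crtc_basic_lines(settings, first_line=20, step=10):
    """Build BASIC OUT statements for a list of (register, value) pairs."""
    lines = []
    number = first_line
    for register, value in settings:
        lines.append(f"{number} OUT &{CRTC_SELECT:04X},{register}:OUT &{CRTC_DATA:04X},{value}")
        number += step
    return lines

# R1 = displayed width, R2 = horizontal sync position, R6 = displayed height
for line in crtc_basic_lines([(1, 32), (2, 0x2A), (6, 25)]):
    print(line)
```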
Title: Re: 32-character-width screen mode
Post by: ralferoo on 22:33, 01 August 13
That's wrong!
Yeah, my bad. That's what comes of writing stuff from memory without thinking first!  :-[
Title: Re: 32-character-width screen mode
Post by: TFM on 23:32, 01 August 13
Ha-ha! Happens when mixing Basic and Assembler in mind. You definitely work too much.  :)
Title: Re: 32-character-width screen mode
Post by: ervin on 02:08, 02 August 13
Sykobee - thanks for all that information! Absolutely brilliant.
 
Ralferoo and TFM/FS - thanks for the code guys. Much appreciated.
 
[EDIT #1] Actually I reckon I might use the 32x32 mode, and have the top and bottom borders as info panels.  8)
 
[EDIT #2] Hmmm, that's weird.
I can write data to the first 31 character rows.
But when I write to the last character row, the CPC crashes.
 
(http://i69.photobucket.com/albums/i57/poppichicken/32x32.png)
 
In this screenshot, the last character row has not had anything written to it at all.
All the other rows have had something put into them.

Is there some important stuff in that area of RAM which must not be overwritten?
 
Title: Re: 32-character-width screen mode
Post by: ralferoo on 10:47, 02 August 13
[EDIT #2] Hmmm, that's weird.
I can write data to the first 31 character rows.
But when I write to the last character row, the CPC crashes.

Is there some important stuff in that area of RAM which must not be overwritten?
No, all of the default screen memory space (&C000-&FFFF) is available. The normal sized screen uses 16000 bytes of the 16384, so there's not much unused anyway. Those spare bytes aren't allocated though, because when the screen scrolls normally the start offset of the screen memory changes and it wraps around.

It's most likely your calculations are wrong and you're writing beyond &FFFF and overwriting the important stuff from &0000 onwards. Perhaps you're counting lines from 0 and trying to draw to line 32 (which is actually the 33rd line)?

In a 32 character mode, the starting screen addresses for each character line are very easy to calculate: c000 (line 0),c040 (line 1),c080,c0c0,c100,...c7c0 (line 31) and then you add a multiple of 0800 to that for each pixel line. If you consider the last pixel line of the last character row with this scheme, the first character's address is ffc0, so clearly still within the screen memory range.
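
The same scheme as a small Python sketch, in case it helps to check calculations against it. The function and constant names are just for illustration; it assumes the screen starts at &C000 and has not been scrolled.

```python
SCREEN_BASE = 0xC000
BYTES_PER_ROW = 64      # 32 characters x 2 bytes
LINE_STEP = 0x800       # distance between pixel lines of the same character row

def line_address(char_row, pixel_line):
    """Address of the first byte of one pixel line (both arguments counted from 0)."""
    if not (0 <= char_row < 32 and 0 <= pixel_line < 8):
        raise ValueError("row must be 0-31 and pixel line 0-7")
    return SCREEN_BASE + char_row * BYTES_PER_ROW + pixel_line * LINE_STEP

print(hex(line_address(0, 0)))    # first line of the screen
print(hex(line_address(31, 7)))   # last pixel line of the last character row
```

Asking for row 32 raises an error here; on the real machine the equivalent write would wrap past &FFFF and land in low memory.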
Title: Re: 32-character-width screen mode
Post by: arnoldemu on 11:12, 02 August 13
1024 total characters, 2 bytes per character in width, gives &800 for each line.

32x32 = 1024.

So at 32x32 you're using maximum screen capacity - all 16384 bytes.

Perhaps you are overwriting the last line by a few bytes?
Title: Re: 32-character-width screen mode
Post by: ervin on 11:48, 02 August 13
Thanks for the info guys. Indeed I'm no doubt just overwriting memory location &0000. Just gotta figure out where!
Title: Re: 32-character-width screen mode
Post by: redbox on 22:18, 02 August 13
Also, don't forget when using a 32x32 screen on the CPC that it becomes page-aligned.  If you also page-align your sprites (or whatever you want to draw) you only need to do 8-bit increments (e.g. INC L instead of INC HL).

So when you're plotting to the screen, you can simply use code like LD A,(HL) : LD (DE),A : INC L : INC E .  You can also use the SET/RES commands to calculate the screen lines.

All of this adds up to a very fast sprite routine, for example with a 8x8 sprite you would do something like this:

Code: [Select]
; HL = sprite data (page-aligned)
; DE = screen address (page-aligned)

ld a,(hl) : ld (de),a : inc l : inc e    ; line 1
ld a,(hl) : ld (de),a : inc l

set 3,d

ld a,(hl) : ld (de),a : inc l : dec e    ; line 2
ld a,(hl) : ld (de),a : inc l

set 4,d

ld a,(hl) : ld (de),a : inc l : inc e    ; line 4
ld a,(hl) : ld (de),a : inc l

res 3,d

ld a,(hl) : ld (de),a : inc l : dec e    ; line 3
ld a,(hl) : ld (de),a : inc l

set 5,d

ld a,(hl) : ld (de),a : inc l : inc e    ; line 7
ld a,(hl) : ld (de),a : inc l

set 3,d

ld a,(hl) : ld (de),a : inc l : dec e    ; line 8
ld a,(hl) : ld (de),a : inc l

res 4,d

ld a,(hl) : ld (de),a : inc l : inc e    ; line 6
ld a,(hl) : ld (de),a : inc l

res 3,d

ld a,(hl) : ld (de),a : inc l : dec e    ; line 5
ld a,(hl) : ld (de),a

Note that the screen lines are in a non-linear format so you will need to also store your sprite data in the same way.  But you can rip a sprite easily by just swapping the registers over on the routine above.

It becomes even quicker to blank the sprite - just LD A,0 once at the beginning and then each line can simply be LD (DE),A : INC E , which is very quick.  Also, using a mask is quick if you use a look-up table - I can post an example of this if you need it.
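
To see which pixel line each SET/RES lands on, the bit operations can be replayed in Python. This is a sketch rather than anything from a real sprite routine; it assumes D starts as the high byte of an address on the first pixel line of a character row (here &C0).

```python
def gray_line_order():
    """Replay the SET/RES steps on the bits of D and list the pixel lines visited."""
    steps = [("set", 3), ("set", 4), ("res", 3), ("set", 5),
             ("set", 3), ("res", 4), ("res", 3)]
    d = 0xC0                      # high byte of a screen address on pixel line 1
    order = [((d >> 3) & 7) + 1]  # bits 3-5 of D select the pixel line
    for op, bit in steps:
        if op == "set":
            d |= 1 << bit
        else:
            d &= ~(1 << bit)
        order.append(((d >> 3) & 7) + 1)
    return order

print(gray_line_order())
```

Each step changes a single bit, so every pixel line is visited exactly once, in the order the sprite data has to be stored.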
Title: Re: 32-character-width screen mode
Post by: ralferoo on 22:38, 02 August 13
Code: [Select]
; HL = sprite data (page-aligned)
; DE = screen address (page-aligned)

ld a,(hl) : ld (de),a : inc l : inc e    ; line 1
ld a,(hl) : ld (de),a : inc l : inc e
...
If you don't mind corrupting BC, it's actually quicker to do LDI in this case (5 cycles instead of 6 per byte)...  ;D
Title: Re: 32-character-width screen mode
Post by: redbox on 23:08, 02 August 13
If you don't mind corrupting BC, it's actually quicker to do LDI in this case (5 cycles instead of 6 per byte)...  ;D

Interesting...!

Wouldn't work on the DEC E rows though (and you'd have to DEC E again before starting the line).

Also, this would be specifically if you are using the one sprite routine.  If you're compiling your sprites, you could pre-load B and C with the most common bytes and then LD A,B : LD (DE),A which would be quicker...?
Title: Re: 32-character-width screen mode
Post by: TFM on 23:46, 02 August 13

In your code you do SET and RES the register E. Shouldn't it be D?



Also, don't forget when using a 32x32 screen on the CPC that it becomes page-aligned.  If you also page-align your sprites (or whatever you want to draw) you only need to do 8-bit increments (e.g. INC L instead of INC HL).

So when you're plotting to the screen, you can simply use code like LD A,(HL) : LD (DE),A : INC L : INC E .  You can also use the SET/RES commands to calculate the screen lines.

All of this adds up to a very fast sprite routine, for example with a 8x8 sprite you would do something like this:

Code: [Select]
; HL = sprite data (page-aligned)
; DE = screen address (page-aligned)

ld a,(hl) : ld (de),a : inc l : inc e    ; line 1
ld a,(hl) : ld (de),a : inc l : inc e
ld a,(hl) : ld (de),a : inc l : inc e
ld a,(hl) : ld (de),a : inc l

set 3,e

ld a,(hl) : ld (de),a : inc l : dec e    ; line 2
ld a,(hl) : ld (de),a : inc l : dec e
ld a,(hl) : ld (de),a : inc l : dec e
ld a,(hl) : ld (de),a : inc l

set 4,e

ld a,(hl) : ld (de),a : inc l : inc e    ; line 4
ld a,(hl) : ld (de),a : inc l : inc e
ld a,(hl) : ld (de),a : inc l : inc e
ld a,(hl) : ld (de),a : inc l

res 3,e

ld a,(hl) : ld (de),a : inc l : dec e    ; line 3
ld a,(hl) : ld (de),a : inc l : dec e
ld a,(hl) : ld (de),a : inc l : dec e
ld a,(hl) : ld (de),a : inc l

set 5,e

ld a,(hl) : ld (de),a : inc l : inc e    ; line 7
ld a,(hl) : ld (de),a : inc l : inc e
ld a,(hl) : ld (de),a : inc l : inc e
ld a,(hl) : ld (de),a : inc l

set 3,e

ld a,(hl) : ld (de),a : inc l : dec e    ; line 8
ld a,(hl) : ld (de),a : inc l : dec e
ld a,(hl) : ld (de),a : inc l : dec e
ld a,(hl) : ld (de),a : inc l

res 4,e

ld a,(hl) : ld (de),a : inc l : inc e    ; line 6
ld a,(hl) : ld (de),a : inc l : inc e
ld a,(hl) : ld (de),a : inc l : inc e
ld a,(hl) : ld (de),a : inc l

res 3,e

ld a,(hl) : ld (de),a : inc l : dec e    ; line 5
ld a,(hl) : ld (de),a : inc l : dec e
ld a,(hl) : ld (de),a : inc l : dec e
ld a,(hl) : ld (de),a

Note that the screen lines are in a non-linear format so you will need to also store your sprite data in the same way.  But you can rip a sprite easily by just swapping the registers over on the routine above.

It becomes even quicker to blank the sprite - just LD A,0 once at the beginning and then each line can simply be LD (DE),A : INC E , which is very quick.  Also, using a mask is quick if you use a look-up table - I can post an example of this if you need it.
Title: Re: 32-character-width screen mode
Post by: ralferoo on 01:42, 03 August 13
Also, this would be specifically if you are using the one sprite routine.  If you're compiling your sprites, you could pre-load B and C with the most common bytes and then LD A,B : LD (DE),A which would be quicker...?
If you're going to be doing that kind of thing, you should probably not use LDI, and swap HL and DE. That way, you're always loading A from (DE), but can write to (HL) with any register.

BTW, if you want the absolute quickest method, disable interrupts and use PUSH instead. A 16-bit load and a PUSH is 7 cycles for 2 bytes, so about the quickest you'll manage. This is getting somewhat off topic for the original question now though... ;)
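
For comparison, here are the per-byte costs of the three approaches mentioned so far, worked out in Python. The timings are the CPC figures quoted in this thread (1 NOP = 1us); the dictionary layout is just for illustration, and the PUSH entry assumes the sprite bytes are compiled into the code as immediate values.

```python
# Instruction times in microseconds on the CPC (1 NOP = 1us).
COPY_METHODS = {
    "LD A,(HL) : LD (DE),A : INC L : INC E": (2 + 2 + 1 + 1, 1),  # (us, bytes moved)
    "LDI": (5, 1),
    "LD HL,nn : PUSH HL": (3 + 4, 2),
}

def us_per_byte(method):
    """Average cost of moving one byte with the given method."""
    microseconds, bytes_moved = COPY_METHODS[method]
    return microseconds / bytes_moved

for name in COPY_METHODS:
    print(f"{name:40s} {us_per_byte(name):.1f} us/byte")
```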
Title: Re: 32-character-width screen mode
Post by: redbox on 02:16, 03 August 13
@TFM - yes you're right, good spot, will correct it.

@ralferoo - yes I remember swapping the regs but am away from my source code at the moment so am relying on my memory...!  I've yet to try out the stack method but am looking forward to it ;)
Title: Re: 32-character-width screen mode
Post by: ervin on 03:23, 03 August 13
Thanks again for all the info everyone! This has become a very useful thread, with some great tips all in one spot.  :)
Title: Re: 32-character-width screen mode
Post by: Axelay on 09:51, 03 August 13
Interesting...!

Wouldn't work on the DEC E rows though (and you'd have to DEC E again before starting the line).



When using LDI lists for sprites on a page aligned screen, if you are jumping to pixel lines within a character as with your example using SET and RESet, then an alternative to DEC E is to LD A,E initially, and then just LD E,A every time you use SET or RESet.


On the topic of a 12k screen though, you could always use characters set to 6 pixel lines.
Code: [Select]
org &8000

; Set CRTC.R9 to 6-1 scan lines high
ld bc,&BC09
out (c),c
ld bc,&BD05
out (c),c

; set hsync position
ld bc,&bc02
out (c),c
ld bc,&bd00+42
out (c),c

;; set display width of screen
ld bc,&bc01
out (c),c
ld bc,&bd00+32
out (c),c

; set vsync position
ld bc,&bc07
out (c),c
ld bc,&bd00+40
out (c),c

;; set display height of screen
ld bc,&bc06
out (c),c
ld bc,&bd00+32
out (c),c

;; set vertical total (R4): 52 character rows of 6 scan lines = 312 lines per frame
ld bc,&bc04
out (c),c
ld bc,&bd00+51
out (c),c

ret


That gives a 12k screen consuming &c000-&efff, with the single 4k block from &f000-&ffff free to use.  At 32 six pixel characters in height, it's the equivalent of a 24 character high screen, and 32 characters wide.  The only downside with this is if you've got sprites or background tiles that work on the basis of 8 pixel high characters, then the screen addressing becomes very untidy.
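
A quick Python check of the address range, as a sketch only: the function name is made up, and it assumes an unscrolled screen at &C000 with 2 bytes per CRTC character.

```python
SCREEN_BASE = 0xC000

def used_range(width_chars=32, rows=32, lines_per_char=6):
    """Lowest and highest screen address touched by the display."""
    bytes_per_row = width_chars * 2
    last = (SCREEN_BASE
            + (lines_per_char - 1) * 0x800   # last pixel-line block in use
            + (rows - 1) * bytes_per_row     # last character row
            + bytes_per_row - 1)             # last byte of that row
    return SCREEN_BASE, last

first, last = used_range()
print(hex(first), hex(last), 32 * 6, "pixel lines")
```

With 6 lines per character the top address is &EFFF, and 32 rows of 6 lines give the same 192 pixel lines as 24 rows of 8.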
Title: Re: 32-character-width screen mode
Post by: TFM on 22:54, 03 August 13
@TFM - ....
Your code snippet is actually a perfect example. And it shows what can be done with some smart ideas on the CPC.  :)
Title: Re: 32-character-width screen mode
Post by: Gryzor on 19:23, 04 August 13
Quote
I'm not talking about 32x32, but about 32x24 (but 32x25 would be better).


Better? How? Why? WHY?


:D
Title: Re: 32-character-width screen mode
Post by: ervin on 14:09, 05 August 13

Better? How? Why? WHY?

 :D

Because it are MORE BIGGER!!!  :P
Title: Re: 32-character-width screen mode
Post by: Gryzor on 20:18, 28 October 13
Roarr.
Title: Beware of slow instructions
Post by: cpcitor on 00:16, 30 October 13

All of this adds up to a very fast sprite routine, for example with a 8x8 sprite you would do something like this:

Code: [Select]
; HL = sprite data (page-aligned)
; DE = screen address (page-aligned)

ld a,(hl) : ld (de),a : inc l : inc e    ; line 1
ld a,(hl) : ld (de),a : inc l

set 3,d


Warning: the innocent-looking set instructions are actually among the slowest instructions that the Z80 has. It's as slow as no less than 7 NOPs !

For the duration of 4 NOPs you can do this:

Code: [Select]
ld a,d
add a,8
ld d,a

Refer to the cheat sheet I compiled from Kevin Thacker's data on Craving for speed ? A visual cheat sheet to help optimizing your code to death. (http://www.cpcwiki.eu/forum/programming/craving-for-speed-a-visual-cheat-sheet-to-help-optimizing-your-code-to-death/)

Cheers,
Title: Re: Beware of slow instructions
Post by: redbox on 00:26, 31 October 13
Warning: the innocent-looking set instructions are actually among the slowest instructions that the Z80 has. It's as slow as no less than 7 NOPs !
For the duration of 4 NOPs you can do this:
Code: [Select]
ld a,d
add a,8
ld d,a

SET takes 5us.

LD A,D : ADD A,8: LD D,A takes 7us.
Title: Re: Beware of slow instructions
Post by: Bruce Abbott on 08:00, 31 October 13
SET takes 5us.

LD A,D : ADD A,8: LD D,A takes 7us.
According to cpcitor's timing chart SET 3,D takes 2us, while LD A,D : ADD A,8: LD D,A takes 4us.

WinAPE agrees, but I wasn't confident that it was accurate. So I wrote a little test program and ran it on a real 6128. This program executes each instruction sequence 10 million times, which takes enough time to be accurately measured with a wall clock.

The results (accurate to 0.1us):-

NOP 1.0us
SET 3,D 2.0us
LD A,D : ADD A,8: LD D,A 4.0us 

I was pleasantly surprised to find that the timing was identical when running under WinAPE.

Below is the raw data, and my source code:-

Empty loop:- 2 seconds
NOP:- 12 seconds
SET 3,D:- 22 seconds
LD A,D : ADD A,8: LD D,A:- 42 seconds
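
The conversion from wall-clock seconds to microseconds per instruction sequence, written out in Python for anyone who wants to reuse it with other measurements (the names are only illustrative):

```python
ITERATIONS = 20 * 250 * 200 * 10   # instruction sequences executed per test
EMPTY_LOOP_SECONDS = 2             # measured time of the loop with no test code

def microseconds_each(total_seconds):
    """Time per instruction sequence once the loop overhead is subtracted."""
    return (total_seconds - EMPTY_LOOP_SECONDS) * 1_000_000 / ITERATIONS

for name, seconds in [("NOP", 12), ("SET 3,D", 22), ("LD A,D : ADD A,8 : LD D,A", 42)]:
    print(f"{name}: {microseconds_each(seconds):.1f}us")
```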
Code: [Select]
;==============================================
;    CPC machine code execution time test
;==============================================
; 20 * 250 * 200 * 10 = 10,000,000 (10 million) iterations
; Empty loop takes 2 seconds
; Therefore:- us per iteration = (seconds-2)/10
;
 org #4000
 write "test3.bin"
Test1:
 di
 ld b,10
loopa:
 push bc
 ld c,200
loopc:
 ld b,250
loopb:
 ; put test code here 20 times!
 djnz loopb
 dec c
 jr nz,loopc
 pop bc
 djnz loopa
finish:
 ei
 ret

Test2:
 di
 ld b,10
loopa1:
 push bc
 ld c,200
loopc1:
 ld b,250
loopb1:

 set 3,d
 set 3,d
 set 3,d
 set 3,d
 set 3,d
 set 3,d
 set 3,d
 set 3,d
 set 3,d
 set 3,d

 set 3,d
 set 3,d
 set 3,d
 set 3,d
 set 3,d
 set 3,d
 set 3,d
 set 3,d
 set 3,d
 set 3,d

 djnz loopb1
 dec c
 jr nz,loopc1
 pop bc
 djnz loopa1
 ei
 ret


Test3:
 di
 ld b,10
loopa2:
 push bc
 ld c,200
loopc2:
 ld b,250
loopb2:
 ld a,d
 add a,8  ; 1
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8   ; 10
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8
 ld d,a

 ld a,d
 add a,8    ; 20
 ld d,a
 
 djnz loopb2
 dec c
 jr nz,loopc2
 pop bc
 djnz loopa2
 ei
 ret


 
Title: Re: Beware of slow instructions
Post by: cpcitor on 12:09, 31 October 13
NOP 1.0us
SET 3,D 2.0us
LD A,D : ADD A,8: LD D,A 4.0us 

Woops, my mistake. I should have double checked the chart. :-[ You're totally right.  :)

Thanks Bruce for this test and the program! It can be used as a basis for more performance computation (see Using emulator for performance measurement and profiling ? (http://www.cpcwiki.eu/forum/programming/using-emulator-for-performance-measurement-and-profiling/)).

Tied to Y MOD 8 = 0 ?

So, using SET and RES speeds up the "next_scr_line" part of the routine.
You mentioned it also forces you to store sprite data in a different order (which is actually Gray code (http://en.wikipedia.org/wiki/Gray_code) order).

Doesn't the SET/RES version also tie Y to positions equal to 0 modulo 8 ?
Title: Re: Beware of slow instructions
Post by: Axelay on 12:58, 31 October 13


Doesn't the SET/RES version also tie Y to positions equal to 0 modulo 8 ?


If you set up your sprite data for a character aligned Y co-ord yes, but you can also gain a speed benefit with sprites aligned to 2 or 4 pixel lines rather than the full 8 lines of a character.  Just a matter of choosing the right balance for your game.
Title: Re: Beware of slow instructions
Post by: redbox on 13:13, 31 October 13
NOP 1.0us
SET 3,D 2.0us
LD A,D : ADD A,8: LD D,A 4.0us 

So what unit does the WinApe debugger use?  I get 5 for SET and 7 for LD A etc.

If you set up your sprite data for a character aligned Y co-ord yes, but you can also gain a speed benefit with sprites aligned to 2 or 4 pixel lines rather than the full 8 lines of a character.  Just a matter of choosing the right balance for your game.

I agree.  In a non-scrolling game, for tiles I use SET, RES etc as they are always on a boundary.  For sprites I use a combination of SET and a Next Line routine because they are always bound to a multiple of 2 (moving 2 pixels at a time makes much more sense on a CPC).
Title: Re: Beware of slow instructions
Post by: Bruce Abbott on 21:19, 31 October 13
So what unit does the WinApe debugger use?
I can't find any documentation for WinAPE's 'T' value, but in practice it is equal to the execution time in microseconds. I suspect it gets this number by counting the number of M cycles per instruction. On the CPC, all M cycles take 1us (4 T states) as any that would normally take less are 'stretched' by adding wait states.
 
Quote
I get 5 for SET and 7 for LD A etc.
SET b,r uses 8 T states and 2 M cycles, which takes 2us on a 4MHz Z80.
Each LD r,r uses 1 M cycle, and ADD A,n uses 2 M cycles. Therefore the instruction sequence LD A,D : ADD A,8 : LD D,A uses 4 M cycles.
           
Title: Re: 32-character-width screen mode
Post by: TFM on 23:10, 31 October 13
SET and RES in sprite code is very, _VERY_ limited. It works well, as long as you move your sprite _ONLY_ in X, but as soon as you need to move it up and down it will not work any longer.

Title: Re: 32-character-width screen mode
Post by: redbox on 01:04, 01 November 13
It works well, as long as you move your sprite _ONLY_ in X, but as soon as you need to move it up and down it will not work any longer.

Not true. A simple example would be to use SET if the sprite moves 2 pixels at a time in the Y axis:

Code: [Select]
...Do line...
set 3,d
...Do line...
Add &8, check for overflow
...Do line...
set 3,d
...Do line...
Add &8, check for overflow
etc
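
Why this works can be sketched in Python. The assumptions here are a 64-byte-wide screen at &C000 that has not been scrolled, and function names invented for the example. From an even pixel line, bit 3 of the high byte is clear, so SET 3,D gives the same result as the full "add &8" step; from an odd line the general step is needed because it may cross into the next character row.

```python
def next_line(addr):
    """General step to the next pixel line on a 64-byte-wide screen at &C000."""
    addr += 0x800                      # the "add &8" to the high byte
    if addr > 0xFFFF:                  # overflow: we left the bottom of a character row
        addr = (addr & 0xFFFF) + 0xC040  # wrap to pixel line 0 of the next row
    return addr

def next_line_set(addr):
    """Cheap step (SET 3,D); only valid when bit 3 of the high byte is clear."""
    assert addr & 0x0800 == 0, "SET 3,D only works from an even pixel line"
    return addr | 0x0800

addr = 0xC000 + 6 * 0x800 + 3 * 64     # pixel line 6 of character row 3
print(hex(next_line_set(addr)), hex(next_line(next_line_set(addr))))
```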
Title: Re: Beware of slow instructions
Post by: ralferoo on 01:41, 01 November 13
I can't find any documentation for WinAPE's 'T' value, but in practice it is equal to the execution time in microseconds. I suspect it gets this number by counting the number of M cycles per instruction.
T is a T-state in Z80 parlance. If there were no wait states, this is how many clock cycles it would take.
Quote
On the CPC, all M cycles take 1us (4 T states) as any that would normally take less are 'stretched' by adding wait states.
Exactly. Well kind of. Some instructions with an M-cycle that contains 4 T states can be stretched too in some cases, e.g. OUT (C),r takes 4-4-4, but gets stretched to 4-5-4 and so takes 4us on CPC.
Quote
SET b,r uses 8 T states and 2 M cycles, which takes 2us on a 4MHz Z80.
Each LD r,r uses 1 M cycle, and ADD A,n uses 2 M cycles. Therefore the instruction sequence LD A,D : ADD A,8 : LD D,A uses 4 M cycles.
SET still only uses 7 T states, it's just there's an extra wait state. But otherwise, yes, exactly right.

The easiest way of thinking about instruction timing on the CPC is:
1us for every memory access (including instruction fetch, so 2us for a 2 byte instruction, etc)
1us extra for a 16-bit math operation (e.g. ADD HL,DE) where the ALU does 2 cycles instead of 1
1us extra for an IO access
1us if things can't be pipelined (e.g. PUSH versus POP)
A few other places where an extra 1us gets introduced as wait states stretch another M cycle.

I'll illustrate with the PUSH/POP thing. Take POP first.
You've got a single byte instruction. 1us.
You've got a low-byte read from (SP). 1us.
Whilst that's happening, SP is incremented and the result is ready for the next read.
You've got a high-byte read from (SP). 1us.
 Whilst that's happening, SP is incremented and the result is ready for the next instruction.
Total 3us.

For PUSH, it's similar.
You've got a single byte instruction. 1us.
 Before the first write, SP must be decremented. 1us.
You've got a high-byte write to (SP). 1us.
Whilst that's happening, SP is decremented again.
Finally you've got a low-byte write to (SP). 1us.
Total: 4us.

Another example, ADD A,B
You've got a single byte instruction. 1us.
 The ALU calculates A+B simultaneously with the next instruction decode, so free.
Total: 1us

Another example: ADD HL,DE
You've got a single byte instruction. 1us.
 The ALU calculates E+L. 1us
The ALU calculates D+H+carry. 1us (actually, I don't know why this isn't pipelined!)
Total: 3us
 
Another example: ADD IX,DE
Instruction prefix to modify HL to IX. 1us.
As per ADD HL,DE. 3us.
Total: 4us
 
 Final example: SET 2,B
You've got an instruction prefix. 1us.
 You've got another byte instruction. 1us.
 The ALU calculates B or (1<<2) simultaneously with the next instruction decode, so free
Total: 2us

I've said it before, but it's worthwhile to download the Z80 UM and understand what's actually happening. You can infer from the cycle times for each M cycle the kind of thing it's doing and start to get a feel for how it works. As an example, ADD HL,rp is described as 4-4-3.

But certainly the timing for most instructions is down to the number of memory accesses. The number of exceptions is relatively few...
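
The worked examples above boil down to "count the memory accesses, then add the extras". A throwaway Python sketch of that bookkeeping follows; the helper is made up for illustration, and the extras are taken from the examples in this post rather than derived from anything.

```python
def cpc_us(memory_accesses, extra=0):
    """Rule-of-thumb time in microseconds: 1us per memory access plus any extras."""
    return memory_accesses + extra

EXAMPLES = {
    "POP rr":    cpc_us(3),            # opcode fetch + two reads
    "PUSH rr":   cpc_us(3, extra=1),   # opcode fetch + two writes + SP decrement first
    "ADD A,B":   cpc_us(1),            # opcode fetch only
    "ADD HL,DE": cpc_us(1, extra=2),   # opcode fetch + two ALU steps
    "ADD IX,DE": cpc_us(2, extra=2),   # prefix + opcode fetch + two ALU steps
    "SET 2,B":   cpc_us(2),            # prefix + opcode fetch
}

for name, us in EXAMPLES.items():
    print(f"{name:10s} {us}us")
```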
Title: Re: Beware of slow instructions
Post by: Bruce Abbott on 04:02, 01 November 13
SET still only uses 7 T states,
According to my Zilog Z80 CPU User’s Manual SET bit,r uses 8 T states. But no matter, it works out the same in the end. 

Quote
But certainly the timing for most instructions is down to the number of memory accesses. The number of exceptions is relatively few...
I have done some more investigation, and it looks like most of the exceptions occur when a machine cycle takes 5 T states, as the CPU is then forced to wait for the next memory slot even if the next cycle only takes 3 T states!

So to calculate the total time, round each machine cycle's T state count up to the next multiple of 4, add them up, then divide the total by 4.

For example:-

INI
 
T States (4, 5, 3, 4) -> 4, 8, 4, 4 = 20 (T states + wait states) -> 5us
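
The same rule as a short Python function. Treat it as a heuristic sketch only: the function name is made up, and the last example shows a case where the rule disagrees with the 4us that was quoted earlier in the thread for OUT (C),r on a real CPC.

```python
def cpc_us_from_t_states(m_cycles):
    """Round each machine cycle up to a multiple of 4 T states, then convert to us."""
    stretched = [-(-t // 4) * 4 for t in m_cycles]   # ceiling to the next multiple of 4
    return sum(stretched) // 4

print(cpc_us_from_t_states([4, 5, 3, 4]))   # INI
print(cpc_us_from_t_states([4, 4]))         # SET b,r
print(cpc_us_from_t_states([4, 4, 3]))      # ADD HL,rp
print(cpc_us_from_t_states([4, 4, 4]))      # OUT (C),r: rule gives 3, real CPC is 4
```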
     
 
Title: Re: Beware of slow instructions
Post by: AMSDOS on 07:25, 01 November 13
I was pleasantly surprised to find that the timing was identical when running under WinAPE.

So does that mean that the Emulators don't consider the Clock Cycles with regard to Assembly Instructions?
Title: Re: Beware of slow instructions
Post by: arnoldemu on 10:53, 01 November 13
So does that mean that the Emulators don't consider the Clock Cycles with regard to Assembly Instructions?
Many emulators currently consider the time for the whole instruction for the timings you would see on a cpc.
Zilog documentation normally gives timings in T states, because this is how the CPU operates.
But the CPC video logic tells the z80 to pause. This means that instructions are forced to a multiple of 4T states or 1us cycles.
So, when talking about CPC we need to consider this.

Timing on the spectrum is different, their video hardware forces different pauses on the z80, and it differs with each Spectrum model.
48k has different timings compared to 128k+3. Thankfully on CPC, CPC and Plus have the same overall instruction timings.

Some emus are now considering the exact timings, which consider exactly when the z80 in the cpc reads/writes to memory and I/O. This has more of an effect on when results are seen on the screen, especially when you think of changing the palette rapidly. In addition to this, the results you see are also "shifted" depending on the gate-array in the cpc. (So for example, if you write to a palette register using the same method on the cpc and plus, at the same point in the frame, then it's likely that the actual colour you see on the screen is in a different position; this time the timing is down to when the video accepts the change and performs it).

But, what is clear in this discussion is which instructions are quick and which are slow, and what cases they are useful in.

Title: Re: 32-character-width screen mode
Post by: TFM on 21:19, 01 November 13
BTW... There are also some unexpected timings, like for LD BC,(NNNN) and LD IX,(NNNN). They need the same amount of time, but one would expect that the instruction using IX is 1 us slower. But here it's not.

Title: Re: 32-character-width screen mode
Post by: TFM on 21:23, 01 November 13
Not true. A simple example would be to use SET if the sprite moves 2 pixels at a time in the Y axis:

Code: [Select]
...Do line...
set 3,d
...Do line...
Add &8, check for overflow
...Do line...
set 3,d
...Do line...
Add &8, check for overflow
etc


That check for overflow doesn't make it more precise. But you can use your example of course (the quick way) for 8 scanlines. However I was driving at pixel precise movement. But ok, this may not always be needed.
Every purpose has its own ideal strategy. For general routines, code should be able to cope with all that.




BTW: One doesn't have to use 8 scanlines / I mean &800 blocks, the CRTC offers more convenient solutions. Few games are using them though.

Title: Re: Beware of slow instructions
Post by: redbox on 15:23, 02 November 13
T is a T-state in Z80 parlance. If there were no wait states, this is how many clock cycles it would take.

So if I'm using this counter to optimise routines, 1) can I assume that a lower number is faster and 2) what's the total on this timer for 1 frame on the CPC?
Title: Re: Beware of slow instructions
Post by: Bruce Abbott on 22:58, 03 November 13
1) can I assume that a lower number is faster
Yes.
Quote
2) what's the total on this timer for 1 frame on the CPC?
There is no actual timer for T states. However WinAPE has a 'T' counter which shows  the number of microseconds since it was last reset. On the CPC every machine cycle that is not a multiple of 4 T states will have wait states added until it is. Therefore every instruction takes a whole number of microseconds. There are  20,000 microseconds in one 50Hz video frame.

Quote from: TFM
BTW... There are also some unexpected timings, like for LD BC,(NNNN) and LD IX,(NNNN). They need the same amount of time, but one would expect that the instruction using IX is 1 us slower. But here it's not.
Good point. Most Z80 programmers try to avoid using IX and IY when they need speed, but on the CPC this may not always be the best strategy. Before deciding which instructions to use you should compare their timings on the CPC.   
 
Title: Re: Beware of slow instructions
Post by: Executioner on 04:25, 05 November 13
Yes. There is no actual timer for T states. However WinAPE has a 'T' counter which shows the number of microseconds since it was last reset. On the CPC every machine cycle that is not a multiple of 4 T states will have wait states added until it is.

WinAPE has microsecond accurate instruction timing for all instructions and accurate timing for interrupts (which can cause some instructions to delay slightly further). These timings have been measured and tested on both WinAPE and the real hardware.

Just to clarify (I posted more on this in another thread recently). The Z80 may get some wait states added on each instruction depending on the alignment with the clock. It appears the WAIT signal is only released for 1 of every 4 clock cycles. This means it's not actually possible to calculate the number of microseconds for a Z80 instruction simply by using the number of T-States or M-States as specified in the Zilog documentation, unless you look at the exact points the wait-states can be inserted. You are better to use the values found on other CPC specific timing documents on this site. If interrupts are likely to occur and you need accurate timing, it's a little more complicated.

Quote
Therefore every instruction takes a whole number of microseconds. There are  20,000 microseconds in one 50Hz video frame.

Close, it's actually 312 * 64 = 19968 microseconds per frame.
Title: Re: Beware of slow instructions
Post by: cpcitor on 16:17, 05 November 13
Close, it's actually 312 * 64 = 19968 microseconds per frame.

Does this mean that a CPC frame rate is not exactly 50Hz but 1000000/19968, about 50.08 Hz ?
Title: Re: Beware of slow instructions
Post by: Executioner on 00:20, 08 November 13
Does this mean that a CPC frame rate is not exactly 50Hz but 1000000/19968, about 50.08 Hz ?

Yes, that is correct.
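
The arithmetic, spelled out in Python for anyone budgeting routines per frame (the figures are the ones given above):

```python
LINES_PER_FRAME = 312      # scan lines in one frame
US_PER_LINE = 64           # microseconds per scan line

frame_us = LINES_PER_FRAME * US_PER_LINE
frame_hz = 1_000_000 / frame_us
print(frame_us, round(frame_hz, 2))
```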
Title: Re: 32-character-width screen mode
Post by: ralferoo on 13:49, 09 November 13
BTW... There are also some unexpected timings, like for LD BC,(NNNN) and LD IX,(NNNN). They need the same amount of time, but one would expect that the instruction using IX is 1 us slower. But here it's not.
The timing on these instructions is totally as one would expect. There's not even a hint of the unexpected here.

LD HL,(xxxx) and LD (xxxx),HL were on the original 8080, so have normal opcodes (2A and 22).
These always take 5us due to the memory accesses - IF, addrL, addrH, dataL, dataH.

LD rr,(xxxx) and LD (xxxx),rr are Z80 additional instructions and so are assigned codes in the ED space. There is a duplicated HL version here, but nobody uses it.
These always take 6us due to the memory accesses - IFprefix, IF, addrL, addrH, dataL, dataH.

LD Ir,(xxxx) and LD (xxxx),Ir are Z80 additional instructions but implemented with the IX/IY override bytes (DD or FD) but otherwise use the HL form. These ALWAYS take 1us longer than the equivalent HL form.
These always take 6us due to the memory accesses - IFoverride, IF, addrL, addrH, dataL, dataH.
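
The three load forms can be put side by side as lists of memory accesses. This Python sketch only restates the breakdown above, with labels chosen for illustration.

```python
# Memory accesses made by each form of the 16-bit load; 1us apiece on the CPC.
ACCESSES = {
    "LD HL,(nn)": ["opcode", "addrL", "addrH", "dataL", "dataH"],
    "LD BC,(nn)": ["ED prefix", "opcode", "addrL", "addrH", "dataL", "dataH"],
    "LD IX,(nn)": ["DD prefix", "opcode", "addrL", "addrH", "dataL", "dataH"],
}

def load_time_us(instruction):
    """One microsecond per memory access."""
    return len(ACCESSES[instruction])

for name in ACCESSES:
    print(name, load_time_us(name), "us")
```

BC and IX come out equal because both forms carry exactly one prefix byte; only the unprefixed HL form is shorter.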
Title: Re: 32-character-width screen mode
Post by: fgbrain on 11:00, 17 November 15

back to our initial subject..

A 32 x 32 character (64 x 256 byte) screen allows an optimized way to calculate screen addresses, instead of evaluating this difficult and slow equation:

PROBLEM:
How to calculate in asm this equation
Code: [Select]
ADR=&C000+(Y\8)*64+(Y MOD 8)*&800+X
where X is 0-63 and Y is 0-255


SOLUTION
1. We create a 512-byte table (one entry per pixel line, 256 lines in Y) holding the start address of each successive line (&C000, &C800, &D000, ...), say at &A400.  WARNING: each address is not stored as a normal 16-bit word; the low bytes and high bytes are split into two 256-byte halves like this:
    &C000 :   &00 poked at &A400   and &C0 poked at &A500  (+256 bytes),
    &C800 :   &00 poked at &A401   and &C8 poked at &A501  (+256 bytes), etc etc...
This is vital for speed, because the lookup can then be done entirely with 8-bit register operations.

2. now we can use this easy routine:


Code: [Select]
; entry: H = X (0-63), L = Y (0-255)
LD A,H:LD H,&A4:ADD A,(HL):INC H:LD H,(HL):LD L,A
; exit: HL = screen address
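
To check the table trick against the original equation, here is a Python model of it. It is only a sketch: the 64KB bytearray stands in for CPC memory, the function names are invented, and the register steps are mimicked one by one. Adding X to the low byte never carries, because each line start has a low byte that is a multiple of 64 and X is at most 63.

```python
TABLE = 0xA400   # table address used in the post; low bytes here, high bytes 256 bytes later

def build_table(memory):
    """Fill the split table: low bytes at TABLE+Y, high bytes at TABLE+&100+Y."""
    for y in range(256):
        line_start = 0xC000 + (y // 8) * 64 + (y % 8) * 0x800
        memory[TABLE + y] = line_start & 0xFF
        memory[TABLE + 0x100 + y] = line_start >> 8

def lookup(memory, x, y):
    """Mimic the Z80 routine one register operation at a time."""
    h, l = x, y                          # entry: H = X, L = Y
    a = h                                # LD A,H
    h = TABLE >> 8                       # LD H,&A4
    a = (a + memory[(h << 8) | l]) & 0xFF  # ADD A,(HL)
    h += 1                               # INC H
    h = memory[(h << 8) | l]             # LD H,(HL)
    l = a                                # LD L,A
    return (h << 8) | l

memory = bytearray(0x10000)
build_table(memory)
print(hex(lookup(memory, 63, 255)))
```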


Title: Re: 32-character-width screen mode
Post by: rk last on 02:21, 13 August 16
When you work with a reduced display how do you retune it to allow a valid LOCATE?  As I recall, there is a formula that executes a cursor xpos=1, ypos=1 to return to the top left hand corner or anywhere else correctly on screen.

Title: Re: 32-character-width screen mode
Post by: AMSDOS on 06:44, 13 August 16

When you work with a reduced display how do you retune it to allow a valid LOCATE?  As I recall, there is a formula that executes a cursor xpos=1, ypos=1 to return to the top left hand corner or anywhere else correctly on screen.


If it's in MODE 0 I do:


x=x*32 - where x is 0..19, but that's for 20-character-width, so: x=x*20 will give you 32.


LOCATE is purely text only, so TAG..TAGOFF & MOVE can be used to position the text.


And for y I use y=398-(y*16) - where y is 0..24.


But you're talking about the Spectrum mode, which is 32 character width in mode 1.