Author Topic: 32-character-width screen mode  (Read 5942 times)


Offline cpcitor

  • The user previously known as FindYWay
  • CPC6128
  • ****
  • Posts: 235
  • Country: fr
  • My heart still runs on traditional CPC.
    • My code for the CPC.
  • Liked: 111
Re: Beware of slow instructions
« Reply #25 on: 12:09, 31 October 13 »
NOP 1.0us
SET 3,D 2.0us
LD A,D : ADD A,8: LD D,A 4.0us 

Whoops, my mistake. I should have double checked the chart. :-[ You're totally right.  :)

Thanks Bruce for this test and the program! It can be used as a basis for more performance computation (see Using emulator for performance measurement and profiling ?).

Tied to Y MOD 8 = 0 ?

So, using SET and RES speeds up the "next_scr_line" part of the routine.
You mentioned it also forces you to store sprite data in a different order (which is actually Gray code order).

Doesn't the SET/RES version also tie Y to positions equal to 0 modulo 8 ?
Had a CPC since 1985, currently software dev professional, including embedded systems.

I made the first CPC cross-dev environment that auto-installs C compiler and tools: cpc-dev-tool-chain: a portable toolchain for C/ASM development targeting CPC.

Offline Axelay

  • 6128 Plus
  • ******
  • Posts: 533
  • Country: au
  • Liked: 334
Re: Beware of slow instructions
« Reply #26 on: 12:58, 31 October 13 »


Doesn't the SET/RES version also tie Y to positions equal to 0 modulo 8 ?


If you set up your sprite data for a character aligned Y co-ord yes, but you can also gain a speed benefit with sprites aligned to 2 or 4 pixel lines rather than the full 8 lines of a character.  Just a matter of choosing the right balance for your game.

Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
Re: Beware of slow instructions
« Reply #27 on: 13:13, 31 October 13 »
NOP 1.0us
SET 3,D 2.0us
LD A,D : ADD A,8: LD D,A 4.0us 

So what unit does the WinApe debugger use?  I get 5 for SET and 7 for LD A etc.

If you set up your sprite data for a character aligned Y co-ord yes, but you can also gain a speed benefit with sprites aligned to 2 or 4 pixel lines rather than the full 8 lines of a character.  Just a matter of choosing the right balance for your game.

I agree.  In a non-scrolling game, for tiles I use SET, RES etc as they are always on a boundary.  For sprites I use a combination of SET and a Next Line routine because they are always bound to a multiple of 2 (moving 2 pixels at a time makes much more sense on a CPC).

Offline Bruce Abbott

  • CPC664
  • ***
  • Posts: 53
  • Country: nz
  • Liked: 84
Re: Beware of slow instructions
« Reply #28 on: 21:19, 31 October 13 »
So what unit does the WinApe debugger use?
I can't find any documentation for WinAPE's 'T' value, but in practice it is equal to the execution time in microseconds. I suspect it gets this number by counting the number of M cycles per instruction. On the CPC, all M cycles take 1us (4 T states) as any that would normally take less are 'stretched' by adding wait states.
 
Quote
I get 5 for SET and 7 for LD A etc.
SET b,r uses 8 T states and 2 M cycles, which takes 2us on a 4MHz Z80.
Each LD r,r uses 1 M cycle, and ADD A,n uses 2 M cycles. Therefore the instruction sequence LD A,D:ADD A,8:LD D,A uses 4 M cycles.
           

Offline TFM

  • Visit the mysteries of the CPC at www.futureos.de
  • Supporter
  • 6128 Plus
  • *
  • Posts: 9.899
  • Country: aq
  • Space Chicken for FutureOS is free!
    • FutureOS - The revolution on CPC!
  • Liked: 1972
Re: 32-character-width screen mode
« Reply #29 on: 23:10, 31 October 13 »
SET and RES in sprite code is very, _VERY_ limited. It works well, as long as you move your sprite _ONLY_ in X, but as soon as you need to move it up and down it will not work any longer.

TFM of FutureSoft
Also visit the CPC and Plus users favorite OS: FutureOS - The Revolution on CPC6128 and 6128Plus

Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
Re: 32-character-width screen mode
« Reply #30 on: 01:04, 01 November 13 »
It works well, as long as you move your sprite _ONLY_ in X, but as soon as you need to move it up and down it will not work any longer.

Not true. A simple example would be the use of SET when the sprite moves 2 pixels at a time in the Y axis:

Code: [Select]
...Do line...
set 3,d
...Do line...
Add &8, check for overflow
...Do line...
set 3,d
...Do line...
Add &8, check for overflow
etc
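
To make the arithmetic behind this pattern concrete, here is a minimal Python sketch. It assumes a screen at &C000 with 64 bytes per character row (the 32-character-wide layout this topic is about), and the function names are made up for illustration. It shows why SET 3,D can stand in for the add on every other line: when bit 3 of the high byte is clear, setting it is the same as adding &08 to D, and no carry can occur.

```python
SCREEN_BASE = 0xC000   # assumed screen start
ROW_BYTES = 64         # assumed bytes per character row (32-character-wide mode)

def next_line_general(addr):
    """Next scanline the slow way: add &0800, fix up on 16-bit overflow."""
    addr += 0x0800
    if addr > 0xFFFF:
        # Carry out of the high byte: wrap and step to the next character row.
        addr = (addr & 0xFFFF) + SCREEN_BASE + ROW_BYTES
    return addr

def next_line_set(addr):
    """Next scanline the fast way (SET 3,D). Only valid when bit 3 of D is clear."""
    if addr & 0x0800:
        raise ValueError("bit 3 of the high byte must be clear")
    return addr | 0x0800
```

Bit 3 of D is clear on even scanlines of a character row, so a sprite that always starts on an even line can use SET 3,D for the even-to-odd steps and only needs the general routine for the odd-to-even steps.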

Offline ralferoo

  • Supporter
  • 6128 Plus
  • *
  • Posts: 966
  • Country: gb
  • Liked: 579
Re: Beware of slow instructions
« Reply #31 on: 01:41, 01 November 13 »
I can't find any documentation for WinAPE's 'T' value, but in practice it is equal to the execution time in microseconds. I suspect it gets this number by counting the number of M cycles per instruction.
T is a T-state in Z80 parlance. If there were no wait states, this is how many clock cycles it would take.
Quote
On the CPC, all M cycles take 1us (4 T states) as any that would normally take less are 'stretched' by adding wait states.
Exactly. Well kind of. Some instructions with an M-cycle that contains 4 T states can be stretched too in some cases, e.g. OUT (C),r takes 4-4-4, but gets stretched to 4-5-4 and so takes 4us on CPC.
Quote
SET b,r uses 8 T states and 2 M cycles, which takes 2us on a 4MHz Z80.
Each LD r,r uses 1 M cycle, and ADD A,n uses 2 M cycles. Therefore the instruction sequence LD A,D:ADD A,8:LD D,A uses 4 M cycles.
SET still only uses 7 T states, it's just there's an extra wait state. But otherwise, yes, exactly right.

The easiest way of thinking about the clock cycles on the Z80 is:
1us for every memory access (including instruction fetch, so 2us for a 2 byte instruction, etc)
1us extra for a 16-bit math operation (e.g. ADD HL,DE) where the ALU does 2 cycles instead of 1
1us extra for an IO access
1us if things can't be pipelined (e.g. PUSH versus POP)
A few other places where an extra 1us gets introduced as wait states stretch another M cycle.

I'll illustrate with the PUSH/POP thing. Take POP first.
You've got a single byte instruction. 1us.
You've got a low-byte read from (SP). 1us.
Whilst that's happening, SP is incremented and the result is ready for the next read.
You've got a high-byte read from (SP). 1us.
 Whilst that's happening, SP is incremented and the result is ready for the next instruction.
Total 3us.

For PUSH, it's similar.
You've got a single byte instruction. 1us.
 Before the first write, SP must be decremented. 1us.
You've got a high-byte write to (SP). 1us.
Whilst that's happening, SP is decremented again.
Finally you've got a low-byte write to (SP). 1us.
Total: 4us.

Another example, ADD A,B
You've got a single byte instruction. 1us.
 The ALU calculates A+B simultaneously with the next instruction decode, so free.
Total: 1us

Another example: ADD HL,DE
You've got a single byte instruction. 1us.
 The ALU calculates E+L. 1us
The ALU calculates D+H+carry. 1us (actually, I don't know why this isn't pipelined!)
Total: 3us
 
Another example: ADD IX,DE
Instruction prefix to modify HL to IX. 1us.
As per ADD HL,DE. 3us.
Total: 4us
 
 Final example: SET 2,B
You've got an instruction prefix. 1us.
 You've got another byte instruction. 1us.
 The ALU calculates B or (1<<2) simultaneously with the next instruction decode, so free
Total: 2us

 I've said it before, but it's worthwhile to download the Z80 UM and understand what's actually happening. You can infer from the cycle times for each M cycle the kind of thing it's doing and start to get a feel for how it works. As an example, ADD HL,rp is described as 4-4-3.

But certainly the timing for most instructions is down to the number of memory accesses. The number of exceptions is relatively few...
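
The rule of thumb above can be written down as a tiny Python sketch. The function name and the table layout are hypothetical; the access counts and extra cycles are taken from the worked examples in this post, not from a complete timing table, so treat it as an approximation only.

```python
def estimate_us(memory_accesses, extra=0):
    """Rule-of-thumb CPC timing: 1us per memory access plus any extra cycles."""
    return memory_accesses + extra

# (memory accesses including opcode fetches, extra microseconds)
EXAMPLES = {
    "POP rr":    (3, 0),  # opcode fetch + two stack reads
    "PUSH rr":   (3, 1),  # opcode fetch + two stack writes, +1 for the SP pre-decrement
    "ADD A,B":   (1, 0),  # ALU work overlaps the next fetch
    "ADD HL,DE": (1, 2),  # two 8-bit ALU passes
    "ADD IX,DE": (2, 2),  # prefix byte + opcode, then as ADD HL,DE
    "SET 2,B":   (2, 0),  # prefix byte + opcode
}

TIMINGS = {name: estimate_us(*args) for name, args in EXAMPLES.items()}
```

The resulting totals match the ones listed in the examples (3, 4, 1, 3, 4 and 2 microseconds respectively).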
« Last Edit: 01:45, 01 November 13 by ralferoo »

Offline Bruce Abbott

  • CPC664
  • ***
  • Posts: 53
  • Country: nz
  • Liked: 84
Re: Beware of slow instructions
« Reply #32 on: 04:02, 01 November 13 »
SET still only uses 7 T states,
According to my Zilog Z80 CPU User’s Manual SET bit,r uses 8 T states. But no matter, it works out the same in the end. 

Quote
But certainly the timing for most instructions is down to the number of memory accesses. The number of exceptions is relatively few...
I have done some more investigation, and it looks like most of the exceptions occur when a machine cycle takes 5 T states, as the CPU is then forced to wait for the next memory slot even if the next cycle only takes 3 T states!

So to calculate the total time, take the T-state count of each machine cycle, round it up to the next multiple of 4, add the results together, then divide the total by 4 to get microseconds.

For example:-

INI
 
T States (4, 5, 3, 4) -> 4, 8, 4, 4 = 20 (T states + wait states) -> 5us
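
As a quick check of this rounding rule, here is a minimal Python sketch (the function name is made up). It takes the per-machine-cycle T-state counts from the Zilog manual and applies the rule exactly as stated. It is only an approximation: ralferoo's OUT (C),r example earlier (4-4-4 in the manual, yet 4us on the CPC) is a case where this simple rule comes out 1us short.

```python
def cpc_microseconds(m_cycle_t_states):
    """Round each machine cycle up to a multiple of 4 T states, then convert to us."""
    stretched = [-(-t // 4) * 4 for t in m_cycle_t_states]  # ceiling to a multiple of 4
    return sum(stretched) // 4

ini_us = cpc_microseconds([4, 5, 3, 4])   # INI: 4, 8, 4, 4 = 20 T states
add_hl_us = cpc_microseconds([4, 4, 3])   # ADD HL,rp
set_us = cpc_microseconds([4, 4])         # SET b,r
```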
     
 

Offline AMSDOS

  • Supporter
  • 6128 Plus
  • *
  • Posts: 3.451
  • Country: au
    • Programs for Turbo Pascal 3
  • Liked: 704
Re: Beware of slow instructions
« Reply #33 on: 07:25, 01 November 13 »
I was pleasantly surprised to find that the timing was identical when running under WinAPE.

So does that mean that the Emulators don't consider the Clock Cycles with regard to Assembly Instructions?
* Using some of the hardly used Amstrad compilers :D
* I use Firmware in my Assembly code :P
* Have interpreted some BASIC 1.1 programs for BASIC 1.0. :)

Offline arnoldemu

  • Supporter
  • 6128 Plus
  • *
  • Posts: 5.329
  • Country: gb
    • Unofficial Amstrad WWW Resource
  • Liked: 2220
Re: Beware of slow instructions
« Reply #34 on: 10:53, 01 November 13 »
So does that mean that the Emulators don't consider the Clock Cycles with regard to Assembly Instructions?
Many emulators currently consider the time for the whole instruction for the timings you would see on a cpc.
Zilog documentations normally write in T states, because this is how the cpu operates.
But the CPC video logic tells the z80 to pause. This means that instructions are forced to a multiple of 4T states or 1us cycles.
So, when talking about CPC we need to consider this.

Timing on the spectrum is different, their video hardware forces different pauses on the z80, and it differs with each Spectrum model.
48k has different timings compared to 128k+3. Thankfully on CPC, CPC and Plus have the same overall instruction timings.

Some emus are now considering the exact timings, which consider exactly when the z80 in the cpc reads/writes to memory and I/O. This has more of an effect on when results are seen on the screen, especially when you think of changing the palette rapidly. In addition to this, the results you see are also "shifted" depending on the gate-array in the cpc. (So for example, if you write to a palette register using the same method on the cpc and plus, at the same point in the frame, then it's likely that the actual colour you see on the screen is in a different position; this time the timing is down to when the video accepts the change and performs it).

But, what is clear in this discussion is which instructions are quick and which are slow, and what cases they are useful in.

My games. My Games
My website with coding examples: Unofficial Amstrad WWW Resource

Offline TFM

  • Visit the mysteries of the CPC at www.futureos.de
  • Supporter
  • 6128 Plus
  • *
  • Posts: 9.899
  • Country: aq
  • Space Chicken for FutureOS is free!
    • FutureOS - The revolution on CPC!
  • Liked: 1972
Re: 32-character-width screen mode
« Reply #35 on: 21:19, 01 November 13 »
BTW... There are also some unexpected timings, like for LD BC,(NNNN) and LD IX,(NNNN). They need the same amount of time, but one would expect the instruction using IX to be 1us slower. But here it's not.


Offline TFM

  • Visit the mysteries of the CPC at www.futureos.de
  • Supporter
  • 6128 Plus
  • *
  • Posts: 9.899
  • Country: aq
  • Space Chicken for FutureOS is free!
    • FutureOS - The revolution on CPC!
  • Liked: 1972
Re: 32-character-width screen mode
« Reply #36 on: 21:23, 01 November 13 »
Not true. A simple example would be the use of SET when the sprite moves 2 pixels at a time in the Y axis:

Code: [Select]
...Do line...
set 3,d
...Do line...
Add &8, check for overflow
...Do line...
set 3,d
...Do line...
Add &8, check for overflow
etc


That check for overflow doesn't make it more precise. But you can use your example of course (the quick way) for 8 scanlines. However I was driving at pixel precise movement. But ok, this may not be always needed.
Every purpose has its own ideal strategy. For general routines, code should be able to cope with all that.




BTW: One doesn't have to use 8 scanlines / I mean &800 blocks, the CRTC offers more convenient solutions. Few games are using them though.

« Last Edit: 21:25, 01 November 13 by TFM »

Offline redbox

  • Supporter
  • 6128 Plus
  • *
  • Posts: 1.751
  • Country: gb
    • redbox
  • Liked: 326
Re: Beware of slow instructions
« Reply #37 on: 15:23, 02 November 13 »
T is a T-state in Z80 parlance. If there were no wait states, this is how many clock cycles it would take.

So if I'm using this counter to optimise routines, 1) can I assume that a lower number is faster and 2) what's the total on this timer for 1 frame on the CPC?

Offline Bruce Abbott

  • CPC664
  • ***
  • Posts: 53
  • Country: nz
  • Liked: 84
Re: Beware of slow instructions
« Reply #38 on: 22:58, 03 November 13 »
1) can I assume that a lower number is faster
Yes.
Quote
2) what's the total on this timer for 1 frame on the CPC?
There is no actual timer for T states. However WinAPE has a 'T' counter which shows  the number of microseconds since it was last reset. On the CPC every machine cycle that is not a multiple of 4 T states will have wait states added until it is. Therefore every instruction takes a whole number of microseconds. There are  20,000 microseconds in one 50Hz video frame.

Quote from: TFM
BTW... There are also some unexpected timings, like for LD BC,(NNNN) and LD IX,(NNNN). They need the same amount of time, but one would expect the instruction using IX to be 1us slower. But here it's not.
Good point. Most Z80 programmers try to avoid using IX and IY when they need speed, but on the CPC this may not always be the best strategy. Before deciding which instructions to use you should compare their timings on the CPC.   
 

Offline Executioner

  • Supporter
  • 6128 Plus
  • *
  • Posts: 783
  • Country: au
  • WinAPE Developer
    • WinAPE
  • Liked: 390
Re: Beware of slow instructions
« Reply #39 on: 04:25, 05 November 13 »
Yes. There is no actual timer for T states. However WinAPE has a 'T' counter which shows the number of microseconds since it was last reset. On the CPC every machine cycle that is not a multiple of 4 T states will have wait states added until it is.

WinAPE has microsecond accurate instruction timing for all instructions and accurate timing for interrupts (which can cause some instructions to delay slightly further). These timings have been measured and tested on both WinAPE and the real hardware.

Just to clarify (I posted more on this in another thread recently). The Z80 may get some wait states added on each instruction depending on the alignment with the clock. It appears the WAIT signal is only released for 1 of every 4 clock cycles. This means it's not actually possible to calculate the number of microseconds for a Z80 instruction simply by using the number of T-States or M-States as specified in the Zilog documentation, unless you look at the exact points the wait-states can be inserted. You are better to use the values found on other CPC specific timing documents on this site. If interrupts are likely to occur and you need accurate timing, it's a little more complicated.

Quote
Therefore every instruction takes a whole number of microseconds. There are  20,000 microseconds in one 50Hz video frame.

Close, it's actually 312 * 64 = 19968 microseconds per frame.

Offline cpcitor

  • The user previously known as FindYWay
  • CPC6128
  • ****
  • Posts: 235
  • Country: fr
  • My heart still runs on traditional CPC.
    • My code for the CPC.
  • Liked: 111
Re: Beware of slow instructions
« Reply #40 on: 16:17, 05 November 13 »
Close, it's actually 312 * 64 = 19968 microseconds per frame.

Does this mean that a CPC frame rate is not exactly 50Hz but 1000000/19968, about 50.08 Hz ?

Offline Executioner

  • Supporter
  • 6128 Plus
  • *
  • Posts: 783
  • Country: au
  • WinAPE Developer
    • WinAPE
  • Liked: 390
Re: Beware of slow instructions
« Reply #41 on: 00:20, 08 November 13 »
Does this mean that a CPC frame rate is not exactly 50Hz but 1000000/19968, about 50.08 Hz ?

Yes, that is correct.
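
For reference, the frame arithmetic from the last few posts as a short Python sketch (the constant names are made up; the figures are the ones quoted above):

```python
LINES_PER_FRAME = 312   # scanlines per frame, as quoted above
US_PER_LINE = 64        # microseconds per scanline, as quoted above

frame_us = LINES_PER_FRAME * US_PER_LINE   # microseconds per frame
frame_rate_hz = 1_000_000 / frame_us       # frames per second
shortfall_us = 20_000 - frame_us           # difference from a nominal 50Hz frame
```

So a frame is 32 microseconds shorter than the nominal 20,000, which gives a rate of about 50.08 Hz.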

Offline ralferoo

  • Supporter
  • 6128 Plus
  • *
  • Posts: 966
  • Country: gb
  • Liked: 579
Re: 32-character-width screen mode
« Reply #42 on: 13:49, 09 November 13 »
BTW... There are also some unexpected timings, like for LD BC,(NNNN) and LD IX,(NNNN). They need the same amount of time, but one would expect the instruction using IX to be 1us slower. But here it's not.
The timing on these instructions is totally as one would expect. There's not even a hint of the unexpected here.

LD HL,(xxxx) and LD (xxxx),HL were on the original 8080, so have normal opcodes (2A and 22).
These always take 5us due to the memory accesses - IF, addrL, addrH, dataL, dataH.

LD rr,(xxxx) and LD (xxxx),rr are Z80 additional instructions and so are assigned codes in the ED space. There is a duplicated HL version here, but nobody uses it.
These always take 6us due to the memory accesses - IFprefix, IF, addrL, addrH, dataL, dataH.

LD Ir,(xxxx) and LD (xxxx),Ir are Z80 additional instructions but implemented with the IX/IY override bytes (DD or FD) but otherwise use the HL form. These ALWAYS take 1us longer than the equivalent HL form.
These always take 6us due to the memory accesses - IFoverride, IF, addrL, addrH, dataL, dataH.

Offline fgbrain

  • CPC6128
  • ****
  • Posts: 200
  • Country: gr
    • Chaos CPC Homepage
  • Liked: 101
Re: 32-character-width screen mode
« Reply #43 on: 11:00, 17 November 15 »

back to our initial subject..

A 32 x 32 character (64 x 256 byte) screen allows an optimized way to calculate screen addresses, instead of evaluating the difficult and slow equation below.

PROBLEM:
How to calculate in asm this equation
Code: [Select]
ADR=&C000+(Y\8)*64+(Y MOD 8)*&800+X
where X is 0-63 and Y is 0-255
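
For anyone who wants to check values against this equation, here is a direct Python transcription (the function name is hypothetical; the screen base &C000 and the 64-byte row width come from the equation itself):

```python
def screen_address(x, y):
    """ADR = &C000 + (Y div 8)*64 + (Y mod 8)*&800 + X for a 64-byte-wide screen."""
    if not (0 <= x <= 63 and 0 <= y <= 255):
        raise ValueError("x must be 0-63 and y must be 0-255")
    return 0xC000 + (y // 8) * 64 + (y % 8) * 0x800 + x
```

For example, screen_address(0, 1) is &C800 (next scanline of the same character row) and screen_address(0, 8) is &C040 (first scanline of the next character row).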


SOLUTION
1. We create a 512-byte table (one entry per pixel line, 256 lines in Y) holding the address of each successive line in Y, say at &A400  (&C000,&C800,&D000,.....).  WARNING:  the two bytes of each address are not stored together, but split like this:
    &C000 :   &00 poked at &A400   and &C0 poked at &A500  (+256 bytes),
    &C800 :   &00 poked at &A401   and &C8 poked at &A501  (+256 bytes), etc etc...
This is vital for speed purposes: both bytes can then be fetched using only 8-bit register arithmetic (8bit number madness).

2. now we can use this easy routine:


Code: [Select]
; Entry: H = X (0-63), L = Y (0-255)
LD A,H:LD H,&A4:ADD A,(HL):INC H:LD H,(HL):LD L,A
; Exit: HL = screen address
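
To convince yourself that the split table gives the same result as the equation, here is a minimal Python model of the routine. The names are made up, and the two Python lists stand in for the 256-byte pages at &A400 (low bytes) and &A500 (high bytes).

```python
def build_tables():
    """Low-byte and high-byte tables, one entry per pixel line (Y = 0-255)."""
    low, high = [], []
    for y in range(256):
        line = 0xC000 + (y // 8) * 64 + (y % 8) * 0x800
        low.append(line & 0xFF)   # would be poked at &A400 + Y
        high.append(line >> 8)    # would be poked at &A500 + Y
    return low, high

LOW, HIGH = build_tables()

def lookup_address(x, y):
    a = (x + LOW[y]) & 0xFF   # LD A,H : LD H,&A4 : ADD A,(HL)
    h = HIGH[y]               # INC H : LD H,(HL)
    return (h << 8) | a       # LD L,A
```

The 8-bit add never carries, because the low bytes are only ever 0, 64, 128 or 192 and X is at most 63. That is why the routine does not need a 16-bit addition.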


« Last Edit: 20:54, 21 November 15 by fgbrain »
_____

6128 (UK keyboard, Crtc type 0/2), 6128+ (UK keyboard), 3.5" and 5.25" drives, Reset switch and Digiblaster (selfmade), Inicron Romram box, Bryce Megaflash, SVideo & PS/2 mouse, , Magnum Lightgun, X-MEM, X4 Board, C4CPC, Multiface2 X4, RTC X4 and Gotek USB Floppy emulator.

Offline rk last

  • CPC664
  • ***
  • Posts: 55
  • Country: 00
  • Liked: 41
Re: 32-character-width screen mode
« Reply #44 on: 02:21, 13 August 16 »
When you work with a reduced display, how do you retune it to allow a valid LOCATE?  As I recall, there is a formula that executes a cursor xpos=1, ypos=1 to return to the top left hand corner or anywhere else correctly on screen.

maRK

Offline AMSDOS

  • Supporter
  • 6128 Plus
  • *
  • Posts: 3.451
  • Country: au
    • Programs for Turbo Pascal 3
  • Liked: 704
Re: 32-character-width screen mode
« Reply #45 on: 06:44, 13 August 16 »

When you work with a reduced display, how do you retune it to allow a valid LOCATE?  As I recall, there is a formula that executes a cursor xpos=1, ypos=1 to return to the top left hand corner or anywhere else correctly on screen.


If it's in MODE 0 I do:


x=x*32 - where x is 0..19. That is for a 20-character width, though; x=x*20 will give you a 32-character width.


LOCATE is purely text only, so TAG..TAGOFF & MOVE can be used to position the text.


And for y I use y=398-(y*16) - where y is 0..24.


But you're talking about the Spectrum mode, which is 32 character width in mode 1.
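
A small Python sketch of those two formulas, in case it helps. The function name is hypothetical, and it assumes the usual 640-unit-wide graphics coordinate system, so that 640/20 = 32 and 640/32 = 20 give the two multipliers mentioned above.

```python
def tag_position(col, row, columns=32):
    """Graphics coordinates for a character cell when printing with TAG."""
    x = col * (640 // columns)   # 20 units per column for 32 columns, 32 for 20
    y = 398 - row * 16           # 16 graphics units per text row
    return x, y
```

The result would be passed to MOVE before printing with TAG switched on.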
« Last Edit: 06:47, 13 August 16 by AMSDOS »