32-character-width screen mode

Started by ervin, 15:11, 01 August 13


cpcitor

Quote from: Bruce Abbott on 07:00, 31 October 13
NOP 1.0us
SET 3,D 2.0us
LD A,D : ADD A,8: LD D,A 4.0us 

Whoops, my mistake. I should have double-checked the chart. :-[ You're totally right. :)

Thanks Bruce for this test and the program! It can be used as a basis for more performance computation (see Using emulator for performance measurement and profiling ?).

Tied to Y MOD 8 = 0 ?

So, using SET and RES speeds up the "next_scr_line" part of the routine.
You mentioned it also forces you to store sprite data in a different order (which is actually Gray code order).

Doesn't the SET/RES version also tie Y to positions equal to 0 modulo 8 ?
Had a CPC since 1985, currently software dev professional, including embedded systems.

I made in 2013 the first CPC cross-dev environment that auto-installs C compiler and tools: cpc-dev-tool-chain: a portable toolchain for C/ASM development targetting CPC, later forked into CPCTelera.

Axelay

Quote from: cpcitor on 11:09, 31 October 13


Doesn't the SET/RES version also tie Y to positions equal to 0 modulo 8 ?


If you set up your sprite data for a character aligned Y co-ord yes, but you can also gain a speed benefit with sprites aligned to 2 or 4 pixel lines rather than the full 8 lines of a character.  Just a matter of choosing the right balance for your game.

redbox

Quote from: Bruce Abbott on 07:00, 31 October 13
NOP 1.0us
SET 3,D 2.0us
LD A,D : ADD A,8: LD D,A 4.0us 

So what unit does the WinApe debugger use?  I get 5 for SET and 7 for LD A etc.

Quote from: Axelay on 11:58, 31 October 13
If you set up your sprite data for a character aligned Y co-ord yes, but you can also gain a speed benefit with sprites aligned to 2 or 4 pixel lines rather than the full 8 lines of a character.  Just a matter of choosing the right balance for your game.

I agree.  In a non-scrolling game, for tiles I use SET, RES etc as they are always on a boundary.  For sprites I use a combination of SET and a Next Line routine because they are always bound to a multiple of 2 (moving 2 pixels at a time makes much more sense on a CPC).

Bruce Abbott

Quote from: redbox on 12:13, 31 October 13
So what unit does the WinApe debugger use?
I can't find any documentation for WinAPE's 'T' value, but in practice it is equal to the execution time in microseconds. I suspect it gets this number by counting the number of M cycles per instruction. On the CPC, all M cycles take 1us (4 T states) as any that would normally take less are 'stretched' by adding wait states.

Quote
I get 5 for SET and 7 for LD A etc.
SET b,r uses 8 T states and 2 M cycles, which takes 2us on a 4MHz Z80.
Each LD r,r uses 1 M cycle, and ADD A,n uses 2 M cycles. Therefore the instruction sequence LD A,D:ADD A,8:LD D,A uses 4 M cycles.
           

TFM

SET and RES in sprite code is very, _VERY_ limited. It works well, as long as you move your sprite _ONLY_ in X, but as soon as you need to move it up and down it will not work any longer.

TFM of FutureSoft
Also visit the CPC and Plus users favorite OS: FutureOS - The Revolution on CPC6128 and 6128Plus

redbox

Quote from: TFM on 22:10, 31 October 13
It works well, as long as you move your sprite _ONLY_ in X, but as soon as you need to move it up and down it will not work any longer.

Not true. A simple example would be to use SET if the sprite moves 2 pixels at a time in the Y axis:


...Do line...
set 3,d
...Do line...
Add &8, check for overflow
...Do line...
set 3,d
...Do line...
Add &8, check for overflow
etc
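
As a rough check of why SET 3,D works on even lines only, here is a minimal Python sketch of the address arithmetic (it models addresses, it is not Z80 code). It assumes the 64-byte-wide screen based at &C000 discussed in this thread; the function names and BYTES_PER_ROW are made up for the illustration.

```python
# Sketch only: models the address arithmetic, not real Z80 code.
# Assumes a 64-byte-wide screen based at &C000 (as in this thread).
BYTES_PER_ROW = 0x40  # hypothetical name: bytes per character row


def line_address(y):
    """Address of the first byte of pixel line y (0-255)."""
    return 0xC000 + (y // 8) * BYTES_PER_ROW + (y % 8) * 0x800


def next_line_set(addr):
    """Even line -> odd line: SET 3,D just sets bit 11 of the address."""
    return addr | 0x0800


def next_line_add(addr):
    """Odd line -> even line: add &800, then fix up any overflow."""
    addr += 0x0800
    if addr > 0xFFFF:  # ran past the end of the 16K screen
        addr = (addr + 0xC000 + BYTES_PER_ROW) & 0xFFFF
    return addr
```

On an even line bit 3 of D is always clear, so setting it is the same as adding &800. On an odd line the bit is already set, which is why the slower add and the overflow check are still needed there.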

ralferoo

#31
Quote from: Bruce Abbott on 20:19, 31 October 13
I can't find any documentation for WinAPE's 'T' value, but in practice it is equal to the execution time in microseconds. I suspect it gets this number by counting the number of M cycles per instruction.
T is a T-state in Z80 parlance. If there were no wait states, this is how many clock cycles it would take.
Quote
On the CPC, all M cycles take 1us (4 T states) as any that would normally take less are 'stretched' by adding wait states.
Exactly. Well kind of. Some instructions with an M-cycle that contains 4 T states can be stretched too in some cases, e.g. OUT (C),r takes 4-4-4, but gets stretched to 4-5-4 and so takes 4us on CPC.
Quote
SET b,r uses 8 T states and 2 M cycles, which takes 2us on a 4MHz Z80.
Each LD r,r uses 1 M cycle, and ADD A,n uses 2 M cycles. Therefore the instruction sequence LD A,D:ADD A,8:LD D,A uses 4 M cycles.
SET still only uses 7 T states, it's just there's an extra wait state. But otherwise, yes, exactly right.

The easiest way of thinking about the clock cycles on the Z80 is:
1us for every memory access (including instruction fetch, so 2us for a 2 byte instruction, etc)
1us extra for a 16-bit math operation (e.g. ADD HL,DE) where the ALU does 2 cycles instead of 1
1us extra for an IO access
1us if things can't be pipelined (e.g. PUSH versus POP)
A few other places where an extra 1us gets introduced as wait states stretch another M cycle.

I'll illustrate with the PUSH/POP thing. Take POP first.
You've got a single byte instruction. 1us.
You've got a low-byte read from (SP). 1us.
Whilst that's happening, SP is incremented and the result is ready for the next read.
You've got a high-byte read from (SP). 1us.
Whilst that's happening, SP is incremented and the result is ready for the next instruction.
Total 3us.

For PUSH, it's similar.
You've got a single byte instruction. 1us.
Before the first write, SP must be decremented. 1us.
You've got a high-byte write to (SP). 1us.
Whilst that's happening, SP is decremented again.
Finally you've got a low-byte write to (SP). 1us.
Total: 4us.

Another example, ADD A,B
You've got a single byte instruction. 1us.
The ALU calculates A+B simultaneously with the next instruction decode, so free.
Total: 1us

Another example: ADD HL,DE
You've got a single byte instruction. 1us.
The ALU calculates E+L. 1us
The ALU calculates D+H+carry. 1us (actually, I don't know why this isn't pipelined!)
Total: 3us

Another example: ADD IX,DE
Instruction prefix to modify HL to IX. 1us.
As per ADD HL,DE. 3us.
Total: 4us

Final example: SET 2,B
You've got an instruction prefix. 1us.
You've got another byte instruction. 1us.
The ALU calculates B or (1<<2) simultaneously with the next instruction decode, so free
Total: 2us

I've said it before, but it's worthwhile downloading the Z80 UM and understanding what's actually happening. You can infer from the cycle times for each M cycle the kind of thing it's doing and start to get a feel for how it works. As an example, ADD HL,rp is described as 4-4-3.

But certainly the timing for most instructions is down to the number of memory accesses. The number of exceptions is relatively few...
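
The rule of thumb above can be written down as a tiny Python sketch. It is only the approximation described in this post, not a full timing table; the function name and parameters are invented here, and the "extra" values are taken straight from the worked examples.

```python
# Rule-of-thumb model from the post above; a sketch, not a timing table.
def cpc_us(opcode_bytes, data_accesses=0, extra=0):
    """Estimate the CPC execution time in microseconds.

    opcode_bytes  : bytes fetched for the instruction (prefixes included)
    data_accesses : further memory reads or writes
    extra         : 1us penalties (16-bit ALU steps, I/O, no pipelining)
    """
    return opcode_bytes + data_accesses + extra


examples = {
    "POP HL": cpc_us(1, data_accesses=2),            # two stack reads
    "PUSH HL": cpc_us(1, data_accesses=2, extra=1),  # SP decrement first
    "ADD A,B": cpc_us(1),                            # ALU work is free
    "ADD HL,DE": cpc_us(1, extra=2),                 # two ALU steps
    "ADD IX,DE": cpc_us(2, extra=2),                 # prefix + ADD HL,DE
    "SET 2,B": cpc_us(2),                            # prefix + opcode
}
```

The values it produces match the totals worked out above (3, 4, 1, 3, 4 and 2us), but the exceptions mentioned in the post are not modelled.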

Bruce Abbott

Quote from: ralferoo on 00:41, 01 November 13
SET still only uses 7 T states,
According to my Zilog Z80 CPU User's Manual SET bit,r uses 8 T states. But no matter, it works out the same in the end. 

Quote
But certainly the timing for most instructions is down to the number of memory accesses. The number of exceptions is relatively few...
I have done some more investigation, and it looks like most of the exceptions occur when a machine cycle takes 5 T states, as the CPU is then forced to wait for the next memory slot even if the next cycle only takes 3 T states!

So to calculate the total time, take the T-state count of each machine cycle, round it up to a multiple of 4, then divide the total by 4.

For example:-

INI

T States (4, 5, 3, 4) -> 4, 8, 4, 4 = 20 (T states + wait states) -> 5us
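
A minimal Python sketch of that rounding rule, assuming the per-machine-cycle T-state counts are copied from the Zilog manual (the function name is made up). It is only this rule of thumb, so instructions with other wait-state quirks may come out wrong.

```python
def cpc_us_from_m_cycles(t_states):
    """Round each machine cycle up to a multiple of 4 T states,
    then divide the total by 4 to get microseconds."""
    padded = [(t + 3) // 4 * 4 for t in t_states]
    return sum(padded) // 4


ini_us = cpc_us_from_m_cycles([4, 5, 3, 4])  # INI: 4, 8, 4, 4 = 20 T
set_us = cpc_us_from_m_cycles([4, 4])        # SET b,r
```

For INI this gives 5us and for SET b,r 2us, the same as the figures above.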
     
 

AMSDOS

Quote from: Bruce Abbott on 07:00, 31 October 13
I was pleasantly surprised to find that the timing was identical when running under WinAPE.

So does that mean that the Emulators don't consider the Clock Cycles with regard to Assembly Instructions?
* Using the old Amstrad Languages :D   * with the Firmware :P
* I also like to problem solve code in BASIC :)   * And type-in Type-Ins! :D

Home Computing Weekly Programs
Popular Computing Weekly Programs
Your Computer Programs
Updated Other Program Links on Profile Page (Update April 16/15 phew!)
Programs for Turbo Pascal 3

arnoldemu

Quote from: AMSDOS on 06:25, 01 November 13
So does that mean that the Emulators don't consider the Clock Cycles with regard to Assembly Instructions?
Many emulators currently consider the time for the whole instruction for the timings you would see on a cpc.
Zilog documentations normally write in T states, because this is how the cpu operates.
But the CPC video logic tells the z80 to pause. This means that instructions are forced to a multiple of 4T states or 1us cycles.
So, when talking about CPC we need to consider this.

Timing on the spectrum is different, their video hardware forces different pauses on the z80, and it differs with each Spectrum model.
48k has different timings compared to 128k+3. Thankfully on CPC, CPC and Plus have the same overall instruction timings.

Some emus are now considering the exact timings, which consider exactly when the z80 in the cpc reads/writes to memory and I/O. This has more of an effect on when results are seen on the screen, especially when you think of changing the palette rapidly. In addition to this, the results you see are also "shifted" depending on the gate-array in the cpc. (So for example, if you write to a palette register using the same method on the cpc and plus, at the same point in the frame, then it's likely that the actual colour you see on the screen is in a different position; this time the timing is down to when the video accepts the change and performs it).

But, what is clear in this discussion is which instructions are quick and which are slow, and what cases they are useful in.

My games. My Games
My website with coding examples: Unofficial Amstrad WWW Resource

TFM

BTW... There are also some unexpected timings like for LD BC,(NNNN) and LD IX,(NNNN). They need the same amount of time, but one would expect that the instruction using IX is 1us slower. But here it's not.


TFM

#36
Quote from: redbox on 00:04, 01 November 13
Not true. A simple example would be to use SET if the sprite moves 2 pixels at a time in the Y axis:


...Do line...
set 3,d
...Do line...
Add &8, check for overflow
...Do line...
set 3,d
...Do line...
Add &8, check for overflow
etc



That check for overflow doesn't make it more precise. But you can use your example of course (the quick way) for 8 scanlines. However I was driving at pixel precise movement. But ok, this may not be always needed.
Every purpose has its own ideal strategy. For general routines, code should be able to cope with all of that.




BTW: One doesn't have to use 8 scanlines / I mean &800 blocks, the CRTC offers more convenient solutions. Few games are using them though.


redbox

Quote from: ralferoo on 00:41, 01 November 13
T is a T-state in Z80 parlance. If there were no wait states, this is how many clock cycles it would take.

So if I'm using this counter to optimise routines, 1) can I assume that a lower number is faster and 2) what's the total on this timer for 1 frame on the CPC?

Bruce Abbott

Quote from: redbox on 14:23, 02 November 13
1) can I assume that a lower number is faster
Yes.
Quote
2) what's the total on this timer for 1 frame on the CPC?
There is no actual timer for T states. However WinAPE has a 'T' counter which shows the number of microseconds since it was last reset. On the CPC every machine cycle that is not a multiple of 4 T states will have wait states added until it is. Therefore every instruction takes a whole number of microseconds. There are 20,000 microseconds in one 50Hz video frame.

Quote from: TFM
BTW... There are also some unexpected timings like for LD BC,(NNNN) and LD IX,(NNNN). They need the same amount of time, but one would expect that the instruction using IX is 1us slower. But here it's not.
Good point. Most Z80 programmers try to avoid using IX and IY when they need speed, but on the CPC this may not always be the best strategy. Before deciding which instructions to use you should compare their timings on the CPC.   
 

Executioner

Quote from: Bruce Abbott on 21:58, 03 November 13
Yes. There is no actual timer for T states. However WinAPE has a 'T' counter which shows the number of microseconds since it was last reset. On the CPC every machine cycle that is not a multiple of 4 T states will have wait states added until it is.

WinAPE has microsecond accurate instruction timing for all instructions and accurate timing for interrupts (which can cause some instructions to delay slightly further). These timings have been measured and tested on both WinAPE and the real hardware.

Just to clarify (I posted more on this in another thread recently). The Z80 may get some wait states added on each instruction depending on the alignment with the clock. It appears the WAIT signal is only released for 1 of every 4 clock cycles. This means it's not actually possible to calculate the number of microseconds for a Z80 instruction simply by using the number of T-States or M-States as specified in the Zilog documentation, unless you look at the exact points the wait-states can be inserted. You are better to use the values found on other CPC specific timing documents on this site. If interrupts are likely to occur and you need accurate timing, it's a little more complicated.

Quote
Therefore every instruction takes a whole number of microseconds. There are 20,000 microseconds in one 50Hz video frame.

Close, it's actually 312 * 64 = 19968 microseconds per frame.

cpcitor

Quote from: Executioner on 03:25, 05 November 13
Close, it's actually 312 * 64 = 19968 microseconds per frame.

Does this mean that a CPC frame rate is not exactly 50Hz but 1000000/19968, about 50.08 Hz ?

Executioner

Quote from: cpcitor on 15:17, 05 November 13
Does this mean that a CPC frame rate is not exactly 50Hz but 1000000/19968, about 50.08 Hz ?

Yes, that is correct.
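
For anyone who wants to redo the arithmetic, a short Python sketch using only the figures quoted above (312 lines of 64us each; the constant names are invented for the example):

```python
LINES_PER_FRAME = 312  # scanlines per frame, as quoted above
US_PER_LINE = 64       # microseconds per scanline, as quoted above

frame_us = LINES_PER_FRAME * US_PER_LINE  # 19968
frame_hz = 1_000_000 / frame_us           # a little over 50 Hz

print(frame_us, round(frame_hz, 2))       # prints: 19968 50.08
```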

ralferoo

Quote from: TFM on 20:19, 01 November 13
BTW... There are also some unexpected timings like for LD BC,(NNNN) and LD IX,(NNNN). They need the same amount of time, but one would expect that the instruction using IX is 1us slower. But here it's not.
The timing on these instructions is totally as one would expect. There's not even a hint of the unexpected here.

LD HL,(xxxx) and LD (xxxx),HL were on the original 8080, so have normal opcodes (2A and 22).
These always take 5us due to the memory accesses - IF, addrL, addrH, dataL, dataH.

LD rr,(xxxx) and LD (xxxx),rr are Z80 additional instructions and so are assigned codes in the ED space. There is a duplicated HL version here, but nobody uses it.
These always take 6us due to the memory accesses - IFprefix, IF, addrL, addrH, dataL, dataH.

LD Ir,(xxxx) and LD (xxxx),Ir are Z80 additional instructions but implemented with the IX/IY override bytes (DD or FD) but otherwise use the HL form. These ALWAYS take 1us longer than the equivalent HL form.
These always take 6us due to the memory accesses - IFoverride, IF, addrL, addrH, dataL, dataH.

fgbrain

#43

back to our initial subject..

A 32 x 32 character (64 x 256 bytes) screen allows an optimized way to calculate screen addresses, instead of evaluating the slow and awkward equation below.

PROBLEM:
How to calculate in asm this equation

ADR=&C000+(Y\8)*64+(Y MOD 8)*&800+X
where X is 0-63 and Y is 0-255



SOLUTION
1. We create a 512-byte table (one entry for each of the 256 pixel lines in Y) holding the address of each successive line in Y, say at &A400 (&C000, &C800, &D000, ...). WARNING: the two bytes of each address are not stored together, but like this:
    &C000 :   &00 poked at &A400   and &C0 poked at &A500  (+256 bytes),
    &C800 :   &00 poked at &A401   and &C8 poked at &A501  (+256 bytes), etc etc...
This is vital for speed purposes (8bit number madness).

2. now we can use this easy routine:



; In:  H = X and L = Y coordinates, where X is 0-63 and Y is 0-255
LD A,H:LD H,&A4:ADD A,(HL):INC H:LD H,(HL):LD L,A
; Out: HL = screen address
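
To convince yourself the split table gives the same result as the equation, here is a Python sketch that builds the table and mirrors the routine one instruction at a time. The table address &A400 is from the post; the memory array and function names are just for this illustration.

```python
TABLE = 0xA400                 # table address taken from the post
memory = bytearray(0x10000)    # stand-in for the CPC's 64K


def line_address(y):
    """The slow equation, for X = 0."""
    return 0xC000 + (y // 8) * 64 + (y % 8) * 0x800


# Low bytes go at &A400 + Y, high bytes at &A500 + Y.
for y in range(256):
    addr = line_address(y)
    memory[TABLE + y] = addr & 0xFF
    memory[TABLE + 0x100 + y] = addr >> 8


def screen_address(x, y):
    """Mirror of the routine: H=X, L=Y on entry, HL=address on exit."""
    h, l = x, y
    a = h                                   # LD A,H
    h = TABLE >> 8                          # LD H,&A4
    a = (a + memory[(h << 8) | l]) & 0xFF   # ADD A,(HL)
    h += 1                                  # INC H
    h = memory[(h << 8) | l]                # LD H,(HL)
    l = a                                   # LD L,A
    return (h << 8) | l
```

The 8-bit add is safe here because the low byte of a line address is at most &C0 on a 64-byte-wide screen, so adding an X of 0-63 never carries into the high byte.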



_____

6128 (UK keyboard, Crtc type 0/2), 6128+ (UK keyboard), 3.5" and 5.25" drives, Reset switch and Digiblaster (selfmade), Inicron Romram box, Bryce Megaflash, SVideo & PS/2 mouse, , Magnum Lightgun, X-MEM, X4 Board, C4CPC, Multiface2 X4, RTC X4 and Gotek USB Floppy emulator.

rk last

When you work with a reduced display how do you retune it to allow a valid LOCATE?  As I recall, there is a formula that executes a cursor xpos=1, ypos=1 to return to the top left hand corner or anywhere else correctly on screen.

maRK

AMSDOS

#45

Quote from: rk last on 00:21, 13 August 16
When you work with a reduced display how do you retune it to allow a valid LOCATE?  As I recall, there is a formula that executes a cursor xpos=1, ypos=1 to return to the top left hand corner or anywhere else correctly on screen.


If it's in MODE 0 I do:


x=x*32 - where x is 0..19. That's for 20-character-width, though, so for 32 columns use x=x*20 (where x is 0..31).


LOCATE is purely text only, so TAG..TAGOFF & MOVE can be used to position the text.


And for y I use y=398-(y*16) - where y is 0..24.


But you're talking about the Spectrum mode, which is 32 character width in mode 1.
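
The conversion above can be sketched in Python like this. It assumes the usual 640 x 400 graphics coordinate space, so a 32-column layout has 20 units per character; the function name and defaults are made up for the example.

```python
def tag_position(col, row, char_width=20, char_height=16):
    """Graphics coordinates to MOVE to before printing with TAG.

    col is 0..31 for a 32-column layout (char_width=20),
    row is 0..24, with row 0 at the top of the screen.
    """
    return col * char_width, 398 - row * char_height


top_left = tag_position(0, 0)        # (0, 398)
bottom_right = tag_position(31, 24)  # (620, 14)
```

Passing char_width=32 gives the 20-column MODE 0 spacing instead.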
