ROM board with a tiny DMA engine

Started by doragasu, 23:33, 14 January 17


doragasu

I'm thinking about making a ROM board for the CPCs (not plus). I want to use a CPLD instead of discrete logic for the ROM register and the decoding logic.

While looking at the signals on the CPC EXP connector, I saw the BUSREQ and BUSAK signals and thought: why not try to fit a tiny DMA controller in the CPLD to accelerate copying data from ROM? It sounds nice to me, but as I have not studied the CPC internals, I don't know whether this is possible or what limitations it has. Some things that I think might be troublesome:

       
  • RAM refresh: I would like to avoid having to handle refresh. Will this impose a limit on the maximum block size I can copy without losing data?
  • Collision with internal CPC peripherals: to do DMA, is it enough just to request the bus (BUSREQ) and start when I get the BUSAK? Or should I be careful with other CPC internal peripherals that might access the bus while the CPU has given me ownership (e.g. the CRTC accessing RAM to draw the screen)?
Any suggestion or warning is welcome  ;D

rpalmer

doragasu,

Good luck trying to get a DMA to work, but from what I understand the Z80 "wait" signal is used to get video data and if you use this as well you will have trouble.

The BUSREQ/BUSACK signals are a matched pair for block transfers, and the transfer must be such that the data transfer speed does not exceed the speed of the memory itself.

One other factor to account for is interrupt processing, which further complicates a DMA implementation, and another is that external memory expansions may have other limitations.

rpalmer

doragasu

Thanks for the info. It looks like it will not be easy, so I think I'll abandon the idea  :( . I have found an expansion board using an Intel 8237 DMA controller, but it is for Aleste CPC clones, that had extra pins for DMA in the expansion port...

robcfg

Could you take some pictures of that board, please?



doragasu

I'm sorry but I don't have such pictures. When I say "I found an expansion board...", I mean I found data about it on the Internet. Fortunately somebody already took photographs, you can find them here.

robcfg

Cool, thanks!



doragasu

OK, I have been reading here and there, and having a look at the CPC6128 schematics, and I think I'm missing some details to decide whether the DMA is realizable or not, so here are some more questions...

1.- READY signal pulse duration: I have read that each microsecond, the GA pulls the READY signal low to introduce WAIT states on the CPU while it fetches two bytes from memory. The reference above also states that this causes instruction execution times to be rounded to 1us, so if I understand this mechanism properly, between one and three wait states are inserted depending on the instruction being executed. Is this right? How is this accomplished? I don't think the GA is partially decoding instructions to know how many WAIT states to insert, I'm sure it must be a quite simple mechanism (that doesn't come to my mind).

2.- CPU WAIT states insertion: READY signal is used to insert WAIT states on the CPU. If I understand this mechanism correctly, this signal must only be asserted during memory reads and writes. So how does the GA know when to pull low READY signal? Again I don't think the GA partially decodes instructions...

3.- /CPU signal: while browsing schematics, I can see that /CPU and /CASADDR signals from the GA, are both used to multiplex CPU and CRTC addresses (also rows and columns). So, I suppose /CPU signal should be 1 MHz with a 25% duty cycle, right? Also how are /CPU and READY signals related? Is the rising edge of /CPU synchronized with the falling edge of READY?

Is there detailed documentation about how exactly these signals work?

PulkoMandy

You can look at the "gate array decapped" thread, where people have reverse engineered the gate array from pictures of the die. It is indeed a rather simple chip.


It just locks the CPU bus whenever it needs to access the RAM. Sometimes the CPU does not need the BUS, and no extra wait state results. Sometimes the CPU needs to access the bus, and it has to wait.


If you want to do DMA, you can use the BUSRQ/BUSAK signals to lock the z80 from accessing the bus. But, you can't stop the gate array! This means your DMA engine will need to watch for the WAIT/READY actions from the gate array and only use the remaining cycles. Possible, but probably not that simple.

doragasu

Thanks for the info!

So it just generates 1 wait state each microsecond "without looking" what's happening on the CPU, right? And if the CPU is not reading/writing, it just ignores the wait state, right?

PulkoMandy

Yes, that's the idea. I'm not sure about the exact sequencing of what happens on the pins, but the CPU will wait only when it tries to access the bus; during wait states it can still perform any internal operations.

rpalmer

Pulkomandy,

The WAIT/READY as I understand it is mostly for Video data access.
You can use this on a real Z80 DMA to pause the transfers, but how effective the DMA then becomes is another issue.

See attached, CE/WAIT details for how this works.

rpalmer

doragasu

I have quickly browsed the Z80 DMA engine, and I have seen that it takes at least 3 cycles (750ns@4MHz) to read a byte, and another 3 to write it. Cycle time of the RAM chips is 270ns, so maybe it could be cut to 2 cycles to read and 2 to write. This is of course without taking into account the READY signal.
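As a quick sanity check of those figures, here is a small Python sketch of the arithmetic. It only assumes the 4 MHz clock and the cycle counts mentioned above; the function names are made up for illustration, and READY wait states are ignored, as in the post.

```python
# Back-of-the-envelope DMA throughput at the CPC's 4 MHz CPU clock.
# Wait states from the READY signal are deliberately not modelled.
CLOCK_HZ = 4_000_000
CYCLE_NS = 1e9 / CLOCK_HZ  # 250 ns per clock cycle


def ns_per_byte(read_cycles, write_cycles):
    """Time to move one byte: one read followed by one write."""
    return (read_cycles + write_cycles) * CYCLE_NS


def bytes_per_second(read_cycles, write_cycles):
    """Sustained transfer rate for the given read/write cycle counts."""
    return 1e9 / ns_per_byte(read_cycles, write_cycles)


for label, rd, wr in [("3 + 3 cycles", 3, 3), ("2 + 2 cycles", 2, 2)]:
    print(f"{label}: {ns_per_byte(rd, wr):.0f} ns/byte, "
          f"{bytes_per_second(rd, wr) / 1000:.0f} KB/s")
```

So the 3 + 3 cycle case costs 1.5 us per byte, and trimming to 2 + 2 cycles would bring it down to 1 us per byte, before any wait states.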

Unfortunately there is no access to the /CAS and /RAS signals. Access to them would have allowed page reads/page writes in 1 cycle each (except for the first one).

It would be interesting to see the timing of the /CAS and /RAS signals generated by the GA.

doragasu

I have coded a preliminary implementation of the DMA that reads from flash/ROM (activating fn_oe and fn_ce) and writes to CPC RAM (activating n_wr and n_mreq). In this preliminary implementation, each read/write takes two CPU cycles (I suppose this can be optimized, more on this later). I have also implemented wait states.

I wrote a dirty simulation that copies 3 bytes using DMA, from ROM at $C000 to RAM at $1000. On the second byte write I have inserted a wait state, and two on the third write. This is the resulting simulation. Tips/corrections are welcome:
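The cycle count of that scenario can be estimated with a tiny Python model. This is only a sketch and not the CPLD code: it assumes two cycles per read and two per write as described above, treats each wait state as one extra cycle on the write, and the function name `dma_copy_cycles` is hypothetical.

```python
def dma_copy_cycles(n_bytes, read_cycles=2, write_cycles=2, write_waits=None):
    """Return (total_cycles, trace) for copying n_bytes.

    write_waits maps a byte index to the number of wait states
    (extra cycles) inserted during that byte's write.
    """
    write_waits = write_waits or {}
    total = 0
    trace = []
    for i in range(n_bytes):
        cycles = read_cycles + write_cycles + write_waits.get(i, 0)
        trace.append((i, cycles))
        total += cycles
    return total, trace


SRC, DST = 0xC000, 0x1000  # ROM source and RAM destination from the post
# One wait state on the second write, two on the third.
total, trace = dma_copy_cycles(3, write_waits={1: 1, 2: 2})
for i, cycles in trace:
    print(f"byte {i}: ${SRC + i:04X} -> ${DST + i:04X} in {cycles} cycles")
print(f"total: {total} cycles = {total * 250} ns at 4 MHz")
```

Under these assumptions the three bytes take 4, 5 and 6 cycles, 15 cycles in total.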



Now let's start with the questions:

1. How does it look? Is there anything wrong? Should this work?
2. As I wrote before, this can be optimized. At least I'm sure that I can read from Flash/ROM in a single cycle (Flash chip is rated 90 ns). But can I write to RAM also in a single cycle? I suspect it might be possible. Why? Because of two reasons:
- First: GA reads. I think I read somewhere (I cannot find where) that each microsecond, the GA inserts a wait state, and during it, it reads TWO BYTES. So if in a single cycle, two bytes are read, it should be possible to read just one, shouldn't it?
- Second: The wait states mechanism. If my interpretation of the CPC6128 schematics AND the wait states mechanism is right, for example if the CPU wants to read a byte, and the GA inserts a wait state at T2 (see graph below), the CPU buses are "disconnected" from RAM (multiplexed) for the CRTC/GA to read the data. When the wait state is removed, the CPU continues the read operation: CPU address and control signals are applied to the RAM, which must perform the read IN A SINGLE CYCLE (T3 in graph below). Is this correct? If so, a single cycle must be enough for the RAM to be read (or I am missing something).



But there is also something that contradicts this... I had a look at the DRAM datasheet (HM4864-2) and it has a 270 ns cycle time... a bit more than a 250 ns cycle... So maybe my interpretation of the schematics and/or the wait states mechanism is wrong...

Bryce

Liked. Not because I think you'll manage it, but because you've gone to the bother of trying it.

Bryce.

doragasu

Quote from: Bryce on 22:46, 24 January 17
Liked. Not because I think you'll manage it, but because you've gone to the bother of trying it.

Bryce.

Hehe, thanks for my first like on this forum.

It looks like most people think that it is not doable, or that it is very very difficult to get DMA to work, but I'm still missing a detailed explanation about why. I understand that I must make wait states pause the DMA (using the READY signal), and that interrupts might cause problems. But if I implement the wait states and if interrupts are disabled when using DMA (which makes sense, because during DMA the CPU is halted), I don't see why this should not work. What am I missing?

arnoldemu

Quote from: doragasu on 07:39, 25 January 17
Hehe, thanks for my first like on this forum.

It looks like most people think that it is not doable, or that it is very very difficult to get DMA to work, but I'm still missing a detailed explanation about why. I understand that I must make wait states pause the DMA (using the READY signal), and that interrupts might cause problems. But if I implement the wait states and if interrupts are disabled when using DMA (which makes sense, because during DMA the CPU is halted), I don't see why this should not work. What am I missing?
I don't know enough about the cpc hardware signals so what I am saying can't be treated as fully correct.

From what I understand, using ready to pause the cpu should be ok. I don't see how it would cause a problem for the gate-array or the cpu.

True, it may cause interrupts to be delayed, but I don't think they will be missed.
The Gate-Array will assert them until the CPU answers.

What you may see is that if interrupts are delayed by more than 32*64us then the next interrupt is moved and the one that synchronises with the vsync may not happen.

BASIC and firmware may see this as a missed interrupt. But for programs that enable and use the DMA and know about it, this is no problem at all.

Which method of DMA are you choosing?

Is it one that interleaves its access with the cpu so that it runs at the same time and cpu speed is not changed or one that will halt the cpu?

The Atari ST blitter and Amiga blitter can do both.

My thoughts about DMA: transferring data using the cpu is about 2/3us per cycle at best, more generally it's around 5us (LDI). So if you can read/write bytes faster than that then the DMA will be better. 1us read, 1us write is not bad, 1us read/write would be great.
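To put those per-byte figures in perspective, here is a rough Python comparison of the time needed to copy a whole ROM. The 16 KB block size is an assumption chosen for illustration (one CPC ROM bank); the per-byte costs are the ones quoted above.

```python
ROM_BYTES = 16 * 1024  # assumed block size: one 16 KB ROM bank

# Cost per byte in microseconds, taken from the figures in this post.
US_PER_BYTE = {
    "LDI (CPU)": 5,
    "DMA, 1 us read + 1 us write": 2,
    "DMA, combined 1 us read/write": 1,
}


def copy_time_ms(n_bytes, us_per_byte):
    """Total copy time in milliseconds for a given per-byte cost."""
    return n_bytes * us_per_byte / 1000


for method, cost in US_PER_BYTE.items():
    print(f"{method}: {copy_time_ms(ROM_BYTES, cost):.2f} ms")
```

With these numbers an LDI-based copy takes roughly 82 ms, against about 33 ms or 16 ms for the two DMA cases.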

Another thought, the gate-array controls access to ram for the z80, there is a 74ls which opens the bus to the z80. What I don't know is if the bus is open and it closes the bus for its access or not. Now, because the gate-array controls the ram, it can read two bytes by toggling cas/ras (not sure which) quickly to read two bytes at a time. You will not get access to these signals from the z80 side.

I think of the ram living on the gate-array side and it is the gate-keeper and opens the gates for the z80 when it wants to.

rpalmer

"Now, because the gate-array controls the ram, it can read two bytes by toggling cas/ras (not sure which) quickly to read two bytes at a time. You will not get access to these signals from the z80 side."

The RAS/CAS signals are for setting up the internal RAM address and not for reading/writing two bytes at a time. You can think of this as setting up the lower address and upper address of RAM within each chip.

The use of the ready line is (as I understand it) for access to RAM by the CRTC, so any use of it will most likely cause the display to "glitch" during display refresh. This may not be a problem in the short term, but will get annoying if it occurs too many times.

Another issue is interrupts, which the CPC uses to track the time since the machine was last reset/switched on. Stopping this will effectively leave timings out of sync for programs which rely on it (not to mention VSYNC issues).

Also the GA is not the master, but rather a cooperative friend as it interlaces video access with normal Z80 access and this is why the use of DMA is so difficult to get right on the CPC.

You could implement the Z80 DMA chip, but then the DMA transfer would be suspended whenever an interrupt is seen or when the "READY" line is active, but then you need the ready line to suspend the Z80. So it is like trying to turn the page of a book while you stand on it.


rpalmer

doragasu

Thanks for the information and thoughts, it is much appreciated!

Quote from: arnoldemu on 09:15, 25 January 17
Which method of DMA are you choosing?

Is it one that interleaves its access with the cpu so that it runs at the same time and cpu speed is not changed or one that will halt the cpu?

I'm using #BUSREQ to stop the CPU, I'm afraid. I didn't even think about interleaving, and THAT for sure would be difficult. As the READY signal from the GA is not directly tied to the latch (it goes through an 82 ohm resistor), I might drive the line and add additional wait states. But that for sure would be a challenge, and also I'm afraid I might stress the GA READY output.

Quote from: arnoldemu on 09:15, 25 January 17
My thoughts about DMA: transferring data using the cpu is about 2/3us per cycle at best, more generally it's around 5us (LDI). So if you can read/write bytes faster than that then the DMA will be better. 1us read, 1us write is not bad, 1us read/write would be great.

I'm positive that I can read from flash in 1 cycle. I suppose I can write in two cycles, maybe one could be possible. But as I do not know exactly how the GA works, I will not know until I test it. Add one wait state, and I should be able to read/write once each microsecond. But I could be wrong.

Quote from: arnoldemu on 09:15, 25 January 17
You will not get access to these signals from the z80 side.

Yeah, too bad I cannot access the signals, and also I have seen no detailed timing information about them.

doragasu

Quote from: rpalmer on 11:13, 25 January 17
The RAS/CAS signals are for setting up the internal RAM address and not for reading/writing two bytes at a time. You can think of this as setting up the lower address and upper address of RAM within each chip.

I think he is talking about Page Mode read, which can substantially accelerate sequential reads. You select the row using #RAS and then select the column using #CAS. Then if you want to read another column that is in the same row (page), you can immediately select the column using again #CAS without the need to select the row. Using this method, reads should take around 160 ns, instead of the usual 270 ns. But this doesn't explain how the GA is supposed to read two bytes in a wait cycle (250 ns).
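The timing argument above can be checked with a few lines of Python. This is a simplified sketch that uses only the approximate figures quoted in this thread (270 ns for a full cycle, around 160 ns for a page-mode cycle); the function name `burst_ns` is hypothetical and real DRAM timing has more parameters than this.

```python
RANDOM_CYCLE_NS = 270  # full RAS + CAS cycle, figure quoted in this thread
PAGE_CYCLE_NS = 160    # page-mode cycle, approximate figure quoted above
WAIT_STATE_NS = 250    # one clock cycle at 4 MHz


def burst_ns(n_reads):
    """Time for n reads in the same row: the first pays the full cycle,
    the following ones only need a new CAS strobe (page mode)."""
    if n_reads == 0:
        return 0
    return RANDOM_CYCLE_NS + (n_reads - 1) * PAGE_CYCLE_NS


two_reads = burst_ns(2)
print(f"two page-mode reads: {two_reads} ns")
print(f"fits in one {WAIT_STATE_NS} ns wait cycle: {two_reads <= WAIT_STATE_NS}")
```

Even with page mode, two reads come out at about 430 ns in this model, so they cannot fit in a single 250 ns cycle.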

Quote from: rpalmer on 11:13, 25 January 17
Another issue is interrupts, which the CPC uses to track the time since the machine was last reset/switched on. Stopping this will effectively leave timings out of sync for programs which rely on it (not to mention VSYNC issues).

Hum, I have to investigate this issue. Anyway IIRC #BUSREQ has priority over #IRQ, so maybe I do not even need to disable interrupts, the DMA will just delay them (and the user must make sure the transfer is small enough not to delay them too much).

Quote from: rpalmer on 11:13, 25 January 17
You could implement the Z80 DMA chip, but then the DMA transfer would be suspended whenever an interrupt is seen or when the "READY" line is active, but then you need the ready line to suspend the Z80.

About the interrupts, I'm not sure as I wrote above. And about pausing DMA during WAIT cycles, I don't think that it would slow DMA too much (and for sure not more than it already does to the CPU).

robcfg

As for how the GA works, see the "Gate Array Decapped" thread, in the Hardware section of the forum.



gerald

Quote from: rpalmer on 11:13, 25 January 17
Also the GA is not the master, but rather a cooperative friend as it interlaces video access with normal Z80 access and this is why the use of DMA is so difficult to get right on the CPC
It's the exact opposite.
The GA is the master, the Z80 just obeys the READY/WAITn signal to synchronise to its allocated DRAM access slot.
If you obey the rules, there should not be any issue having a DMA access.
Rules are :
- You must request the bus from the Z80 (BUSREQ/BUSAK)
- one access bus every 1ms, that is between 2 WAITn you can only access one address. You shall respect the WAITn !
- The address shall be stable half a 4MHz cycle before the read (like in the Z80 timing diagram). This is to allow the address to be stable for the RAS cycle.

doragasu

Quote from: gerald on 20:11, 25 January 17
Rules are :
- You must request the bus from the Z80 (BUSREQ/BUSAK)
- one access bus every 1ms, that is between 2 WAITn you can only access one address. You shall respect the WAITn !
- The address shall be stable half a 4MHz cycle before the read (like in the Z80 timing diagram). This is to allow the address to be stable for the RAS cycle.

I was suspecting there was a restriction like this. Why only one address per microsecond?

gerald

Quote from: doragasu on 21:28, 25 January 17
I was suspecting there was a restriction like this. Why only one address per microsecond?
I made a typo: one access per microsecond (µs), not millisecond (ms).
The GA is doing all the magic of translating the access to the DRAM (ras/cas generation, address multiplexing), and only one access per microsecond is supported.
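The one-access-per-microsecond rule can be turned into a simple slot count. The Python sketch below is only an illustration under stated assumptions: it assumes a flash/ROM read does not consume a DRAM slot (so it can overlap with the slot used for the write), it ignores the bus request latency, and the function name `copy_slots` is made up.

```python
def copy_slots(n_bytes, src_in_ram):
    """Count the 1 us DRAM slots needed to copy n_bytes into RAM
    when at most one DRAM access fits in each slot."""
    slots = 0
    for _ in range(n_bytes):
        if src_in_ram:
            slots += 1  # the read itself needs a DRAM slot
        # A flash/ROM read is assumed not to use the DRAM slot,
        # so only the write below costs a slot in that case.
        slots += 1      # the write always needs a DRAM slot
    return slots


for label, in_ram in [("ROM -> RAM", False), ("RAM -> RAM", True)]:
    us = copy_slots(256, in_ram)
    print(f"{label}: 256 bytes in {us} us ({us / 256:.0f} us per byte)")
```

Under these assumptions a ROM to RAM copy can sustain one byte per microsecond, while a RAM to RAM copy would need two slots per byte.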

doragasu

OK, clear then. I have to give a read to the GA decapping thread to read the juicy details. Anyway, for ROM to RAM writes I should be able to do a read/write every microsecond even with this restriction.

Thanks!

doragasu

I have started analysing the GA using the schematic from the GA decapped thread. Here you can have a look at some of the signals during a complete sequence (1 us):



Almost everything makes sense... except the READY signal. If you look at the #RAS and #CAS signals, you can see that each microsecond three reads are made: two for the GA (#CPU = 1) and one for the CPU (#CPU = 0). The two GA reads are done using page read mode, as I suspected (#RAS is strobed once, and #CAS is strobed twice). The CPU read is done only if #MREQ = 0 (and it is not a refresh cycle), otherwise #CAS is not lowered. It makes perfect sense...

... BUT... READY signal is lowered when the CPU performs the read, and not when the GA does it!!! Everywhere I have read just the opposite, so maybe my interpretation of the signals is just wrong...
