Would a "Z80 corruption" type of demo be possible on the CPC?

Velktron · 12:07, 17 August 15

For anyone not familiar with the reference, I had the famous 8088 Corruption, as well as its sequel, 8088 Domination, in mind.

That technology allowed video playback on an original IBM PC with an 8088 CPU @ 4.77 MHz with a CGA card and a Sound Blaster (plus a hard disk, heh). The technology used in "Corruption" used CGA's text mode (so effectively 40 x 25 resolution video, with variable-shape "pixels") in order to fully show 16 colors, while "Domination" used CGA's composite artifacting in order to produce a more full array of colors, and use a proper graphical mode.

An interesting programming aspect is that, at least in 8088 Domination, the "video format" doesn't actually need decoding at all: it consists of ready-to-execute, optimized Intel 8088 instructions that simply write the desired data straight to the CGA, so it's almost literally zero-overhead. The technique could be used on any CPU, including of course the Z80.

The biggest obstacles I see are that the CPC has no true PCM audio, no DMA channels to help with audio or disk -> memory transfers, and of course no standard way to attach a hard disk. There are a few ways to attach external streaming media to it, but everything would have to be CPU-driven.

Optimus · 12:22, 17 August 15

I bet it would be kinda possible to do videoish stuff, just have a look at all the stuff from Algorithm on the C64.
Not sure though about playing digital and updating enough screen changes at the same time.
Also, we lack a hardware char mode.
And the 8088 demos were a lot of MBs on the hard disk. But Algorithm's stuff does it in one or two disk sides.
Maybe videos which don't move so fast, with fewer changes from one frame to the next.

I'd definitely like to see a bad apple adaptation, no matter how bad

Velktron · 13:31, 17 August 15

I just now noticed that Devilmarkus has recently produced a demo which shows full motion video on the CPC, synced with chiptune audio. I am not sure if it was influenced by 8088 Corruption/Domination in any way, though it requires a 4 MB RAM expansion to hold the video data.

OK, so that proves that video of some sort is possible on the CPC (the tougher part would be encoding it, actually, especially if that was attempted back in the 80s), but is it ever possible to come near the performance of 8088 Corruption/Domination (full frame, full motion, 30 or 60 fps, even if without PCM audio)? For the good old CPC, it being a pinko Euro computer, 25/50 fps would suffice

Also, would it be possible to keep this performance up while combining it with streamed I/O or perhaps even real-time decompression?

Apollo · 16:46, 17 August 15

Well, the CPC has a large framebuffer to fill compared to the other machines: 160x200 is 16,000 bytes already. That's why big sprites were always more of a challenge on the CPC than on the other machines with more chunky access patterns or hardware sprites.
A simple "ld (hl),a", which is the fastest normal way to write into memory byte by byte, takes 2 us. Filling an overscan framebuffer of roughly 24,000 bytes that way would work out to ~20.83 fps. And that's without even changing the pointer hl or the pixel data in a.
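To put numbers on that estimate, here is a rough sketch in Python. The function name is made up for illustration, and it assumes every byte costs the same fixed time with nothing else running (no pointer updates, no sound, no loading):

```python
def max_fps(bytes_per_frame, us_per_byte):
    """Upper bound on frames per second if every byte costs us_per_byte microseconds."""
    frame_time_us = bytes_per_frame * us_per_byte
    return 1_000_000 / frame_time_us

# Overscan-sized buffer written with nothing but "ld (hl),a" (2 us each)
print(round(max_fps(24_000, 2), 2))   # 20.83
# Standard 16,000-byte screen, same instruction
print(round(max_fps(16_000, 2), 2))   # 31.25
```

Any real player lands below these figures, since it also has to move the pointer and fetch new pixel data.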

I believe it is possible with certain video material in "chunky mode 0" at ~190x120, where on average fewer than 1/3 of the pixels change per frame and the changing parts are not scattered all over the screen but confined to 1-2 "blocks". But the problem is encoding that material very efficiently, so as not to add noise and force useless changes all over the screen.
But in my opinion it would only really work on the Plus, with the help of the programmable interrupt and sound DMA, as on the CPC I see no way to still load from disk while decrunching your data and writing to the screen. Not to mention sound while doing that.

Velktron · 17:22, 17 August 15

In 8088 Corruption, simple uncompressed framebuffers of 40 x 25 text mode "video" were written to the CGA, using 8 bits per character, and I think also 8 bits for color information, so 2 bytes per "pixel", or 2000 bytes per frame (I don't know why some sources claim an 80 x 25 format; it certainly doesn't look that way, unless it's a requirement for the encoder's input). In any case, at that rate it was fast enough for simple framebuffering.

The technique used in 8088 Domination (XDC), however, is more refined than using "dumb" CPU power and/or large amounts of RAM buffers: the encoder actually converts video frames into optimized x86 code, which means that all sorts of tricks like overdraw avoidance, differential updates and RLE compression would be possible (a single row of all-equal pixels or a white screen could, in theory, be converted to a single x86 REP STOSx instruction), as well as very fast pattern and color fills. It all depends on the complexity of the encoder. The decoder quite literally has to do nothing, as everything that is to be executed has been laid out by the encoder in a directly executable form. The space, memory bandwidth and speed savings are such that the technique outperforms 8088 Corruption's in practically every aspect.
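To make the "frames compiled into code" idea concrete on the Z80 side, here is a toy sketch in Python. It is not how XDC itself works, just the simplest possible illustration: the function name is hypothetical, the screen is treated as one linear block starting at &C000 (the real CPC layout is not linear across lines), and the output is assembly text rather than machine code.

```python
def compile_delta(prev, new, base=0xC000):
    """Turn the differences between two frames into Z80 assembly text.

    prev, new: bytes objects of equal length (one entry per screen byte).
    base: assumed address of the first screen byte.
    """
    lines = []
    i = 0
    n = len(new)
    while i < n:
        if prev[i] == new[i]:
            i += 1                      # unchanged byte: emit nothing
            continue
        # Start of a run of changed bytes: point HL at it once
        lines.append(f"LD HL,&{base + i:04X}")
        while i < n and prev[i] != new[i]:
            lines.append(f"LD (HL),&{new[i]:02X}")
            i += 1
            if i < n and prev[i] != new[i]:
                lines.append("INC HL")  # only step if the run continues
    lines.append("RET")                 # hand control back to the player
    return lines

print(compile_delta(bytes([0, 0, 0, 0]), bytes([0, 0xFF, 0x0F, 0])))
```

For the two four-byte "frames" above this prints `['LD HL,&C001', 'LD (HL),&FF', 'INC HL', 'LD (HL),&0F', 'RET']`. The "player" is then nothing more than a CALL into the generated code.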

I wonder how much this would be portable to the Z80...it has the LDIR instruction, after all

What about a Z80 video format?

On the CPC, there would certainly be some benefits in that most of the CPU time would now be used for video rendering, without a clear separation from decoding/decompression, and with CPU time at a premium, that can only mean savings. Loading from disk and keeping pace would certainly be an added challenge; I don't know if real-time operation would be possible (it would depend on the "bitrate" of the source material and the amount of buffering).

But I believe that a significant portion of a silhouette-based video like "Bad Apple" could be made to fit almost entirely in the RAM of a 128 KB machine, even if at a reduced resolution and with chiptune music.

Damn...I've never taken up CPC programming (only some Z80 assembly at uni) but this sounds tantalizing

TFM · 17:29, 17 August 15

That was all done on CPC before, see French demos.

Velktron · 17:35, 17 August 15

Quote from: TFM on 17:29, 17 August 15
That was all done on CPC before, see French demos.

Can you point to a specific one? I would be surprised if 8088 Domination's technique (2014) was anticipated by CPC demos, but not so much if it has been already copied/adapted post-release.

arnoldemu · 17:40, 17 August 15

Quote from: Velktron on 17:35, 17 August 15

Can you point to a specific one? I would be surprised if 8088 Domination's technique (2014) was anticipated by CPC demos, but not so much if it has been already copied/adapted post-release.

http://www.cpc-power.com/index.php?page=detail&num=7542
http://www.pouet.net/prod.php?which=56415

arnoldemu · 17:42, 17 August 15

Quote from: Velktron on 17:35, 17 August 15

Can you point to a specific one? I would be surprised if 8088 Domination's technique (2014) was anticipated by CPC demos, but not so much if it has been already copied/adapted post-release.

not sure the technique has been used on cpc, but there have been some demos with video playback or video-like playback in them.

Velktron · 17:49, 17 August 15

Quote from: arnoldemu on 17:42, 17 August 15
not sure the technique has been used on cpc, but there have been some demos with video playback or video-like playback in them.

Well, me too, including a very recent one by our very own Devilmarkus, but those examples come nowhere near the performance and sophistication achieved by 8088 Domination, which as I explained, goes well beyond brute-force framebuffering. It's like having a decoder with LESS time penalties than a brute-force framebuffering routine

If an 8088 @ 4.77 MHz manages to pull off 30 fps at 160x200 (16 colors/256 artifact colors) or even 640x200 (2 colors) with CPU power to spare using that technique, then why not the Z80?

Alcoholics Anonymous · 18:27, 17 August 15

Quote from: Velktron on 17:49, 17 August 15

Well, me too, including a very recent one by our very own Devilmarkus, but those examples come nowhere near the performance and sophistication achieved by 8088 Domination, which as I explained, goes well beyond brute-force framebuffering. It's like having a decoder with LESS time penalties than a brute-force framebuffering routine

If an 8088 @ 4.77 MHz manages to pull off 30 fps at 160x200 (16 colors/256 artifact colors) or even 640x200 (2 colors) with CPU power to spare using that technique, then why not the Z80?

Video playback has been done many times on several z80 platforms over the past 10 years. Usually the source data comes over ethernet, hard drive or sd card.

One music video was made using a zx spectrum with video and various demo-like things (audio is not the zx of course). This was run from an sd card.

SymbOS has some sort of video format too:

That's an MSX TurboR which is a fast z80 and in that clip, three videos are played back at the same time.

Automatic tools exist on the spectrum to make videos without sound. I've seen one attempt with video and sampled AY sound but the quality of what can be achieved makes it more of a curiosity.

The 8088 demo you pointed at has a lot of hardware assistance that most z80 machines don't have. A character mode can cut down the amount of video memory that needs updating ten times. I would think its method of streaming executable code that updates the screen would be slower than sending raw data, especially for bitmapped displays as on most z80 machines. In the spectrum scene at least, delta updating has been tried with some success -- that's a simple means of updating changing areas from frame to frame. As soon as you try to execute instructions that do something other than writing to the display file, you're giving up quite a few pixels that can be updated in a frame. In spectrum z80 players, one screen byte (8 pixels) can be written in 5.5 to 10 cycles -- will any amount of streaming executable code be able to compete with that? Adding two 16-bit numbers together forfeits writing somewhere between 8 and 16 pixels.

Knowing that it always takes the same amount of time to update strips of the display (because the entire screen is written) also means AY sampled sound can be inserted into the draw code. Most z80 machines do not have dma to automatically feed sampled sound at regular rates so to get sampled sound played, the sound code has to be intermingled with the video update and played at precise intervals.

Anyway I'm not saying don't try it. Try it by all means. Just realize that 8088 is getting a lot of hardware help in that demo

TFM · 20:10, 17 August 15

Well, a long time ago I developed a compression format for the CPC. Actually it's way more than compression, it's what I called CoData. A mix of Code and Data. The "result" of a "compression" is a kind of program which can be executed. So I can do 50 fps fullscreen. Here are some examples:

! No longer available


Velktron · 10:04, 18 August 15

Quote from: Alcoholics Anonymous on 18:27, 17 August 15
The 8088 demo you pointed at has a lot of hardware assistance that most z80 machines don't have.

That's true, and it's what allows for seamless disk-to-RAM and RAM-to-soundcard transfers. But even without those facilities, the "clou" of the technology is the XDC "delta compiler" system: everything else being equal, at least on Intel x86, that proved the most efficient way to "throw" pixels at the screen, and that should be even more true on a machine such as the Amstrad CPC, where video is entirely memory mapped (actually, video RAM and main RAM are one and the same).

Quote from: Alcoholics Anonymous on 18:27, 17 August 15
In spectrum z80 players, one screen byte (8 pixels) can be written in 5.5 to 10 cycles -- will any amount of streaming executable code be able to compete with that? Adding two 16-bit numbers together forfeits writing somewhere between 8 and 16 pixels.

I am not certain how many cycles the simplest Z80 instruction that would write a byte to a specified location in memory takes, but on this page there's an interesting discussion on the performance of the LDIR instruction: the claim is essentially that it consumes 21 cycles per byte moved, and that even that is a massive improvement over a naive approach with a hand-coded loop, which requires 35. Of course on the Amstrad there's the problem of cycle rounding to multiples of 4....so make that 24 vs 36.

I don't doubt that it's possible to write a byte even faster: as this Z80 reference suggests, a single LD (HL),r operation should take 7 cycles (make that 8 on the Amstrad), but that must imply a severe size/speed tradeoff if only LD (HL),r operations are used to "throw" pixels at the screen, as you'll need memory for both the pixel data AND the instructions. Plus, you'll need to increment the HL register manually, and an INC HL operation "costs" another 6 cycles (8 on the Amstrad), so in the end you'll have 13 cycles per byte (16 on the Amstrad).

Since a conditional branching instruction "costs" about 10 cycles, to avoid performing that for as long as possible you must unroll as many LD (HL),r and INC HL instructions as you can, so in the end this is a classic speed-space tradeoff, and there isn't really much of either on a Speccy or CPC. That average figure you quoted ("between 5.5 and 10 cycles") must have been calculated assuming a non-trivial amount of unrolling and speed-space tradeoff.

Perhaps blindly using the LDIR instruction is not the best approach here, the way REP MOVS would be on Intel, and a "smarter" approach (like using a combination of self-modifying code and partially unrolled loops) might be needed. An even better approach would be to use the 16-bit LD instructions whenever that would help reduce the speed/space tradeoff somewhat.

Quote from: Alcoholics Anonymous on 18:27, 17 August 15
Knowing that it always takes the same amount of time to update strips of the display (because the entire screen is written) also means AY sampled sound can be inserted into the draw code. Most z80 machines do not have dma to automatically feed sampled sound at regular rates so to get sampled sound played, the sound code has to be intermingled with the video update and played at precise intervals.

Even using a "codec" with a heavily variable bit/data rate such as XDC, there are some "fixed" timing points, such as when a line or the entire screen finishes updating. Essentially, the playback routine (which is ALSO the screen data itself!) must "yield" control to the system as soon as it completes a goal (completely drawing a line, completely drawing a screen, etc.), becoming asynchronous. Of course, you'll need an external timing system capable of handling such asynchronous events. Lacking that, the encoder itself must ensure constant line/frame timing by both keeping max bitrate/execution time under a certain threshold, and even inserting deliberate delays and wait states, if necessary (e.g. if a line is "easy" to draw and completes too quickly).

The interesting thing about such codecs is that a lot of the responsibility for creating proper timing and keeping the bitrate under an acceptable maximum lies with the encoder. For example, a line that is "easy" to draw (all similar pixels) will be "easier" to express and faster to execute than one that has a more complex pattern. The encoder must ensure that even in the case of the maximum possible complexity, the bitrate stays at a pace which will not affect sound sync or total frame timing, and that's not a trivial thing to do, as the encoder will have to "look" at an entire frame and even some past/future frames before making an optimal decision (that might even include looping over previously drawn lines, or skipping ahead in memory). Conversely, the encoder might even have to insert deliberate padding, pauses and waits if that would help with keeping sync (e.g. if the sound routine really relied on the video for timing). Overall, this is a very interesting optimization and encoding problem, more so than a programming one
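The padding half of that job is easy to sketch. The Python below is only an illustration with made-up names: it assumes the encoder already knows how long each line's drawing code takes (in NOP units, 1 us each on the CPC) and that each line gets a fixed time slot.

```python
def pad_lines(line_costs, budget):
    """Return the number of padding NOPs to append after each line.

    line_costs: time each line's drawing code takes, in NOP units.
    budget: fixed time slot per line, in the same units.
    Raises ValueError if a line does not fit; a real encoder would have to
    simplify that line (e.g. drop some changes) instead of giving up.
    """
    padding = []
    for index, cost in enumerate(line_costs):
        if cost > budget:
            raise ValueError(f"line {index} needs {cost} NOPs, budget is {budget}")
        padding.append(budget - cost)
    return padding

print(pad_lines([10, 64, 37], budget=64))  # [54, 0, 27]
```

The hard half, deciding what to drop when a line blows the budget, is where the actual optimization problem lives.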

Quote from: TFM on 20:10, 17 August 15
Well, a long time ago I developed a compression format for the CPC. Actually it's way more than compression, it's what I called CoData. A mix of Code and Data. The "result" of a "compression" is a kind of program which can be executed. So I can do 50 fps fullscreen.

Come on man, you know you need to run Bad Apple through that bad boy

TFM · 15:28, 18 August 15

Sorry, I speak English as a 2nd language, I don't understand what the reference to Apple means. If this is some kind of slang expression I don't get it, or it's not used in NOLA.

But what I know is that my CoData is more efficient than any "XDC delta compiler". In the worst case CoData needs only 3.5 us to write a byte to screen RAM, in the best case it needs 2 us. Bytes unchanged between two frames are not changed of course. I guess that 2 us is about 300% better than the 6 us of the LDIR, right?

I know that my posted examples are not beautiful, because I'm not a Tolkin or MacDeath or Carnivac or GFX person at all, but it shows that it can run with 50 fps in full screen, independent of the screen MODE.

So: The CPC can do it! And even better!

Velktron · 15:53, 18 August 15

Well, this is the Bad Apple I'm talking about:

If your encoder is efficient enough to allow reproducing any portion of it at 25 fps in Mode 2, even without sound, you may be in for a CPC first


Quote from: TFM on 15:28, 18 August 15
But what I know is that my CoData is more efficient than any "XDC delta compiler". In the worst case CoData needs only 3.5 us to write a byte to screen RAM, in the best case it needs 2 us. Bytes unchanged between two frames are not changed of course. I guess that 2 us is about 300% better than the 6 us of the LDIR, right?

As I explained, it's certainly possible to write a single byte (or even a 16-bit word) anywhere in memory faster than what LDIR can perform on average during a block move. However, you'll have to "pay" something in terms of space: LDIR can replicate large portions of memory with very little RAM usage, and sometimes it may pay to make this tradeoff, sometimes not. That's really an optimization problem the encoder must solve, depending on its input. By stringing together a bunch of immediate or register to memory instructions, sure, you may write bytes faster on average, but you will also eat up memory faster (don't forget, LDIR doesn't need to carry its own data into the instruction itself, just move it from memory to memory, and if the moves are from video RAM to video RAM, you can realize significant savings). A good encoder should be able to decide when to use immediate writes, and when to replicate existing blocks (this may also mean making decisions such as e.g. copy Line 137 from Frame 19 to Line 101 in Frame 20).
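That tradeoff can be put into a small cost model. The Python below is a sketch under stated assumptions, not a real encoder: times are in CPC NOP units using the per-instruction figures quoted in this thread (LD (HL),value 3, INC L 1, LDIR 6 per byte and 5 for the last one), the 3 NOPs per 16-bit register load and the opcode sizes are standard Z80 figures, and the function names and the `size_weight` knob are invented for illustration.

```python
def immediate_cost(n):
    """n bytes written as LD (HL),value : INC L pairs (HL already set)."""
    return {"time": 4 * n, "size": 3 * n}   # 3+1 NOPs, 2+1 bytes per screen byte

def ldir_cost(n):
    """n bytes copied with LD HL,src : LD DE,dst : LD BC,n : LDIR."""
    setup_time, setup_size = 9, 9           # three 16-bit loads, 3 NOPs / 3 bytes each
    return {"time": setup_time + 6 * n - 1, "size": setup_size + 2}

def pick(n, size_weight=1.0):
    """Pick the cheaper option under a weighted time + size score."""
    def score(cost):
        return cost["time"] + size_weight * cost["size"]
    a, b = immediate_cost(n), ldir_cost(n)
    return "immediate" if score(a) <= score(b) else "ldir"

print(pick(4), pick(20), pick(100, size_weight=0))  # immediate ldir immediate
```

With memory counted as free (`size_weight=0`) immediate writes always win on time; the more a byte of RAM is worth, the shorter the run at which LDIR takes over. LDIR is of course only an option when the bytes to copy already exist somewhere, e.g. elsewhere in video RAM.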

Optimus · 18:06, 18 August 15

Other techniques to fill the screen or copy/write bytes that have been used:

PUSH reg, 4 NOP cycles to write two bytes, so 2 NOP cycles per byte.
Unroll a list of LDIs instead of using LDIR. The latter takes 6 NOP cycles per byte, the former 5.
You can store the most frequent bytes in 8bit regs. For example, a bad apple b&w video in Mode 0, will have 3 byte types, full white, left pixel black and right white, and the opposite. Store them in the beginning of frame in B,C and D and you can do LD (HL),B C or D so no need to read value from source. 2 NOP cycles, instead of LD (HL),value which is 3.
But you have to also increase HL to go to the next byte. INC L sometimes is sufficient in some screen boundaries.

Or your bytes might be scattered. Then I think there is LD (nnnn),A (4 NOP cycles?) and LD (nnnn),HL (5 NOP cycles?). So you can init the most frequent 8bit or 16bit values appearing in the frame and then have a series of LD (nnnn),reg. I think this was used in the Backtro tunnel, which is not an animation, but it was still reading frequently appearing texel values and writing them many times.
It all depends on the data, so you search the optimal solutions per frame and use various techniques together. And of course no brute force frame, but Delta Packing (which means to only write the bytes that changed compared to the previous frame), already used in many animations on CPC (Overflow's 3d animation on the shadow of the beast preview disk).
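The "keep the most frequent bytes in registers" trick combined with delta packing is easy to prototype on the encoder side. The Python below is a sketch with hypothetical function names; it only counts the write instructions themselves (2 NOPs for LD (HL),r, 3 for LD (HL),value, as quoted above) and ignores pointer updates.

```python
from collections import Counter

def assign_registers(prev, new, registers=("B", "C", "D")):
    """Map the most frequent newly written byte values to 8-bit registers."""
    changed = [n for p, n in zip(prev, new) if p != n]   # delta packing: changed bytes only
    common = Counter(changed).most_common(len(registers))
    return {value: reg for (value, _count), reg in zip(common, registers)}

def write_cost(prev, new, reg_map):
    """NOPs spent on the writes alone: 2 for LD (HL),r, 3 for LD (HL),value."""
    return sum(2 if n in reg_map else 3
               for p, n in zip(prev, new) if p != n)

prev = bytes(8)
new = bytes([0xFF, 0xFF, 0xFF, 0xF0, 0xF0, 0x0F, 0x00, 0x00])
regs = assign_registers(prev, new)
print(regs, write_cost(prev, new, regs))  # {255: 'B', 240: 'C', 15: 'D'} 12
```

In this tiny example six bytes changed and all of them fit the three registers, so the writes cost 12 NOPs instead of 18.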

Also, you might say that LDIR is better for size and it is, but the animation frame data won't always be series of LD (HL),A or LDI in the frame. But they can be information like, source/destination/size and calling in an address of code where I have 40 LDIs or LD (HL),A:INC L in a row with a RET at the end. So if I have to copy 10 times, after setting HL/DE, I call at the address after the first 30 LDIs. And LDIs are one cycle less than LDIR. And also LD (HL),A:INC L is 3 NOP cycles per byte.
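The entry-point arithmetic for such an unrolled block is simple enough to show. A small Python sketch, assuming the block is nothing but back-to-back LDIs followed by a RET; the function name and the &8000 address are hypothetical:

```python
LDI_SIZE = 2  # LDI is a two-byte opcode (ED A0)

def ldi_entry(block_start, block_len, count):
    """Address to CALL so that exactly `count` LDIs run before the final RET."""
    if not 0 <= count <= block_len:
        raise ValueError("count must be between 0 and block_len")
    return block_start + LDI_SIZE * (block_len - count)

# 40 unrolled LDIs at a hypothetical address &8000; copy 10 bytes
print(hex(ldi_entry(0x8000, 40, 10)))  # 0x803c
```

So the frame data only needs to store source, destination and one precomputed call address per copy.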

There are many techniques already used in CPC demos for other routines, and some animation players have used them. It also depends on the data; you might prefer some data over others depending on the case. Batman Forever also uses a unique technique for its vector-based animations, which is something like single-line splitting combined with delta packing on a 1D level. It's a regular technique for hardware splitline fx, but something I have not seen before in animation on the CPC. Reportedly it can reduce address storage a bit: you don't store the 16bit address of the bytes to write, but 8bit address offsets within a single vram line. But many coders also cared more about size reduction than speed. No Recess' demo Phreaks is kinda slow, but he searched for techniques for storing more animation in a limited size. I think he explained his technique, and it uses some kind of tiled compression.

Optimus · 18:09, 18 August 15

For a comparison of Bad Apple in retro machines.

This was the Mega Drive version that sparked the interest:

The C64 version:

PC 8088 version (second part on the 8088 Domination demo) - at 2:56

TFM · 18:10, 18 August 15

Quote from: Velktron on 15:53, 18 August 15
Well, this is the Bad Apple I'm talking about:

If your encoder is efficient enough to allow reproducing any portion of it at 25 fps in Mode 2, even without sound, you may be in for a CPC first


NO problem at all. For the sound I have to ask Tom & Jerry though.

Velktron · 09:51, 19 August 15

Quote from: TFM on 18:10, 18 August 15

NO problem at all. For the sound I have to ask Tom & Jerry though.

Well, that would be awesome. Don't bother with the sound yet (unless you meant a chiptune rendition).

The various versions posted by Optimus are very interesting for comparison: the Genesis one might just be brute-force framebuffering, with bitmaps preconverted to its screen tile format, OR it might be a smarter "macroblock" format using some predetermined tile shapes (though with 2 levels, there would be 2^64 possible 8x8 B&W tiles to choose from, so I really hope they had a good optimizer). It's also not just binary B&W (the original video had some grayscale), so I guess that's the best one visually.

The C64 version seems to use some form of poor man's MPEG, with visible macroblocks, which suggests that a tile mode with precalculated patterns which were simply being pointed at was used (hence the insane compression ratio). Pretty clever and effective, all things considered.

The PC version (which is the one that conceptually is closer to the capabilities of the Amstrad CPC, even the graphic modes are similar) OTOH simply uses line-based writes into a bit-planar display, which should be well-suited for the Amstrad CPC if not for the fact that the Z80 is nowhere near as efficient in performing block moves as the 8088, even disregarding all hardware help. Now, the encoder used on the PC (or at least the hardware used for playback) looks quite "dirty" in some scenes. I don't know if that's the hardware or the capture's fault, or if the encoder had to make compromises at some point (it certainly doesn't appear to be a lossless procedure).

Now, which compression technique and which sequence of commands will be the most efficient for encoding a video with CoData or XDC or whatever, will depend also on what kind of storage media and RAM is available. The fastest single-byte writing techniques on the Z80 also seem to consume a lot of RAM, so it will also have increased needs for I/O.

E.g. if you end up using 1 byte of program memory for each byte written into video RAM, then you will need to rebuffer very frequently. When you rebuffer, you're not drawing to the screen and you will be eating up memory and CPU cycles just the same, so any speed advantage might be lost.
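A back-of-the-envelope version of that rebuffering worry, as a Python sketch. The function name is invented and the example figures (64 KB of spare RAM, 2,000 changed bytes per frame) are arbitrary assumptions, not measurements of any real video:

```python
def seconds_buffered(buffer_bytes, changed_per_frame, code_bytes_per_write, fps):
    """How long a RAM buffer of compiled frames lasts before it must be refilled."""
    bytes_per_frame = changed_per_frame * code_bytes_per_write
    frames = buffer_bytes // bytes_per_frame
    return frames / fps

# 64 KB of spare RAM, 2,000 changed bytes per frame, 1 code byte per write, 25 fps
print(seconds_buffered(64 * 1024, 2000, 1, 25))  # 1.28
```

Under those assumptions the buffer would drain in a bit over a second, which is why the code-size side of the tradeoff matters as much as the raw write speed.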

As I said, this is first -and foremost- an optimization and tradeoff problem

remax · 15:54, 19 August 15

Quote from: Velktron on 15:53, 18 August 15
If your encoder is efficient enough to allow reproducing any portion of it at 25 fps in Mode 2, even without sound, you may be in for a CPC first

ESPECIALLY without sound

TFM · 16:36, 19 August 15

@Velktron : So how quick is the 8088 in block transfers? The CPC can do it with 5 us per Byte.

Velktron · 18:14, 19 August 15

Quote from: TFM on 16:36, 19 August 15
@Velktron : So how quick is the 8088 in block transfers? The CPC can do it with 5 us per Byte.

Well, this datasheet claims that the formula for the 8088 is 9+17n cycles for REP MOVSB (moves bytes) and 9+25n for REP MOVSW (moves 16-bit words).

Let's consider n large enough so that you can ignore the small 9 cycle overhead.

For a 4.77 MHz CPU without the CPC's memory access limitations, and without taking into account factors like memory segmentation etc., REP MOVSB would work out to about 280 KB/s, or 1 byte every 3.56 us. Quite respectable, all things considered. It would work out to 17.5 full-frame, non-delta updates per second, with 16K video pages.

Using REP MOVSW, the figures would be about 190 KWords/sec or 380 KB/sec (!) with 2.62 us per byte and a proportionate increase in performance for fps. At that point however, other factors such as memory contention with other I/O devices or the old RAM just being slow as sh*t would limit performance, but there would be no other CPU instructions to fetch from RAM: it would be purely the CPU writing data by itself for as long as the instruction tells it to.

Don't underestimate the power of this instruction: not all CPUs have it, and it can make an impressive difference whenever a programming language supports it (e.g. memcpy and friends typically rely on REP MOVSx on Intel, and memset on the related REP STOSx).

Now, Trixter claims that this is the fastest method for copying and moving bytes on the 8088 (and on Intel, in general), so who am I to question that?

For the humble Z80, the closest equivalent instruction (LDIR) takes 21 cycles per byte moved (16 for the last one), which becomes 24 (20 for the last one) on the Amstrad due to its memory access pattern. So max performance would be 166.66.. KB/sec, or about 6 us per byte (!), quite close to the 5 us figure you quoted, and that's with a single instruction that doesn't need to be intermixed with others.
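For anyone who wants to redo the arithmetic, here it is as a short Python sketch. It ignores setup overhead and bus contention, takes a KB as 1000 bytes like the figures above, and uses the per-iteration cycle counts quoted in this post; the helper names are just for illustration.

```python
def us_per_byte(cycles_per_unit, clock_mhz, bytes_per_unit=1):
    """Microseconds per byte for a repeated block instruction, ignoring setup."""
    return cycles_per_unit / clock_mhz / bytes_per_unit

def kb_per_second(us):
    return 1_000_000 / us / 1000        # KB taken as 1000 bytes

movsb = us_per_byte(17, 4.77)           # REP MOVSB on the 8088
movsw = us_per_byte(25, 4.77, 2)        # REP MOVSW moves two bytes per iteration
ldir = us_per_byte(24, 4.0)             # LDIR on the CPC: 21 cycles stretched to 24

for name, us in (("REP MOVSB", movsb), ("REP MOVSW", movsw), ("LDIR (CPC)", ldir)):
    print(f"{name}: {us:.2f} us/byte, {kb_per_second(us):.0f} KB/s")
```

This gives roughly 3.56, 2.62 and 6.00 us per byte respectively, so on pure block moves the 8088 comes out about 1.7 to 2.3 times faster than LDIR on the CPC.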

Some people like Optimus mentioned that there are a few ways to write a byte asymptotically faster than that, but that MUST come at the cost of interspersing multiple instructions with data transfers, and not being able to use the same instruction to write more than a few bytes at a time.

Now, IMO this suggests that the same technique used on the 8088 and CGA would also work on the Amstrad with its Z80, given that the CPUs and the graphics are so similar and related.

TFM · 18:40, 19 August 15

Thanks for the information about the 8088, so 2.6 us is not bad, nearly twice as fast as the LDI of the Z80. Therefore on Z80 machines it's necessary to choose other ways. A PUSH can do a "write byte" in 2 us, that kicks the 8088's butt. Of course it needs smart algorithms, else the advantage is lost soon. In case of the "Bad Apple" it's perfect though.
