Conway’s Game of Life for CPC—very fast+many features coming! New: QuadLife! p.3

db6128 · 05:11, 06 December 12

Edit: This is now massively outdated, as it's a lot faster and I've also released several other versions, including most recently QuadLife (page 3) and also a MODE 1 version that I might carry on with at some point. Please see the later posts for up-to-date information.

[original post]
—
Well, now that several embarrasingly self-inflicted bugs are gone and 03:40 in the morning has arrived—here it is! the first and most basic version of my implementation of John Conway's Game of Life for the CPC, ticking along at what I estimate as a not-at-all-shabby rate of about 20000 cells, i.e. just under 2 generations per second, all in glorious double-height MODE 0 pixels, no less!

Perhaps before the following text sends you to sleep, regardless of your timezone, just RUN"AMSTRIFE on the attached DSK

By the way, the name is not necessarily final and, yes, probably very silly, but it's better than nothing!

Backstory: DISCLAIMER THIS BIT MIGHT BE VERY BORING SKIP IT IF YOU WANT

I spent ages—and lost far too much sleep and other advisable activities—playing around with Life and related cellular automata in more modern contexts as a (very amateur) hobby and way to learn more about programming. I started with a terminal-based version programmed in C as one of my earliest programs while beginning to learn that language. That eventually led to an implementation in C/SDL that eventually got pretty darn fast—but became so horribly hackish (hideous macros and bitfields everywhere) that I ended up deleting it outright. Next, I taught myself the basics of C++ and eventually put together an object-oriented variant that incorporated various fairly decent features (see below) and was running pretty brilliantly for a while. But... it started to get slow, seemingly suddenly and for no reason (perhaps it was my laptop...), and that put me off the idea for a while, especially after I broke my laptop (...could it see the future?). I'm now back on Windows (secured a laptop with W7 whilst I still could) but haven't bothered to set up SDL or anything else, or even a proper IDE really.

Couple all of that with my renewed interest and/or extreme procrastination in the world of the CPC, and this is what we get! I've been wanting to do this for a while, but my command of ASM is only just getting acceptable, and it's been quite tricky to work out the logic behind this on the Z80 and in a fairly optimised way. Having said that, most of the progress was made tonight—er, well, it's now the very early morning, so I mean last night...—which is definitely very encouraging!

Of course, this is just the most basic first stage and a proof-of-concept. I'm sure plenty of people have done similar things before. I hope to get it to a state where it stands out, though—not that it'll ever be Hashlife or anything, but—by optimising and expanding it as much as my abilities allow. Immediate plans include...

* quite possibly a MODE 1 version and thus a larger world, at least for the basic algorithm (see below) – assuming it doesn't get far too slow with all the extra RAM-ing and suchlike
* configurability, particularly including an editor for the initial pattern and in-game changes (reading keyboard input does not look fun... Come back, firmware; all is forgiven)
* a custom-made text-displaying routine and font with which to display statistics on request; real-time would slow things to a crawl and really just gets in the way, anyhow
* and then the actually fun stuff, ported from (mostly my memory of) my relatively successful C++ version, namely things like these:
—identifying just-born and just-died cells with distinct colours;
—implementing Generations, the variant in which cells have lifespans, which changes their interactions considerably;
—adding the option to give the world/field a topology, i.e. wrapping edges/bounded grids: toroidally, Klein bottle–esque, spherical, etc.
—implementing QuadLife, in which cells can have a user-defined number of up to four colours and inherit their colour from their 'parents'; and
—my proudest achievement so far since I started wasting my time with this back in the early days of struggling (and not sleeping) to teach myself C: something I nicknamed Inheritors, which combines Generations and the colour-inheriting system to create technicolour madness all over your screen.

With the emphasised caveats about how rudimentary this is aside, it is quite a bit faster than a completely naïve version would be and already supports completely customisable rules as to which numbers of neighbours allow cells to survive or cause them to be born. Having said that, standard Life is only of any real interest when using its classic rules (i.e. the ones set up here). I hope to have Generations working ASAP because it enables various combinations of rules, many of which produce really interesting behaviour. I could link to some examples, but I'll save that as a treat for you all, and it wouldn't be as good if you weren't watching it on a CPC, anyway.

Z80babble: My immediate plan was to leave Generations limited to a maximal lifespan of 16 generations, which would allow me to remain using one byte per cell as in the bog-standard version of Life you see before you. However, I am seriously tempted to store cells in pairs of bytes and plough through the world at a ridiculous pace using my new best friend, SP >:-D That would also be necessary for more 'complex' 'algorithms' (scare-quoted because I don't want to mislead people into thinking I'm any good at maths...) such as Inheritors.

Anyway, I hope you enjoy this. Please give me any feedback or suggestions you have! And again, please stay tuned for many—and massive—updates.

db6128 · 03:47, 08 December 12

My expirements with MODE 1 have showed that it is not worth it: even without double-buffering, it doesn't allow much extra size, due to the memory required rapidly increasing with the dimensions of the grid - and the relatively small increase in the size of the grid is almost complete offset by the smaller pixels meaning that I could only fill just over a quarter of the screen.

So, good old double-height MODE 0 it is! I've maxed that out: from my initial post, it's now up from 128x96 to 128x110, and I have just &1000 bytes of memory free for code and data. I guess I can move single-use code and data into areas that will subsequently be overwritten. I might need to remove the double-buffering for certain algorithms/variants; having said that, it doesn't look too bad with a single screen; sure, you see a line moving up as the cells are updated, but it's not really distracting, actually quite cool in its own way.

Speaking of double-buffering, though, I wonder if anyone has an idea about this: when I have it enabled, there is slight flicker, seemingly between the two screens, at the bottom of the grid (only visible if it's populated at the time, of course). Why could this be? I can't think of anything I'm doing wrong that would cause it, but then I can't see why it would just happen either.

I've attached the slightly updated (126x110, etc.) version if you can offer any thoughts about the above issue, and just for anyone who's interested. There's a lot more to come from this, but I'm having a little break for a couple of days. . . Although I said that yesterday, and so today I ended up . . . just trying to program something different instead.

Axelay · 07:09, 08 December 12

Quote from: db6128 on 03:47, 08 December 12
Speaking of double-buffering, though, I wonder if anyone has an idea about this: when I have it enabled, there is slight flicker, seemingly between the two screens, at the bottom of the grid (only visible if it's populated at the time, of course). Why could this be? I can't think of anything I'm doing wrong that would cause it, but then I can't see why it would just happen either.

I've had a bit of a look at what I presume is the point you are doing the double buffering, and I think the problem is you have not waited for vsync with the swap to &4000 after your screen copy.

I've attached a screen grab from Winape halted at the breakpoint I inserted just before your code swaps to the screen at &4000. The horizontal dotted line Winape has put on the screen shows that at the point you are telling the CRTC to now use &4000 as the visible screen, the beam is not quite halfway through refreshing the monitor. More importantly, the CRTC is in the middle of displaying the current screen, and it will not reflect a change to the screen base address until it begins a new screen. So potentially, your code will go on to start modifying &c000 while the CRTC is still pointing to that area of memory for the current screen. I believe that is what is visible briefly at the bottom of the screen.

Also, I hope you dont mind a comment on your screen copy. You have used a short LDI string, I guess because LDI strings are faster than LDIR? But your LDI string is too short if speed was the reason for using it. LDI is only 1 NOP faster than LDIR per byte, and with just 8 LDIs, you only have 8 NOPs saved in that loop compared to LDIR, but your loop structure with two counters is at least 8 NOPs by itself, so there is no speed gain with the LDI list that short. Although if I am not mistaken your copy completes with HL=&0000? If that's correct then a faster loop would be to replace both the A and IXH counter checks with LD A,H / OR A,L / JR NZ,Loop after your LDI string.

db6128 · 16:09, 08 December 12

Hi Axelay, thanks for a great reply! I have been heavily editing this post (as is characteristic of me) as I think more about your suggestions.

QuoteI've had a bit of a look at what I presume is the point you are doing the double buffering, and I think the problem is you have not waited for vsync with the swap to &4000 after your screen copy. [...] just before your code swaps to the screen at &4000 [...] the beam is not quite halfway through refreshing the monitor. More importantly, the CRTC is in the middle of displaying the current screen, and it will not reflect a change to the screen base address until it begins a new screen. So potentially, your code will go on to start modifying &c000 while the CRTC is still pointing to that area of memory for the current screen. I believe that is what is visible briefly at the bottom of the screen.

This is fantastic - thanks very much!

In fact, now that I think about it, that would only have started after I changed my vertical counter to run negatively over the scanlines instead of from top to bottom - so that I could use DEC C:JP NZ instead of DEC C:LD A,C:CP 110 - which was an infinitesimal optimisation at best.

Thus, the new rendering frame began at the bottom of the screen - just in time for the CRTC to see it, I presume, as per your insights. Actually, I might have been looping backwards even before I increased the screen size, but if so, I assume that the lowest line was too high for me to notice the problem.

Changing the loops back to incrementing from 0 to 109, a.k.a. rendering from top to bottom, has everything working perfectly again!

That's just the timing of the refresh being kind to me, though! For a moment I was concerned that I might have to do two flybacks or something, or that there'd be no way around it. But after what you said, I realised that it seems I don't need to do even one V-sync, if the CRTC won't apply the new address until the next frame - certainly, removing the wait does not seem to harm anything. So, that's a second good hint!

However, I wonder if something like that will re-emerge when I implement more complex algorithms that might change this rather precarious balance between my code and the (however real or virtual) electron gun..

If I wanted to be really obsessive about trivial optimisations, I suppose I could store my table of scanline addresses in reverse order and go back to decrementing . . . but insanity lies that way, I think, considering how minor the outer loop over the lines is. In fact, I might even see if I can take another slight hit to speed but gain some valuable memory by removing the table and just incrementing the address as I go along. At this rate, an extra 512 bytes might be a valuable commodity soon! Edit: Yes, it seems that adding manually will take only one more NOP but remove the need for that table - which I have just enthusiastically deleted. Ah, I love finding new tricks! :V

QuoteAlso, I hope you dont mind a comment on your screen copy. You have used a short LDI string, I guess because LDI strings are faster than LDIR? But your LDI string is too short if speed was the reason for using it. LDI is only 1 NOP faster than LDIR per byte, and with just 8 LDIs, you only have 8 NOPs saved in that loop compared to LDIR, but your loop structure with two counters is at least 8 NOPs by itself, so there is no speed gain with the LDI list that short. Although if I am not mistaken your copy completes with HL=&0000? If that's correct then a faster loop would be to replace both the A and IXH counter checks with LD A,H / OR A,L / JR NZ,Loop after your LDI string.

I can't complain about constructive comments and suggestions, really; delving into someone else's machine code, especially mine, is an effort that I think should usually be appreciated - and anyway, I think everyone here is too nice to steal anything!

I believe my calculations for this were a bit off before, but the unrolled variant should still be slightly faster (or maybe my concept of micro-time is a bit off), about 15 ms, than a simple LDIR.

Your suggestion about HL is very clever - at first it seemed like the same 16-bit loop counter 'trick' could still win (thankfully the added overhead of using IXH was minimal due to its low value) if I increased the level of unrolling - but then I realised that my sums had been wrong (again...) - your version seems to be slightly faster, saving me about 6 ms! Which more than outweighs the puny 1 NOP per scanline additional cost of summing to the next scanline on the screen manually instead of using the table.

Here are my latest sums:

Code Select

; simple
ldir ; ~&7800x6 = 184324 NOPs / 184 ms

; unroll x8
; thus outer counter must be &7800/&100/8 = 15
ldi ; 8x5 = 40 NOPs
dec a:jr nz ; 4 = 44 x 256
dec ixl:jr nz ; 5 * 15
; = 44x256x15 + 5*15
; = 169035 NOPs / 169 ms

; unroll x12 (16 is not divisible into (&7800/&100))
; thus outer counter must be &7800/&100/12 = 10
ldi ; 12x5 = 60 NOPs
dec a:jr nz; 4 = 64 x 256
dec ixl: jr nz  ; 5 * 10
; = 64x256x10 + 5*10
; = 163890 NOPs / 164 ms

; Axelay's suggestion
ldi ; 16x5 = 80 NOPs
ld a,h ; 1
or l ; 1
jr nz ; 3
 ; = 85 * 1920 = 163200 NOPs / ~163 ms

So, you're now a contributing author!

(Incidentally, even though there's no point changing back, I wonder if these would have been enough of a gain to stop the flickering, haha)

Btw, as for unrolling, that's a pretty nifty CLS and memory clear at the beginning, right?

It felt even more epic when there was just one gigantic contiguous area of memory to clear! But that layout had the corollary that my screen and data were not adjacent, meaning that I had to do two LDIRs per frame - so, out of a single-time loop and one that happens ever frame, it was obvious which one would have its way with my RAM layout.

Really appreciate the great comments. In particular, it might have taken me a while, perhaps forever, to realise how I'd messed up the screen refreshing, so that's brilliant. Thanks a lot!

Edit to add: I might as well attach the latest revision incorporating all the changes I've mentioned/rambled at length about:

Axelay · 09:42, 10 December 12

Quote from: db6128 on 16:09, 08 December 12
So, you're now a contributing author! (Incidentally, even though there's no point changing back, I wonder if these would have been enough of a gain to stop the flickering, haha)

...But I guess I'm hoping you eventually find a way to use both screens as buffers so that copy isnt needed at all.

Quote from: db6128 on 16:09, 08 December 12
Btw, as for unrolling, that's a pretty nifty CLS and memory clear at the beginning, right? It felt even more epic when there was just one gigantic contiguous area of memory to clear! But that layout had the corollary that my screen and data were not adjacent, meaning that I had to do two LDIRs per frame - so, out of a single-time loop and one that happens ever frame, it was obvious which one would have its way with my RAM layout.

Yep, the SP is great for 'misusing'.

Most or all of my previous projects feature using the stack in some way other than 'normal', Sub Hunter especially, though I didn't think of that particular approach myself, that came from Operation Wolf via a comment about it from arnoldemu.

Quote from: db6128 on 16:09, 08 December 12
Really appreciate the great comments. In particular, it might have taken me a while, perhaps forever, to realise how I'd messed up the screen refreshing, so that's brilliant. Thanks a lot!

You're welcome. I look forward to seeing how this, um, evolves.

db6128 · 22:59, 10 December 12

Thanks to everyone who's Liked this

Quote from: Axelay on 09:42, 10 December 12

...But I guess I'm hoping you eventually find a way to use both screens as buffers so that copy isnt needed at all.

See, the reason I don't use the screen as the buffer is that a simple on/off format is (theoretically) inefficient and also that I've planned to store various other things once I add more features/variations.

I suppose that, for the simplest version where cells indeed can only be on or off (which I might find out is the only one that can run anywhere near fast enough!), it might be the case that the extra speed gained (again, theoretically) from the current type of storage is outweighed by the large copy, and so just using the screen(s) might be best. So now you've made me consider creating an alternative version to test that some time!

Actually, sorry

but I've already changed the branch after the LDIs - it's a byte shorter, 2 NOPs faster, and not dependent on the final address(es) to use JP PE,label instead.

QuoteYep, the SP is great for 'misusing'. Most or all of my previous projects feature using the stack in some way other than 'normal', Sub Hunter especially, though I didn't think of that particular approach myself, that came from Operation Wolf via a comment about it from arnoldemu.

It's really handy, too handy to leave in its boring normal role! Aside from CLSing, the other things I've used it for so far are for ploughing through addresses of scanlines in sprites or to store commonly used 16-bit numbers (e.g. the offset needed to go up or down a line). I try to avoid subroutines, so the SP is rarely needed for anything else, in which case I can save and restore it; at all other times, it would be sitting doing nothing if I didn't give it a task, wasting that valuable ADD HL,SP instruction and forcing me to use normal registers instead.

db6128 · 14:30, 11 December 12

Another minor update and DSK, really just as a proof-of-concept to show how fast it's capable of running without double-buffering the screen – which, for this program, requires copying its entire 16 kB due to the major differences between frames – and also due to a couple of quite significant (and in retrospect, obvious!) optimisations I made. These speed-ups have the following results:

* Disabling double-buffering and hence copying of the screen saves 85 ms per generation.
* More clever (translation: less stupid) mathematics save 3 NOPs for every cell, which, over the current 126×110 grid, is almost 42 ms per generation!
* Different improvements save 2 NOPs for every cell that changes its state at all (on to off or vice-versa); if we assume (with no real mathematical basis : P) that this includes 25% of them, that would increase the saving almost to 50 ms.

So, in total, this version is faster by about 130 ms per generation than the last one I posted. I think that's quite evident as you watch it. I'd estimate it does at least 2.5 generations per second, maybe approaching 3, which is faster than the ~2 achieved by the same source but with double-buffering enabled.

In fact, compared to the first version I posted here, which had a sizeable number of suboptimal parts, the double-buffered variant of this updated version runs slightly faster – despite the fact that it now has 14 more rows!

This is now at a stage where it's almost fast enough that the real-time rendering is not too ugly/distracting and therefore double-buffering is not completely necessary. However, I don't really want to remove the option (yes, you'll be able to choose, once I get far enough to have a menu and so forth!). Firstly, it looks nice – if you can tolerate the reduced speed – and more professional; I try to imagine what the 'lay(wo)man user of the CPC' (a mythical creature nowadays, perhaps?

) would think, and I think they might prefer, maybe be impressed by, double-buffering! Secondly, once I put more complex algorithms in, the rendering might be too slow and thus make double-buffering important to keep things looking nice, or at least tolerable. Quite a conundrum. : P

My next plan is to add whichever of the following is most cooperative with my attempts to code it: Generations, which allows cells to have lifespans rather than just being on or off; or QuadLife, which allows cells to have one of up to four colours, which are inherited from their 'parents'. In relation to the previous paragraph, if I wanted to have >16 ages/states in Generations, which I'd quite like (although only after getting the <= 16 version working), I would need to disable double-buffering anyway in order to free an extra byte for each cell (and also need to decrease the vertical size of the grid).

Ideally, I'll have one program comprising several automata run by dedicated routines, each optimised for purpose, rather than one algorithm that tries to do everything and is unnecessarily slow on simple rules. But one thing at a time! Anyway, I have to stop CPCing for a few days and do some 'real life'... sigh

Prodatron: I just saw in the list of related threads that you made a version of Life for SymbOS. Thanks for clicking Like on my one! I was seemingly able to compile your source using WinAPE, but SymbOS (2, disc version, WinAPE set to 64+256 kB) tells me that it is "not executable in the SymbOS environment". Is it only for the next edition? Anyway, keep up the good work on all of your projects! I hope I'm not too humbled when I do see your version.

Axelay · 14:38, 11 December 12

Quote from: db6128 on 22:59, 10 December 12
See, the reason I don't use the screen as the buffer is that a simple on/off format is (theoretically) inefficient and also that I've planned to store various other things once I add more features/variations.

I suppose that, for the simplest version where cells indeed can only be on or off (which I might find out is the only one that can run anywhere near fast enough!), it might be the case that the extra speed gained (again, theoretically) from the current type of storage is outweighed by the large copy, and so just using the screen(s) might be best. So now you've made me consider creating an alternative version to test that some time!

Actually, sorry but I've already changed the branch after the LDIs - it's a byte shorter, 2 NOPs faster, and not dependent on the final address(es) to use JP PE,label instead.

Devastated!

One day I must have another look at those instructions like JP PE,addr again, I'm so used to using JR for the byte saving I dont usually think about the other options available with JP.

With the software buffer copy, well I'm probably just thinking in terms of game engines too much, the 8 vbls your buffer & screen copy takes seems staggering to me, while compared to the processing per pixel you have to do it looks to be comparatively modest.

Gryzor · 14:43, 11 December 12

Rubbish gfx, no sound. And a mode 0 game with only two colours? Pffft. Control is also so slow it seems your input has no effect at all.

Gryzor · 14:44, 11 December 12

...heheh, always loved Life implementations. Now that I think of it I might look for an Android version to just have around... mesmerising concept, and I like what you did there! I understand optimisation comes first, but your to-do ideas are very nice too...

How about a cell counter at some corner?

db6128 · 15:06, 11 December 12

Axelay: Yeah, it definitely hurts a lot having to do the copy! Even if I exclude the screen, it still burns.

But I can't envision another way. For my previous versions in C and C++, I considered combining the evolving pass over the grid with the copy, i.e. getting each cell's properties from one buffer, checking whether it needs to change, saving it to the other buffer, and eventually flipping the pointers to the two buffers (rather than copying) after the full grid has been processed. I settled on a separate block copy as that seemed much simpler and possibly a bit faster. Perhaps my reasoning about the speed was flawed for high-level languages, but on the Z80, if I'm not mistaken, flipping back and forth between the two buffers with every cell would slow things down massively.

Gryzor: Uh-oh, I've been found out! I thought I could just do a lot of PR and gullible people looking for the next Crysis would just give me their money without asking questions. I would have gotten away with it, too...

Thanks for the nice feedback! Optimisation can take a back-seat now*, as I'm quite happy with its current state, so features will be the name of the game. I agree that a counter would be good: in my C(++) versions, I always wanted to include (but never did due to having no idea how to make a GUI) statistics not only for total population and generation#, but also for things like the number of births, deaths, ageings, etc. in the last generation. However, on the CPC, I think these will have to be calculated upon request only (by comparing the two buffers, just before the copy), because to increment/decrement as I go along would slow things down a lot – not only because 16-bit numbers would be needed, but also because I'd probably have to use IX/IY. And we know what they're like! But yeah, there will be statistics of some kind. Thanks for the good suggestion.

* I'm sure most programmers know what Donald Knuth would say about your "optimisation comes first", but the Z80 doesn't leave us much choice for things like this!

Quotealways loved Life implementations

Next up, a CPC emulator running in Life running on a CPC? Haha, it's always impressive what people can do with Turing machines and the like when they put their minds to it.

Gryzor · 15:36, 11 December 12

I think you could 'play' Life in the original Sims, actually. It would be cool if someone made a cpc-emu mod running Life then. A very esoteric joke.

Prodatron · 00:46, 12 December 12

Wow, it's VERY impressive, how fast this is! Maybe this is the fastest GoL-Implementation on the CPC we have ever seen?!
Realising a GoL-based Turing-machine on the CPC comes closer and closer!

I didn't look at the code, but I wonder:...
- is it optimized for the 23/3 rule? (the original Conway rule for his Game of Life)
- so is the rule "hardwired" or would it be possible to configure other rules?
This summer I developed such an application on the Z80, too. It isn't optimized (and still bugy), and I only focused on the interface/options/figure library/rule settings/memory friendly etc.:

Maybe you can adopt some of the options and features, too, in your optimized version? If I can provide something, just tell me.

CU,
Prodatron

db6128 · 02:38, 12 December 12

I was just going to post another update, but now I have to fire a giant wall of text at poor Prodatron

QuoteWow, it's VERY impressive, how fast this is! Maybe this is the fastest GoL-Implementation on the CPC we have ever seen?!

Thanks very much! That would be much-appreciated coming from anyone, but especially one of the CPC wizards.

Despite searching quite a bit, the only other versions I could find were one by Nich of CPCGameReviews and other projects and a type-in from Amstrad Action back in 1989 (side-note: I very much like, and hope to take advantage of, how the Web Archive has PDFs of loads of issues). Neither seemed to have been designed with speed in mind.

I don't want mine just to be a barebones show of speed, of course, but being as fast as possible is an important starting-point for any later features, when we have so few MHz to work with - which isn't always a bad thing for forcing us to program well! I can't claim to be a genius in coding; it's only because both the algorithm is simple and the Z80 is limited that I have been able to figure out ways to save cycles, and it's been a really educational experience. More on that story below!

QuoteRealising a GoL-based Turing-machine on the CPC comes closer and closer!

Haha, I already proposed the idea of a CPC running Life, and that Life running a CPC emulator. Well, they say nothing's impossible!

QuoteI didn't look at the code, but I wonder:...
- is it optimized for the 23/3 rule? (the original Conway rule for his Game of Life)
- so is the rule "hardwired" or would it be possible to configure other rules?

Rules for birth and survival are looked up from tables, so they could be totally changed if the user wanted. However, I find that Life itself doesn't work very well with rules other than Conway's original one - more evidence for how good a discovery it was, I think! On the other hand, perhaps my main priority is to add support for Generations, in which cells have a lifespan longer than 1: If their count of neighbours goes to a 'death-causing' level, they start to age until they reach their maximal age and die, but during ageing they stop being included in the neighbour-counts of other cells but do continue to occupy space, preventing other cells from being born there. It changes the dynamics a lot and enables many different rules, some of which are really good. Check out the site for the old program MCell and the author Mirek's own rule Star Wars - it's totally fantastic

QuoteThis summer I developed such an application on the Z80, too. It isn't optimized (and still bugy), and I only focused on the interface/options/figure library/rule settings/memory friendly etc.:

Great; thanks for the info, and especially the video, since I couldn't find any footage/pictures, after mentioning it earlier:

QuoteProdatron: I just saw in the list of related threads that you made a version of Life for SymbOS. Thanks for clicking Like on my one! I was seemingly able to compile your source using WinAPE, but SymbOS (2, disc version, WinAPE set to 64+256 kB) tells me that it is "not executable in the SymbOS environment". Is it only for the next edition? Anyway, keep up the good work on all of your projects! I hope I'm not too humbled when I do see your version.

Speaking of which, the sorts of user-friendly features in your version are very similar to what I wanted to add to my previous versions in C(++) and SDL. Back then, I was halted by a total absence of knowledge about programming GUIs! Now, for the CPC, I want to have configurability but not let it affect speed. So, for example, I would only calculate and display statistics if the user requested them, not constantly/real-time. Ideally, there would be enough clock cycles to do things like that continuously, but there aren't! However, your version looks great - as does all of SymbOS, a very impressive achievement - and the options are eerily similar to what I would program if I could.

Thanks for the offer of info; I'll definitely let you know if I have any questions/ideas/etc. once my program starts coming together more!

Anyway, thanks again for the compliments

db6128 · 03:00, 12 December 12

(update at end of post)

Axelay: You'd think I'd remember this, having found out the hard way exactly the same before in my C(++) versions, but here goes: Copying to the future as you go along does not work, because obviously (again, only in retrospect) that'll overwrite cell 90 (for example) with old information after its future has just been altered by changes in cell 89, and so on. So, indeed, the entire buffer generated by the previous iteration has to be copied before one begins the next. I just found this out during my latest frenzied session of coding, which meant that work was a total a waste of time. Just like when I supposedly learnt the same lesson before!

But!

I removed pointless code that was wasting 3 NOPS for every cell! Admittedly, later and unrelated changes have cancelled out 2 of the 3 - but that's still a saving of 14 ms. Besides, I've also removed more pointless code that was wasting 3 NOPs for every cell that changed state . . . and . . . reallocation of registers has saved me another 4 for the same cells.

So that's a guaranteed saving of 14 ms every generation, plus a variable speed-up that rapidly scales up as the degree of action on the grid increases, with an saving of 7 NOPs per changing cell compared to the last version. I tried to estimate how many ms per generation that would add up to on average, but I have no idea what percentage of changing cells to use . . . So let's just pick a figure out of thin air and say it's 30 ms better. : P

Of course, that's hardly significant when compared to the overall time, and less significant after my previous and larger speed-ups - especially if using double-buffering, which is now included as DUBLBUFF.BIN (note: most previous versions already double-buffered, as I only disabled it see how much faster it was) . . . but it's still a nice saving! This must be close to totally optimal now . . . or at least close enough to stop wasting time trying

Overall, I'm quite confident that I've ironed out all the silly mistakes and found shortcuts almost everywhere that's possible - or worth doing!

So, I think I'll celebrate my recent successes with the Z80 by bumping this up to v0.1, which you can download below! Again, thanks to everyone for the compliments, hugely appreciated, and please stay tuned for lots of features in the future.

arnoldemu · 10:26, 12 December 12

I like all the optimisations you are doing.

db6128 · 13:57, 12 December 12

(Edit: slight update to the DSK: I was copying slightly too large a chunk of memory in non-double-buffered mode, after forgetting to change back something I was testing)

arnoldemu: Thanks; I'm glad you're enjoying reading

You'll like the next bit, then!

As it turns out - although not including the fixes for silly delays that shouldn't have been there in the first place, which have stayed fixed - the saving of 4 NOPs for changing cells was outweighed by the added overhead of 2 NOPs on all cells due to the register reallocation that made the quicker changes possible - i.e. using IXH as a loop counter, etc. Changing cells would have to represent half of the entire population during every generation to make that worthwhile, which definitely is not the case. It might be my imagination, but I think the difference is noticeable. . . again

So, although 0.1 should indeed have been a bit faster than 0.04 due to 0.04 having silly delays, this one should be markedly faster than both of the two previous versions. Of course, we're still talking about 10s of ms, but it all counts!

I'd say this algorithm is definitely as good as it gets and now worth moving on from to focus on features, because although I'm proud of the speed I've achieved, I've proved that any further attempts I make to speed it up . . . don't actually work!

Here's the latest DSK.

Axelay · 14:20, 12 December 12

Yes, it's been fun to follow the development.

Quote from: db6128 on 03:00, 12 December 12

(update at end of post)

Axelay: You'd think I'd remember this, having found out the hard way exactly the same before in my C(++) versions, but here goes: Copying to the future as you go along does not work, because obviously (again, only in retrospect) that'll overwrite cell 90 (for example) with old information after its future has just been altered by changes in cell 89, and so on. So, indeed, the entire buffer generated by the previous iteration has to be copied before one begins the next. I just found this out during my latest frenzied session of coding, which meant that work was a total a waste of time. Just like when I supposedly learnt the same lesson before!

So are you saying you need two copies of the 'current' cell map to produce a new generation?

db6128 · 14:25, 12 December 12

I just reuploaded the DSK of 0.11 due to it previously copying slightly too big a chunk of memory in non-double-buffered mode, which was left behind from testing. Sorry for spamming you all with DSKs :|

Axelay: I'm glad you've found it interesting! By the way, I've heard a lot about your games, so I'm going to repay the favour by checking them out tonight.

Yep, we need to copy the most recently generated future to the present, so they're both now the same, and then compute (write) the new future whilst leaving the present (from which it's read) completely unchanged.

Edit: This is at least the case for the particular way that I process the evolution. A more naive method would probably make copy-as-you-go possible. I chose my one because it saves a lot of time-wasting during the checking process. However, it does necessitate the full copy afterwards. On balance, though, it's totally worth it.

Prodatron · 16:42, 12 December 12

Quote from: db6128 on 02:38, 12 December 12Rules for birth and survival are looked up from tables, so they could be totally changed if the user wanted.

That's really good! As I could see Nich C. implemented it hardcoded (but that doesn't make a big difference for the optimization)

Quote from: db6128 on 02:38, 12 December 12However, I find that Life itself doesn't work very well with rules other than Conway's original one - more evidence for how good a discovery it was, I think!

Regarding the complexity of different "life forms" in such a world, this is really true! Though some of the other alternative worlds are producing also very interesting results (like the copy- or the labyrinth-world). But yes, you can spend much more time with Conway's 23/3 world.

Quote from: db6128 on 02:38, 12 December 12On the other hand, perhaps my main priority is to add support for Generations, in which cells have a lifespan longer than 1: If their count of neighbours goes to a 'death-causing' level, they start to age until they reach their maximal age and die, but during ageing they stop being included in the neighbour-counts of other cells but do continue to occupy space, preventing other cells from being born there. It changes the dynamics a lot and enables many different rules, some of which are really good.

I never saw this before, this sounds really interesting!

Quote from: db6128 on 02:38, 12 December 12Prodatron: I just saw in the list of related threads that you made a version of Life for SymbOS. Thanks for clicking Like on my one! I was seemingly able to compile your source using WinAPE, but SymbOS (2, disc version, WinAPE set to 64+256 kB) tells me that it is "not executable in the SymbOS environment". Is it only for the next edition? Anyway, keep up the good work on all of your projects! I hope I'm not too humbled when I do see your version.

Sorry, I missed this before. Did you assemble the "App-GameOfLife1.asm" (which is somehow the "make"-file, which includes the other ones)? Anyway I will attach the actual version as single files/DSK/source codes.
My version is clearly slower than yours one. I think the main reason is, I am using a bit-field instead of a byte-field. As my field is 64x64, only 512bytes (=64x64/8) are required for one copy of the field and 1KB for both fields together. But this makes it probably much slower, because of all this bit shifting/handling stuff.

You can find the two core-routines at:
CELCLC (calculates the whole field) and CELCNT (subroutine used by celclc, which counts the neighbours of one field)
The average time for calculating one cell is about 145 NOPs - just for generating it in the new bitfield (without the graphic output).
I wonder how much time is needed for the "core"-calculation of one of your cells?

CU,
Prodatron

db6128 · 19:33, 12 December 12

QuoteThat's really good! As I could see Nich C. implemented it hardcoded (but that doesn't make a big difference for the optimization)

In fact, in many cases, it can be quicker to look it up from a table, as I don't have to do multiple CPs/JRs – this is always faster when, as with S23, there are at least two possible 'yes' numbers and the first is not found, which occurs most of the time – and it will certainly be a lot faster for rules with more possible matches.

QuoteI never saw [Generations] before, this sounds really interesting!

Yep, you're in for a real treat when you check out some of those rules.

QuoteDid you assemble the "App-GameOfLife1.asm" (which is somehow the "make"-file, which includes the other ones)? Anyway I will attach the actual version as single files/DSK/source codes.

Yeah, I started from that file, and WinAPE said everything worked (although I had to remove some weird characters at the start of each file), but SymbOS wouldn't execute the resulting file.

QuoteMy version is clearly slower than yours one. I think the main reason is, I am using a bit-field instead of a byte-field. As my field is 64x64, only 512bytes (=64x64/8) are required for one copy of the field and 1KB for both fields together. But this makes it probably much slower, because of all this bit shifting/handling stuff.

It's a great example of the classic trade-off between speed and memory! I can see why you used this method, which is very efficient on RAM – very nice in itself, and important for SymbOS, I imagine. Me, I'm monopolising almost the entire memory just to save NOPs.

QuoteThe average time for calculating one cell is about 145 NOPs - just for generating it in the new bitfield (without the graphic output).
I wonder how much time is needed for the "core"-calculation of one of your cells?

I guess I'll do some sums later and let you know. Whatever the time, it'll be even better now than it was before, as I'll explain in my next post.

db6128 · 19:40, 12 December 12

I think I found one more silly mistake, although it apparently wasn't messing everything up as one would imagine it would, so maybe something else was counteracting it...?! Whatever, it's gone now.

And folks, I've been doing a little something that I'm going to christen (hopefully I'm the inventor of this phrase

) 'unrolling branches': I duplicated my end-of-loop code into each place where it's needed, rather than having to jump to it from various other points; and I also changed some branches to take advantage of how JR is 1 NOP faster when the condition is false. A full explanation would be a bit complex and boring, but as far as I understand the relevant timings, this:
* saves 3 NOPs for dead cells that are born
* saves 2 NOPs for dead cells that stay dead
* saves 2 NOPs for alive cells that die
* costs 2 NOPs for alive cells that survive, one of which is due to the added distance meaning that I have to to use DEC B:JP NZ instead of DJNZ

Considering that most cells are dead and stay dead, 2 NOPs each is bound to be a massive saving in total. For dead cells that are born, there is also a bigger saving of 3 NOPs. Also, although there's an added NOP for live cells that stay alive. I imagine that, in an actual game, many live cells die – either as part of complex, ever-changing patterns or as part of oscillating shapes. And guess what? Live cells that die save 2 NOPs!

To analyse it a little more: The slight cost of live cells that stay alive (2 NOPs) would only hypothetically matter on a grid in which most cells had settled into still-life or oscillating patterns; in either case, it would be outweighed by large speed-ups from the majority of cells that will be dead and stay dead (save 2 NOPs), and in the case of oscillators, from the changing cells (save 3 NOPs from dead to alive or 2 NOPs from live to dead).

So, overall, this is a massive boost to speed. Although, yes, also to my greedy usage of RAM.

Earlier, after I realised that I could unroll the branches like this, I also realised that (A) it's 2012-12-12 and (B) the successful revision would be v0.12. Pretty cool! I have given it an even more interesting version number to celebrate.

Please have a look if you are interested; I think the gain in speed is especially significant this time!

The files are MAXSPEED.BIN (no double-buffering) and DUBLBUFF.BIN.

Prodatron · 19:50, 12 December 12

Great stuff!

Would a 6502 be able to perform the same??

Btw, it's quite easy to measure the time:
- set a breakpoint in WinApe at the beginning and at the end of the routine
- when reaching the first breakpoint, clear the timer (directly below the registers)
- when reaching the second breakpoint you will see the amount of NOPs
(maybe you already knew this)

CU,
Prodatron

db6128 · 20:20, 12 December 12

Thanks for the positive feedback! And for the useful hint, which means I can now answer your question:

Just from running through the first 40 or so generations of the non-double-buffered version in WinAPE, including the copy of the buffers, each seems to take approximately 440–460 ms, so about 2.222... per second. This suggests that my previous estimates were maybe a bit high, especially as the versions only get slower as we go back in time.

Since the copy takes 74337 NOPs, and assuming an average 450000 NOPs per total generation, that means the processing of each cell takes an average of (450000-74337)/(126*110) = 27.1 NOPs.

Of course, these figures are not necessarily reliable for further-evolved random patterns, or for user-designed patterns (something that I am, of course, going to make possible) – however, if anything, both are likely to be faster than my numbers suggest, as the number of living and especially constantly surviving cells will probably decrease with time and/or in engineered patterns, which would speed things up even more, as I described earlier. In other words, we can't lose

Axelay · 13:02, 13 December 12

Quote from: db6128 on 14:25, 12 December 12
Axelay: I'm glad you've found it interesting! By the way, I've heard a lot about your games, so I'm going to repay the favour by checking them out tonight. Yep, we need to copy the most recently generated future to the present, so they're both now the same, and then compute (write) the new future whilst leaving the present (from which it's read) completely unchanged.

Hmm, I guess considering the number of SET/RES instructions you might need to add to avoid copying the visible screen, the copy may well be the faster option after all!

With respect to my games, if you are interested (mad?

) then Edge Grinder also has commented (after a fashion) source.

News:

Conway’s Game of Life for CPC—very fast+many features coming! New: QuadLife! p.3