I tried to port Smalltalk to the CPC. It didn't work out in the end.

Started by PulkoMandy, 15:32, 24 February 24


PulkoMandy

Next blogpost in the series

I got the VM to run  ;D

Unfortunately, it only runs a handful of methods before running out of space in the new indirection table that I added  :(

Time to let this rest for a few days and think about the best way to move from there.

PulkoMandy

Well it was a little more than a few days...

Someone mentioned this in another thread so it's time for an update :)

I resumed work on the project last week. As mentioned in the previous post, I have removed the indirection table again and went for objects directly referencing each other. This means:
  • The pointers/references are 3 bytes (bank + address)
  • The garbage collector will have a bit more work to do to update all references when moving an object
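As a rough illustration of the second point (not the actual code of the project): without an indirection table, moving an object means finding every slot that still holds its old bank + address pair and rewriting it. The `Ref` struct, `forward_refs`, and the flat array of slots are all hypothetical names and simplifications; a real collector would walk the objects stored in the banks instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 3-byte reference: bank number + 16-bit address.
   A desktop compiler may pad this struct to 4 bytes. */
typedef struct {
    uint8_t  bank;
    uint16_t addr;
} Ref;

static int ref_equal(Ref a, Ref b)
{
    return a.bank == b.bank && a.addr == b.addr;
}

/* Rewrite every slot that still points at `from` so that it points at
   `to` (the new location of the moved object).
   Returns the number of slots that were updated. */
size_t forward_refs(Ref *slots, size_t count, Ref from, Ref to)
{
    size_t updated = 0;
    assert(slots != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (ref_equal(slots[i], from)) {
            slots[i] = to;
            updated++;
        }
    }
    return updated;
}
```

With an indirection table only one table entry would change per move; here the cost grows with the number of live references, which is the extra work mentioned above.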

I also updated to the latest nightly build of SDCC, which has improved code generation a bit. This allows fitting more things in the ROM. The generated code is still not great, and compilation times are ridiculous (at least 45 minutes to build 16K of code, up to 5 hours if I ask the compiler to optimize a bit more). Modern software...

The current result is:

  • The interpreter runs far enough to display the smalltalk prompt
  • I can enter an expression in the prompt
  • The compiler starts converting the expression into bytecode, but it runs out of memory (512K of banks + 16K of main memory are reserved for the object heap) before managing to compile and execute it (I just tried to run something simple like 5 + 3)

There is no garbage collection yet and the interpreter creates a lot of temporary objects, which normally would be almost immediately deleted by the garbage collector. So the memory usage will be massively reduced when I implement that.

But I prefer to get "5 + 3" fully working before I move on to implementing garbage collection, so that I have a simple test scenario to test with. I have some ideas to achieve this. Also I should start rewriting the code in assembler directly (one function at a time) to make it smaller and more efficient. And even before doing that I have some optimizations to do in the C code that I apply manually (each new development requires finding a way to save space in the ROM for it).

At least initially, I may move the garbage collector to a separate ROM. The interface between the two parts of the code is currently 4 functions, but it can probably be reduced to 1 or 2 with extra parameters. So that provides a nice interfacing point to split the ROM in two. And maybe later I can optimize things enough to consider merging them again.

You can see some screenshots and day-by-day notes and milestones on my Fediverse account:
https://mastodon.tetaneutral.net/@pulkomandy/112107254263308680

PulkoMandy

After some bugfixes and optimizations, I managed to get it to compute 4 + 3 and print the result, and then go back to the prompt. This means the interpreter is fully working.

It is also very slow and it needs almost all of the 512K of memory to compute this.

The next step is writing a garbage collector that works with banked memory so that I can avoid eating all the RAM so quickly. I have some ideas about it.

Then, in no specific order:
  • Rewrite it all in assembler so it is faster
  • Wire in the function to call an editor (maybe orgams, maybe protext, maybe something else?) so you can actually edit methods, add new classes, etc
  • Include the disassembler method from Andy Valencia's version of Little Smalltalk so you can edit existing methods (disassemble bytecode, edit, reassemble bytecode)
  • Implement file output so you can save your working session and reload it later
  • Write some classes to access CPC firmware and hardware
  • Write some example programs and so on

Currently it fits in a single ROM, but I will need a second one for the garbage collector (at least initially, until everything is rewritten in assembler), and the "image" file is under 50KB. Memory usage to get to the prompt is also around 50-60KB, so maybe it can work on a stock CPC 6128, but it will be quite bad once you start actually doing things.

In theory it's also possible to use mass storage as swap and load things in/out as needed. But I hope with 512K of RAM it is not needed for most projects, and I'm more interested in making it reasonably fast than making it use more space.

Prodatron

I never had a look at Smalltalk, but I have a little question, I hope it isn't too stupid:
AFAIK the first Smalltalk-80 was developed on the Xerox Alto. This machine had a 5.8MHz 16-bit TTL CPU and 128-512K of RAM.
Since it sounds like it is really not easy to get Smalltalk running on a machine with seemingly similar technical data, I wonder if they used a different Smalltalk for the Alto, or was its CPU very different and optimized for running it?

GRAPHICAL Z80 MULTITASKING OPERATING SYSTEM

PulkoMandy

Not stupid at all, indeed that is what I'm trying to understand. How does Smalltalk manage to run so well (including a GUI) on the Alto, which doesn't seem much more powerful than the CPC?

First of all, the CPU. It runs at around 6MHz but it does a lot more than a regular CPU. It is microcoded, meaning assembler instructions are implemented using an even simpler low-level language. The microcode runs not only a CPU as we understand it, but also other tasks such as feeding data to the display, managing the disk drive, etc (things that on the CPC are offloaded to the FDC and CRTC). So, in reality the CPU does not run so fast.

However, an interesting feature is that the Microcode is loaded in a special RAM, and is reprogrammable. As a result, Smalltalk can make some things faster by implementing them in microcode. The display drawing also takes advantage of this, by having a "blitter" implemented in microcode (to perform various copies and transformations of display data, including text rendering and so on).

Next, the memory. There is 128K of RAM in a base Alto. Half of it is for the display. Once you have the Smalltalk interpreter running (including the low level OS layer to access the disk, network, etc), there is only about 20K of RAM free for the Smalltalk objects. This is not much. What they did is constantly swap objects in and out of memory and write them back to disk. If you watch some demonstrations running on the real hardware, you can hear the disk seeking all around every time they do anything with the OS. This also means they have to keep a table in RAM to know if an object is loaded in RAM, or otherwise, where to find it on disk. This table is several kilobytes big and uses up a lot of the available RAM. So, really, it's as if the RAM is used as a cache, and the real RAM is the hard disk.

In some cases they even turn the display off to free more time for the CPU to run faster as well.

Overall, it runs quite slowly on the Alto. But, compared to other environments available at the time, the main advantage is you can edit things live while the system is running. You don't have to recompile or even restart your program. So, overall, the development cycle is still faster, but it is not exactly a fun platform to use.

Finally, what was released out of Xerox (Smalltalk-80) is a later version that did not run on the Alto, but on later machines like the Dorado. These are quite a bit bigger and faster. So, to get a good impression of what it was like on the Alto, you have to look at Smalltalk-76.

Interestingly, there is a Smalltalk-78 that was ported to an 8086 based machine with 256K of RAM. This one does not use object swapping (there were only floppy disks, no hard drive, so that would be too slow). There is an "image" of it available; however, this version was never released publicly back in the day, and is not very well documented. But there certainly are interesting ideas there to save memory. Also, it has another problem: Smalltalk-80 was the first version to be converted to ASCII. The previous ones use all kinds of strange symbols ("looking eye", "open colon", "pointing finger", "up arrow" and "left arrow", ...) which I find a bit confusing. Also, the machine used 3 8086 CPUs to do all its tasks, and it was still quite slow.

That's why I went with Little Smalltalk instead. This is an ASCII version of Smalltalk, and since it removes all of the GUI, it ends up being quite a bit smaller than Smalltalk-80. My current code is not optimized at all, I wanted to get it running first. Since the interpreter is very "hot" code (run over and over again), optimizations in key areas, even saving just a microsecond or two, can have a huge impact. This is both for low-level optimizations (rewriting things in assembler) and also high level ones (adding a "method cache" so that if you call a method several times, the lookup of the method in the class and parent classes isn't done each time; avoiding allocating memory when you don't need it, ...).
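To make the "method cache" idea a bit more concrete, here is a minimal sketch, assuming classes, selectors and methods can each be identified by a 16-bit number. The names (`CacheEntry`, `cache_lookup`, `cache_store`), the table size and the hash are all made up for the example and are not taken from Little Smalltalk or from the CPC port.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SIZE 64u  /* assumed size; must be a power of two */

typedef struct {
    uint16_t class_id;  /* hypothetical class identifier */
    uint16_t selector;  /* hypothetical selector identifier */
    uint16_t method;    /* hypothetical method identifier, 0 = empty slot */
} CacheEntry;

static CacheEntry cache[CACHE_SIZE];

/* Direct-mapped cache: each (class, selector) pair maps to one slot. */
static unsigned slot_for(uint16_t class_id, uint16_t selector)
{
    return (unsigned)(class_id ^ selector) & (CACHE_SIZE - 1u);
}

/* Returns the cached method, or 0 on a miss. On a miss the caller does
   the full lookup through the class and its parents, then stores it. */
uint16_t cache_lookup(uint16_t class_id, uint16_t selector)
{
    const CacheEntry *e = &cache[slot_for(class_id, selector)];
    if (e->method != 0 && e->class_id == class_id && e->selector == selector)
        return e->method;
    return 0;
}

/* Records the result of a full lookup, evicting whatever was in the slot. */
void cache_store(uint16_t class_id, uint16_t selector, uint16_t method)
{
    CacheEntry *e = &cache[slot_for(class_id, selector)];
    assert(method != 0);  /* 0 is reserved to mark an empty slot */
    e->class_id = class_id;
    e->selector = selector;
    e->method = method;
}
```

A cache like this would also have to be cleared whenever a method is added or edited, otherwise it could keep returning the old one.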

Finally, on memory usage: Smalltalk on the Alto used 16 bit object references. I went with 24 bit (bank number + 16 bit pointer) so all my objects are 50% larger than on the Alto. Also, my code is constantly bank switching when accessing different objects, it may make sense to cache a few of them in main RAM. For example: the bytecode of the currently running method, its local variables, and its execution stack, as well as the current object and class. But then, of course, this introduces cache invalidation problems and I have to make sure the cached main RAM version is in sync with the original. But it is probably worth the effort.

Prodatron

Thanks a lot for this great explanation!
I was already wondering what advantages these 70s TTL/bitslice CPUs could get from using tons of microcode.


HAL6128

Xerox invented a lot of wonderful ideas back then. Impressive.
...proudly supported Schnapps Demo, Pentomino and NQ-Music-Disc with GFX

Prodatron

Yes, as the father of the GUI and Ethernet already in 1973, but also somehow of the personal computer, etc., the Xerox Alto was one of the most impressive IT projects I've ever seen.
For me it's very fascinating, that @PulkoMandy is working on the Alto-related Smalltalk project.


PulkoMandy

Quick update: the garbage collector is working. So now the RAM is not getting filled up as fast as before.

Benchmarks:
30 seconds from starting to getting the prompt
2 minutes to compile and execute "4 + 3"

Now it's time to rewrite the C code in assembler to make it acceptably fast

GUNHED

Great! This wonderful thing is proceeding! Good luck with your further work.  :) :)

I would like to see it one day also running for FutureOS. Since you seem to code close to hardware, it should not be a big deal to convert the version for native OS.
http://futureos.de --> Get the revolutionary FutureOS (Update: 2024.10.27)
http://futureos.cpc-live.com/files/LambdaSpeak_RSX_by_TFM.zip --> Get the RSX-ROM for LambdaSpeak :-) (Updated: 2021.12.26)

PulkoMandy

The firmware is used for text output and keyboard scanning, as well as amsdos for disk access. I think this should be easy to convert to another system.

I will later also need to launch a text editor to edit some text loaded in RAM, and get the edited result back.

I have no idea about Future OS memory layout to see if that would cause me any problems.

GUNHED

Quote from: PulkoMandy on 07:28, 15 October 24
The firmware is used for text output and keyboard scanning, as well as amsdos for disk access. I think this should be easy to convert to another system.

I will later also need to launch a text editor to edit some text loaded in RAM, and get the edited result back.

I have no idea about Future OS memory layout to see if that would cause me any problems.
Yes, for text output FutureOS functions can be of help. However, they deal more with pages of text (not archaic command lines and text flow like in CP/M output). But that should be no problem.

Text editor: FutureOS contains a function for editing text (defined length, defined lower and upper character). Also FutureTex can be involved as app.

The usable memory is the following: 0-&B7FF. For disc I/O buffers are needed. RAM from &0000 to &AFFF can always be used. And &B000 to &B7FF can be used as buffers.
And for expansion RAM (&4000-&7FFF) management there are a lot of functions, using up to 4 MB (if connected).

To do an implementation for FutureOS I can do all that part. In your source code there would be just some 'IF' and 'ENDIF' commands, to be able to assemble for a different target OS.

A further advantage is that FutureOS leaves all RSTs for you and you can use all registers (2nd register set!).

Let me know the time you think is right, and I can do the implementation.

Meanwhile I really wish you luck and success with this great and ambitious project!  :) :) :)


zhulien

If you used packed far addresses, you would be able to fit 50% more object pointers in your 16k object heap (2 bytes each instead of 3).

PulkoMandy

I'm not sure what packed format you would use. Currently this needs at least 5, probably 6 16K pages of memory, so I can't store them in 2 bits alongside with the 14 bit address. And that's just with the initial set of objects loaded, not having written any smalltalk code myself yet. Any application will need a bit more.

Also, the bank number byte is used for a few other things: set to 0 for integer values, 1 for individual character values (not strings, they are stored in a more compact way), and I may need some more of these later.

So there's no way I would fit all this information into only two bytes

zhulien

The RAM tab here outlines two schemes that might be possible to use for 16-bit packed far addresses:

https://docs.google.com/spreadsheets/d/1XgRVlh27K_C0-gMtroMhN8lK9mAQxVQg1x-3M42kBYo/edit?usp=drivesdk

You should be able to support 4mb within 16bit packed addresses if you align to 64byte boundaries.


"OPTION 2:

address:   xxxxxxyyyyyyyyyy
capacity:   x = 64 x 64kb = 4mb (note: 1kb wasted per 64kb) so effective almost 4mb
alignment:   y = 256 x 64byte blocks per 16kb (1st block is header)

stats:     

64kb has 1024 blocks, 4 entries in the MainRAT since there are 4 16kb banks
256kb has 4096 blocks, 16 entries in the MainRAT since there are 16 16kb banks
512kb has 8192 blocks, 32 entries in the MainRAT since there are 32 16kb banks
4mb has 65536 blocks, 256 entries in the MainRAT since there are 256 16kb banks

pros:     

address fits within 16 bits
still works with IX and IY registers
lots of blocks for certain types of applications

cons:     

slower to translate due to 64 byte alignment
ram allocation table (RAT) is larger

RAT per 16kb:"

PulkoMandy

An address in a 16K bank needs 14 bits if you don't do any alignment.
With 512K of banks, you need 5 bits to store the bank number (2^5 = 32 banks of 16K)
That would be 19 bits. So you need to remove 3 bits from the address and align everything to 8 bytes.

OK, sure, that can work. But it will make everything much slower (a lot of bitshifting needed to decode things into a usable address) and more complicated, and waste up to 7 bytes per object.
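The arithmetic above can be written out as a sketch in C, assuming the 5 bank bits go in the top of the word and the 14-bit offset loses its low 3 bits. The function names and this exact bit order are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define ALIGN_SHIFT 3u   /* objects aligned to 8 bytes */
#define OFFSET_BITS 11u  /* 14-bit offset minus the 3 dropped bits */

/* Pack a bank number (0..31) and an 8-byte aligned offset inside a
   16K bank (0..0x3FF8) into 16 bits: bbbbbooo oooooooo. */
uint16_t pack_ref(uint8_t bank, uint16_t offset)
{
    assert(bank < 32u);
    assert(offset < 0x4000u);
    assert((offset & 7u) == 0);
    return (uint16_t)(((uint16_t)bank << OFFSET_BITS) | (offset >> ALIGN_SHIFT));
}

uint8_t packed_bank(uint16_t packed)
{
    return (uint8_t)(packed >> OFFSET_BITS);
}

uint16_t packed_offset(uint16_t packed)
{
    return (uint16_t)((packed & 0x07FFu) << ALIGN_SHIFT);
}

/* Bytes lost when an object of `size` bytes is rounded up to 8. */
uint16_t padding_for(uint16_t size)
{
    return (uint16_t)((8u - (size & 7u)) & 7u);
}
```

All 16 bits are used up by the bank and offset, so there is no room left for any tag, and the worst case padding is 7 bytes (for example for a 9 byte object).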

May be fine for a generic memory allocator, but for smalltalk, a typical object size is 2 to 10 cells at most (and I can save one byte from that)

With 3 byte cells as I have now, this is 5 to 29 bytes.
With 2 byte cells and rounding up to 8 bytes, it would be 8 to 24 bytes.

And also, as I mentioned, the extra bits in the 3 byte address are used for other things. So I would probably need one or two bits more, because I need that to store some things that are not pointers (integers, chars, possibly more in the future). That would increase the allocation alignment requirement to 16 or even 32 bytes. Not a good choice for Smalltalk, which does a lot of very small allocations and a lot of pointer dereferencing, even if it would work for many other cases where the developer has control of the memory layout, and will probably do larger chunks of allocation and store related data together inside them, possibly with normal pointers for most of it except when it points to another allocation block.
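The size figures above can be reproduced with a few lines of C, under the stated assumptions only: 3 byte cells with one byte saved per object, versus 2 byte cells rounded up to the allocation alignment. `size_wide` and `size_packed` are names made up for this comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Object size with 3-byte cells, minus the one byte that can be saved. */
uint16_t size_wide(uint16_t cells)
{
    assert(cells > 0);
    return (uint16_t)(cells * 3u - 1u);
}

/* Object size with 2-byte cells, rounded up to `align` bytes
   (8 with no tag bits, 16 or 32 once tag bits are needed). */
uint16_t size_packed(uint16_t cells, uint16_t align)
{
    uint16_t raw = (uint16_t)(cells * 2u);
    assert(align > 0);
    return (uint16_t)((raw + align - 1u) / align * align);
}
```

For 2 to 10 cells this gives 5 to 29 bytes with 3 byte cells and 8 to 24 bytes with 8 byte alignment, as above. With 16 byte alignment the range becomes 16 to 32 bytes, and with 32 byte alignment every such object takes 32 bytes.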

zhulien

But 512k gives 8192 allocations with the above... 4mb gives 65536. Is that still too little for a typical smalltalk application?  Of course byte aligning wastes space, but that is a lot of memory for a z80.

What would a typical number of objects be within a smalltalk application? If it is so memory block intensive does a smalltalk application spend more time allocating memory than program logic?

PulkoMandy

Just loading the base image creates more than 2000 objects/allocations, filling up about 50K of memory. Then, running the interpreter creates a lot more, since every method call requires allocating at least a context (and a few more things if there are arguments), and every operation (including things like adding two integers) is a method call. So, yes, currently most of the work is in the allocator.

There is a way to improve this by having some objects allocated elsewhere, especially method contexts can, most of the time, be handled like a stack, and are short lived. But there are some cases where this doesn't work due to some quirks in the language, so, sometimes these objects will first be allocated in the stack, and then moved to the heap. This stack can be in main memory, which will have the advantage that accessing these objects will not need any bankswitching, the interpreter can benefit from this.
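A minimal sketch of that idea, with made-up names and sizes: contexts are pushed on a small stack in main memory and popped on return, and the rare context that has to outlive its method call is copied to the heap first. The `heap` array only stands in for the banked object heap, and updating the references to a promoted context is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_BYTES 256u  /* assumed size of the context stack */
#define HEAP_BYTES  256u  /* stand-in for the banked object heap */

static uint8_t ctx_stack[STACK_BYTES];
static size_t  ctx_top;
static uint8_t heap[HEAP_BYTES];
static size_t  heap_top;

/* Allocate a context on the stack; returns NULL if the stack is full. */
uint8_t *ctx_push(size_t size)
{
    uint8_t *p;
    if (size > STACK_BYTES - ctx_top)
        return NULL;
    p = &ctx_stack[ctx_top];
    ctx_top += size;
    return p;
}

/* Release the most recent context when its method returns. */
void ctx_pop(size_t size)
{
    assert(size <= ctx_top);
    ctx_top -= size;
}

/* Copy a context that must outlive its call to the heap.
   Returns the heap copy, or NULL if the heap is full. */
uint8_t *ctx_promote(const uint8_t *ctx, size_t size)
{
    uint8_t *copy;
    if (size > HEAP_BYTES - heap_top)
        return NULL;
    copy = &heap[heap_top];
    memcpy(copy, ctx, size);
    heap_top += size;
    return copy;
}

size_t ctx_stack_used(void)
{
    return ctx_top;
}
```

In the common case a call costs one pointer bump and a return costs one subtraction, with no garbage left behind for the collector.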

In the current situation, at any given time there are few objects really in use, but the system goes quickly through all the memory while allocating and immediately releasing objects.

The main question is: how long does an assembler routine take to "unpack" such a 16 bit pointer, back into a bank number to write in the gate array and an offset in the bank? And how long to perform the reverse operation? In smalltalk, these will happen a lot, and would further slow things down. And right now, for the thing to be practical, I need speed more than I need extra memory space/more compact objects. So I went with the larger but faster way

Prodatron

Quote from: PulkoMandy on 09:05, 17 November 24
The main question is: how long does an assembler routine take to "unpack" such a 16 bit pointer, back into a bank number to write in the gate array and an offset in the bank?
I was just curious:

;input      HL=8byte aligned 512K address (rrrbbaaa aaaaaaaa)
;output    #4000-#7fff = 16K block, HL=address in 16K block
;destroyed  AF,BC

        ld a,h          ;1      a=rrrbb...
        rrca:rrca      ;2      a=..rrrbb.
        ld c,a          ;1
        and #38        ;2
        or #c4          ;2
        ld b,a          ;1      b=11rrr100
        ld a,c          ;1
        rrca            ;1
        and #03        ;2      a=000000bb
        or b            ;1      a=11rrr1bb
        ld b,#7f        ;2
        out (c),a      ;4
        add hl,hl      ;3
        add hl,hl      ;3
        add hl,hl      ;3      hl=..aaaaaaaaaaa000
        res 7,h        ;2
        set 6,h        ;2      hl=01aaaaaaaaaaa000 -> 33 NOPs
        ret


33 NOPs for unpacking and bankswitching, maybe someone can optimize this.

You are probably right.


zhulien

For me, I would pull out the rrrrbb bits and shift them to look up a bank table to read some bank info, but that has more to do with how I do my bank switching than with calculating the bank. Calculating it can get a little complicated if you want to support 4mb, to me anyway. Then of course byte-align the other half as you did.

I was also thinking the heap could be treated a (tiny) bit like a stack, with a pointer to the next 64-byte offset that moves as required for allocations. In any case it might speed up allocations, but it won't speed up the translation of the packed addresses.
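Roughly, that stack-like allocation could look like the sketch below for a single 16K bank of 64-byte blocks. The names (`BumpBank`, `bump_alloc`) are invented for the example, and starting at block 1 assumes the first block of each bank is kept as a header, as in the scheme quoted earlier.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE      64u
#define BLOCKS_PER_BANK 256u  /* 16K / 64 bytes */

/* One 16K bank; next_block is the first block that is still free.
   Initialise it to 1 if block 0 is reserved as a header. */
typedef struct {
    uint16_t next_block;
} BumpBank;

/* Allocate `size` bytes rounded up to whole 64-byte blocks.
   Returns the index of the first block, or -1 if it does not fit. */
int bump_alloc(BumpBank *bank, uint16_t size)
{
    uint16_t blocks = (uint16_t)((size + BLOCK_SIZE - 1u) / BLOCK_SIZE);
    int first;
    assert(bank != 0 && bank->next_block <= BLOCKS_PER_BANK);
    if (blocks == 0 || blocks > BLOCKS_PER_BANK - bank->next_block)
        return -1;
    first = bank->next_block;
    bank->next_block = (uint16_t)(bank->next_block + blocks);
    return first;
}
```

Allocation is then a compare and an add, but freed blocks are only reclaimed when something compacts the bank and moves the pointer back.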

I would say it could be worth a POC at 33 NOPs if it gives more memory and allows Smalltalk to work.

PulkoMandy

Smalltalk is already working, but very slow. I don't really have a ram usage problem with a 512K machine (I decided to not handle more than that for simplicity).

Now I will be trying to make it faster. The code is open if someone wants to try making it slower but smaller :)

GUNHED

Or bigger and quicker  ;D

Great to see you advancing in this interesting project!  :) :) :)
