Fast MODE 2 text printing routines

Started by opqa, 12:30, 25 January 15


opqa

I've always thought that the strange memory layout of the CPC screen could be useful for one thing: fast text printing. But to take advantage of it you need "special" printing routines that work line by line instead of character by character. Oddly enough, the firmware doesn't use this technique, and I haven't seen it used anywhere else either, so I decided to write my own routines and share them with the forum.

In the attached zip there are three routines, copycharset, superfastprint, and fastprint.

Copycharset must be invoked once. It copies the whole 2 KB ROM charset into RAM, rearranged line by line. The RAM area that holds this copy is chosen at assembly time through the CHARSET_BASE constant; it can start at the beginning of any 256-byte page. Of course, if you're not going to use the whole charset you can overwrite the space taken by the unused characters. The only problem is that this space is spread evenly over 8 separate "holes" along the 2 KB, much like the free space of the CPC screen is spread along its 16 KB. Once the charset has been copied you can use superfastprint or fastprint as described in the comments.
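The rearrangement can be sketched in a few lines of Python. This is only a model under assumptions: the source font is taken to be the usual char-major 8x8 layout (byte `char*8 + row`), the copy is taken to be row-major (byte `row*256 + char`, so the row selects the 256-byte page), and `rearrange_charset`, `holes` and the synthetic `fake_rom` are hypothetical names for illustration, not part of the attached sources.

```python
def rearrange_charset(rom_charset: bytes) -> bytes:
    """Char-major font (char*8 + row) -> row-major copy (row*256 + char)."""
    assert len(rom_charset) == 256 * 8
    out = bytearray(256 * 8)
    for char in range(256):
        for row in range(8):
            out[row * 256 + char] = rom_charset[char * 8 + row]
    return bytes(out)


def holes(first_unused: int, last_unused: int) -> list:
    """Offsets (start, end inclusive) freed in the copy if a run of characters is unused."""
    return [(row * 256 + first_unused, row * 256 + last_unused) for row in range(8)]


# Synthetic stand-in for the ROM font, for illustration only.
fake_rom = bytes((char * 8 + row) * 7 & 0xFF for char in range(256) for row in range(8))
line_major = rearrange_charset(fake_rom)

# Row 3 of character 65 ('A') now lives in page 3 at low byte 65.
print(line_major[3 * 256 + 65] == fake_rom[65 * 8 + 3])
# Dropping the 32 control codes frees eight separate 32-byte holes, one per page.
print(holes(0, 31))
```

With this layout a printing loop only has to load the character code into the low byte of a pointer; the high byte stays fixed for a whole pixel row.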

Superfastprint is the fastest, but it has to disable interrupts while printing, because it uses the stack pointer to read the text buffer.
Fastprint is a little slower, but it can be interrupted safely.
There's a third option with intermediate speed: superfastprint can be assembled with the REENABLE_INTERRUPTS constant set to 1. It then runs with interrupts disabled but briefly re-enables them every 64 printed bytes (~7.5 scanlines).

Here are some performance tests, obtained by printing a whole 2000-character screen. In the cases where interrupts are fully or partially enabled, the interrupt routine has been patched to "ei:ret" so that its duration doesn't influence the result.

Superfastprint: 60,71 us/char
Superfastprint*: 63,56 us/char
Fastprint: 64,97 us/char

* with interrupt re-enabling

I think the results are quite good. Compared to another fast text output routine I've seen around, these are smaller, at least as fast, and more versatile (they can print almost the whole charset, and much more than a whole screen at once).

TFM

Nice piece of work!!!  :)

The best value I ever got was around 45 us for one character, but it slows down as the test gets more complex.

Displaying characters quick with FutureOS - YouTube

(starting at second 27 or so).
TFM of FutureSoft
Also visit the CPC and Plus users favorite OS: FutureOS - The Revolution on CPC6128 and 6128Plus

Ast

_____________________

Ast/iMP4CT. "By the power of Grayskull, i've the power"

http://amstradplus.forumforever.com/index.php
http://impdos.wikidot.com/
http://impdraw.wikidot.com/

All friends are welcome !

Prodatron

#3
Opqa, that's very impressive and much more useful than just copying the same single char hundreds of times over the whole screen.
It seems that your method is good for printing complete screens of "real" text. I wonder what the performance is for small text portions?
You probably know this:
Programming:Fast Textoutput - CPCWiki
It works with interrupts enabled and has an average speed of 65 NOPs/char, which is the same speed as your IRQ-friendly Fastprint routine. It requires some memory but reaches this average speed for small text portions, too. I wonder what your routine would reach if it is used for short texts? Unfortunately I haven't had time for a closer look at it yet.

CU,
Prodatron

GRAPHICAL Z80 MULTITASKING OPERATING SYSTEM

opqa

#4
You're right, I meant to write about this but couldn't until now. In its current state it scales down really badly: I did all the tests and optimizations on whole-screen printing, so there is room for improvement here. I've been experimenting with versions that are faster for smaller texts but can't print as many characters at a time. In the end, though, the routine is only going to be good for printing relatively long chunks of text, because it relies on the fact that printing along the same character line is fast, while changing from one line to another is slow.

It all depends on the use. Your routine is great for general-purpose use, but it can't print the whole charset and it is a little larger than mine (3,5 KB vs ~2,3 KB).

My routines are good for printing large chunks of text and the occasional small portion (the speed per character is lower, but printing a small text is always fast in absolute terms). But they suck at printing many small portions of text one after another in quick succession.

@TFM
Your video is also impressive. What printing technique are you using, char per char or line per line? Is there source code available?

PS: If anyone is interested, the "core" of the routines is simple. For the interrupt-friendly version:

ld a,(bc)
ld l,a
ldd

Here BC holds the pointer to the text buffer, H the high byte of the charset address (its 3 LSBs select the character line being printed), and DE the screen address. The routine prints the text backwards, from right to left. The lower bound for this routine is 8 us/byte (2 + 1 + 5), so 64 us/char.
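A small Python model may make the three-instruction core easier to follow. It is a sketch under assumptions, not the attached source: the font is assumed to be stored row-major (`row*256 + char`), BC and DE are modelled as plain indices, the `while` loop stands in for whatever unrolling the real routine uses, and `print_row_backwards` and `font_rows` are hypothetical names.

```python
NOPS = {"ld a,(bc)": 2, "ld l,a": 1, "ldd": 5}   # CPC timings, 1 NOP = 1 us


def print_row_backwards(text, font_rows, row, screen, last_addr):
    """Model one pixel row: repeat `ld a,(bc) : ld l,a : ldd` once per character."""
    bc = len(text) - 1      # BC: points at the last character of the text
    de = last_addr          # DE: rightmost screen byte of this pixel row
    elapsed = 0
    while bc >= 0:
        a = text[bc]                               # ld a,(bc)
        l = a                                      # ld l,a  (H selects the row page)
        screen[de] = font_rows[row * 256 + l]      # ldd: (DE) <- (HL)
        de -= 1                                    #      DE--
        bc -= 1                                    #      BC--  (HL-- too; L is reloaded)
        elapsed += sum(NOPS.values())
    return elapsed


# Row-major test font: the byte for (row, char) is just a recognisable pattern.
font_rows = bytes((row * 31 + char) & 0xFF for row in range(8) for char in range(256))
screen_row = bytearray(80)
us = print_row_backwards(b"CPC", font_rows, 2, screen_row, last_addr=2)
print(us)                   # 3 characters x 8 us for this one pixel row
```

The point of the design is that `ldd` does three jobs at once: it copies the font byte, steps the screen pointer, and steps the text pointer through its BC decrement.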

And for the faster version:

pop bc
ld l,c
ldi
ld l,b
ldi

Here SP holds the pointer to the text buffer, and H the charset base address and DE the screen address as before. This one prints the text forwards, in pairs of bytes. As a lower bound it takes 15 us/pair (3 + 1 + 5 + 1 + 5), so 7,5 us/byte -> 60 us/char.
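The arithmetic behind that bound, and how close the measured figure comes to it, can be checked with a short Python tally. The timings are the usual CPC values in NOPs (1 NOP = 1 us); `pop_pair` is a hypothetical helper that only illustrates the byte order of `pop bc`.

```python
NOPS = {"pop bc": 3, "ld l,c": 1, "ld l,b": 1, "ldi": 5}   # CPC timings in us
CORE = ["pop bc", "ld l,c", "ldi", "ld l,b", "ldi"]


def pop_pair(buffer: bytes, sp: int):
    """`pop bc` reads two bytes: C from the lower address, B from the higher one."""
    return buffer[sp], buffer[sp + 1], sp + 2      # (C, B, new SP)


us_per_pair = sum(NOPS[op] for op in CORE)
us_per_char = us_per_pair / 2 * 8                  # 8 pixel rows per character
measured = 60.71                                   # superfastprint, from the results list
overhead = round(measured - us_per_char, 2)

c, b, sp = pop_pair(b"HI", 0)
print(us_per_pair, us_per_char, overhead)
print(chr(c), chr(b))       # C holds the first character, hence `ld l,c` comes first
```

On these numbers everything outside the core costs well under 1 us per character on a full screen.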

TFM

Quote from: opqa on 12:15, 27 January 15
@TFM
Your video is also impressive. What printing technique are you using, char per char or line per line? Is there source code available?


Hi! Well, I basically use control codes, which are quick when printing the same character multiple times in X or Y. With increasing diversity in the text it gets a bit slower. However, you can use all 256 characters and you can change them; it's not fixed to a few unchangeable characters...


Yours is actually very impressive. And printing out a whole page is what actually happens in applications, games and demos. Maybe not in word processors, but in this case the user is slower than the routine.  :laugh:


Great work!!!  :)

Gryzor

What are the normal speeds though, for comparison's sake?

TFM

Quote from: Gryzor on 19:02, 27 January 15
What are the normal speeds though, for comparison's sake?


The following page may give you an idea, even though it doesn't measure that alone:
Speedcheck - CPCWiki
(The most interesting part could be the display of a 64 KB hex dump.)
I don't know if there is a hex monitor for SymbOS, but Prodi can surely add that data.


Ast

Just for fun, and for my iMPdraw, I wrote my own routine to print an 8x8 char fast.
My routine takes 53 NOPs per char (8x8)... Maybe I can do better using the stack pointer, but I'm not sure  :-X

opqa

#9
I've been working on further optimizations to improve the speed for short texts. I've focused on the interrupt-friendly fastprint routine; the changes I've made don't alter the long-text speed or any other functionality. The only drawback is that one of the optimizations prevents porting it to ROM, but it is a minor one and could be undone.

These are the numbers for fastprint right now:

For 1 char:          370 us -> 370 us/char
For 10 chars:        946 us -> 94,6 us/char
For 40 chars:       2875 us -> 71,87 us/char
For 80 chars:       5475 us -> 68,44 us/char
For 160 chars:     10650 us -> 66,56 us/char
For 1000 chars:    65077 us -> 65,1 us/char
For 2000 chars:   129866 us -> 64,93 us/char

So now I think it scales down reasonably well; it becomes competitive from about one line of text upwards. Maybe it can be improved further, I don't know. New sources are attached to this post.
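One way to read these figures is to separate the fixed cost per call from the cost of each additional character. The Python sketch below does that from the totals listed in this post. It assumes a roughly linear cost model, which the routine does not promise, so the fixed-overhead figure is only an estimate.

```python
# (characters printed, total time in us), copied from the list in this post
MEASURED = [(1, 370), (10, 946), (40, 2875), (80, 5475),
            (160, 10650), (1000, 65077), (2000, 129866)]

averages = [total / n for n, total in MEASURED]

# Marginal cost: extra time per extra character between consecutive measurements.
marginals = [(t2 - t1) / (n2 - n1)
             for (n1, t1), (n2, t2) in zip(MEASURED, MEASURED[1:])]

# Rough fixed cost per call, assuming every character costs about the marginal rate.
fixed_overhead = MEASURED[0][1] - marginals[-1]

for (n, _), avg in zip(MEASURED, averages):
    print(f"{n:5d} chars: {avg:7.2f} us/char")
print([round(m, 2) for m in marginals])
print(round(fixed_overhead))
```

The marginal cost stays between 64 and 65 us/char across the whole range, so the high per-character averages for short texts come almost entirely from the fixed setup cost.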

Quote from: Ast on 04:47, 28 January 15
Just for the fun and for my iMPdraw, i do my own routine to print a fast char 8x8.
My routine takes 53 nops per char (8x8).... Maybe I can do better using the stack pointer but i'm not sure  :-X

This number is impressive, almost unbelievable. Is it really for a general 8x8 char? Could you paste the relevant part of the source code?

Ast

#10
No problem, I'll post it tonight so you'll see...
:D


Edit: And yes, it displays an 8x8 char (mode 2), so one mode 2 char with 8 lines.
I coded it last night for iMPdraw's text display.

Prodatron

I am really curious about it! :)


Ast

As promised, here is my printing routine...
All you need to know is that the font has to be converted first.



afftxt                  ; in: A = character to print
            ld bc,#C000 ; screen address (start of screen memory)
            ld de,fnt   ; fnt = address where your converted font is loaded
            sub 32      ; the font starts at the space character
            ld h,0
            ld l,a      ; HL = character index
            add hl,hl   ; x2
            add hl,hl   ; x4
            add hl,hl   ; x8
            add hl,de   ; HL = address of this character in the font
            ex de,hl    ; font address in DE
            ld h,b
            ld l,c      ; screen address in HL
;
;          here comes the display
;
;
            ld a,(de)
            ld (hl),a     ; #c0xx
            inc de
            set 3,h      ; #C8xx
            ld a,(de)
            ld (hl),a
            inc de
            set 4,h      ; #D8xx
            ld a,(de)
            ld (hl),a
            inc de
            res 3,h     ; #D0xx
            ld a,(de)
            ld (hl),a
            inc de
            set 5,h     ; #F0xx
            ld a,(de)
            ld (hl),a
            inc de
            set 3,h     ; #F8xx
            ld a,(de)
            ld (hl),a
            inc de
            res 4,h    ; #E8xx
            ld a,(de)
            ld (hl),a
            inc de
            res 3,h    ; #E0xx
            ld a,(de)
            ld (hl),a
            ret

It's possible to save a few us by replacing some of the inc de with inc e.
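The set/res sequence in the listing visits the eight pixel lines in Gray-code order rather than top to bottom, which is presumably why the font has to be converted. The Python sketch below traces the sequence and reorders one glyph accordingly. The function names are hypothetical, and the conversion shown is an assumption about what the converted font looks like, not the actual iMPdraw converter.

```python
def set_res_line_order():
    """Follow the set/res sequence on H and record which pixel line each step hits."""
    h = 0xC0                                   # high byte of #C000
    steps = [("set", 3), ("set", 4), ("res", 3), ("set", 5),
             ("set", 3), ("res", 4), ("res", 3)]
    order = [(h >> 3) & 7]                     # bits 3..5 of H = line inside the char row
    for op, bit in steps:
        h = h | (1 << bit) if op == "set" else h & ~(1 << bit)
        order.append((h >> 3) & 7)
    return order


def convert_glyph(glyph: bytes) -> bytes:
    """Reorder the 8 rows of one glyph into the order the routine writes them."""
    return bytes(glyph[line] for line in set_res_line_order())


print(set_res_line_order())                    # same as the 3-bit Gray code k ^ (k >> 1)
print(convert_glyph(bytes(range(8))).hex())
```

Because consecutive Gray-code values differ in a single bit, each step to the next screen line needs only one 2-NOP `set` or `res` instead of a 16-bit addition.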


Have a good fun !


Prodatron

But that's 8x8 NOPs for the core part (LD A,(DE):LD (HL),A:INC DE:RES/SET x,H:...) plus a lot more NOPs for all the stuff around it.
How do you get to 53?


Ast

#14
Actually, at first I only used inc e, so 56 NOPs... sorry for the mistake!  :laugh:
Hi Prodatron! Isn't that correct?


7x8=56 NOPs


Edit: Mixing inc e/inc de costs only 4 NOPs more,
so 60 us to print any char...


Or you can do it this way:

pop hl ; address
ldi       ; transfer
pop hl
ldi      ; repeated 8 times

but that's 64 us...

Prodatron

Yes, you can use INC E instead of INC DE, as each char is probably 256-byte aligned. So now it's 56 NOPs for the core part. But you still have all the overhead around it, like calculating the address of the matrix for each new char, jumping back to the next screen address, etc.
The interesting value is the true average time for one char, which includes really everything, even the loop code.
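As a rough illustration of that point, the Python tally below adds up the afftxt listing posted earlier in the thread, using the usual CPC timings in NOPs and the `inc e` variant of the core. It still leaves out the CALL, any loop, and the work of moving to the next screen position, so the real per-character figure would be higher.

```python
NOPS = {  # usual CPC timings, 1 NOP = 1 us
    "ld bc,nn": 3, "ld de,nn": 3, "sub n": 2, "ld h,n": 2, "ld l,a": 1,
    "add hl,hl": 3, "add hl,de": 3, "ex de,hl": 1, "ld h,b": 1, "ld l,c": 1,
    "ld a,(de)": 2, "ld (hl),a": 2, "inc e": 1, "set/res n,h": 2, "ret": 3,
}

SETUP = ["ld bc,nn", "ld de,nn", "sub n", "ld h,n", "ld l,a",
         "add hl,hl", "add hl,hl", "add hl,hl", "add hl,de",
         "ex de,hl", "ld h,b", "ld l,c"]
LINE = ["ld a,(de)", "ld (hl),a", "inc e", "set/res n,h"]
LAST_LINE = ["ld a,(de)", "ld (hl),a"]


def cost(ops):
    """Total NOPs for a list of instructions."""
    return sum(NOPS[op] for op in ops)


core = 7 * cost(LINE) + cost(LAST_LINE)        # seven full lines plus the last one
setup = cost(SETUP)                            # address calculation before the display
total = setup + core + NOPS["ret"]
print(core, setup, total)
```

The address calculation and the return add roughly half as much again on top of the 53-NOP core.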

Opqa's code is very impressive, as it's a completely new idea and, for large texts, as fast as my method or even faster when using SP. What about adding it to the Wiki article? With my routine a full 256-character charset is possible if the average speed drops from 65 to 67 NOPs.


opqa

#16
I like Ast's idea a lot. You've used the same set/res technique as in the routine by Prodatron et al. I experimented with it in my own routines but it didn't fit well.

But... I've been thinking about Ast's routine, and it has the potential to become at least as fast as my routine, but without those downscaling problems.

What I would do is combine Ast's charset arrangement with mine. That is, the character's ASCII code takes the whole low byte of the address, and the line inside the character takes the three LSBs of the high byte, but "disordered" as in Ast's routine. In that case "inc de" can always be replaced by "inc d"; the only requirement is that the charset is 256-byte aligned.

The core code would be like this:


B - Charset base address
DE - Screen address
HL - Text buffer address

; Build the next character address
ld b,CHARSET_BASE ; 2
ld c,(hl) ; 2               
             ; 4 until now

; Start printing
ld a,(bc)  ; 2
ld (de),a  ; 2
inc b      ; 1
set 3,d    ; 2

ld a,(bc)
ld (de),a
inc b
set 4,d

ld a,(bc)
ld (de),a
inc b
res 3,d

ld a,(bc)
ld (de),a
inc b
set 5,d

ld a,(bc)
ld (de),a
inc b
set 3,d

ld a,(bc)
ld (de),a
inc b
res 4,d

ld a,(bc)
ld (de),a
inc b
res 3,d

ld a,(bc) ; 2
ld (de),a ; 2
res 5,d   ; 2
          ; 7x7 + 6 = 55
          ; 59 until now

; Increase screen address
inc de    ; 2
; Increase text buffer address
inc hl    ; 2

          ; TOTAL = 63us


The loop overhead still has to be added, and there are many possible variations: the loop can be partially unrolled as in my routines, which reduces its impact on long texts while penalizing short ones, or it can be reduced to just "dec ixl: jp nz,beginning", which adds 5 extra us per char.
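To check that the combined layout really puts every row on the right screen line, the routine can be modelled in Python. This is a sketch under assumptions: the source font is taken to be char-major (`code*8 + row`), page k of the hybrid charset is assumed to hold the row that the k-th step writes, the screen is a plain 16 KB array with pixel lines 0x800 bytes apart, and all function names are hypothetical.

```python
GRAY = [k ^ (k >> 1) for k in range(8)]        # 0, 1, 3, 2, 6, 7, 5, 4


def build_hybrid_charset(rom: bytes) -> bytes:
    """Page k (selected by `inc b`) holds the pixel row written at step k."""
    out = bytearray(256 * 8)
    for code in range(256):
        for k, row in enumerate(GRAY):
            out[k * 256 + code] = rom[code * 8 + row]
    return bytes(out)


def print_char(screen: bytearray, offset: int, charset: bytes, code: int) -> int:
    """Model of the unrolled core; returns the final line bits of D (0 = restored)."""
    ops = [("set", 0), ("set", 1), ("res", 0), ("set", 2),
           ("set", 0), ("res", 1), ("res", 0), ("res", 2)]   # bits 3,4,5 of D as 0,1,2
    line = 0
    for k, (op, bit) in enumerate(ops):
        screen[line * 0x800 + offset] = charset[k * 256 + code]   # ld a,(bc) : ld (de),a
        line = line | (1 << bit) if op == "set" else line & ~(1 << bit)
    return line


rom = bytes((code * 8 + row) * 5 & 0xFF for code in range(256) for row in range(8))
screen = bytearray(0x4000)
final_line = print_char(screen, 5, build_hybrid_charset(rom), ord("A"))
print(final_line, [screen[r * 0x800 + 5] for r in range(8)] == list(rom[65 * 8:65 * 8 + 8]))
```

The final `res 5,d` brings D back to the first pixel line, so a single `inc de` is enough to reach the next character cell.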

EDIT:
Here is the previous routine using the stack (not interrupt friendly): 61,5 us/char plus loop overhead, if my sums are right. There is also some overhead from the code that moves the stack pointer, but that is paid only once, not per char.

H   - Charset base address
DE - Screen address
SP - Text buffer address

pop bc            ; 3
ld h,CHARSET_BASE ; 2
ld l,c            ; 1
                   ; 6 until now
ld a,(hl)  ; 2
ld (de),a  ; 2
inc h      ; 1
set 3,d    ; 2

ld a,(hl)
ld (de),a
inc h
set 4,d

ld a,(hl)
ld (de),a
inc h
res 3,d

ld a,(hl)
ld (de),a
inc h
set 5,d

ld a,(hl)
ld (de),a
inc h
set 3,d

ld a,(hl)
ld (de),a
inc h
res 4,d

ld a,(hl)
ld (de),a
inc h
res 3,d

ld a,(hl) ; 2
ld (de),a ; 2
res 5,d   ; 2
          ; 7x7 + 6 = 55
              ; 61 until now
; Increase screen address
inc de    ; 2
; Build the next character address
ld h,CHARSET_BASE ; 2
ld l,b            ; 1
             ; 66 until now
ld a,(hl)  ; 2
ld (de),a  ; 2
inc h      ; 1
set 3,d    ; 2

ld a,(hl)
ld (de),a
inc h
set 4,d

ld a,(hl)
ld (de),a
inc h
res 3,d

ld a,(hl)
ld (de),a
inc h
set 5,d

ld a,(hl)
ld (de),a
inc h
set 3,d

ld a,(hl)
ld (de),a
inc h
res 4,d

ld a,(hl)
ld (de),a
inc h
res 3,d

ld a,(hl) ; 2
ld (de),a ; 2
res 5,d   ; 2
          ; 7x7 + 6 = 55
          ; 121 until now
inc de ; 2
          ; TOTAL = 123 per character pair


Actually this version is not so interrupt-unfriendly. The text buffer is only read once, so if you treat it as single-use, the routine can run without disabling interrupts: an interrupt pushes its data just below SP, which only overwrites text that has already been read. You just need to reserve some spare bytes before the beginning of the text buffer.
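A short Python model shows why reading through SP survives interrupts as long as the buffer is treated as disposable. It is a sketch under assumptions: an interrupt is made to fire before every `pop bc`, the handler is assumed to push at most `push_depth` bytes, and the function name and the 0xEE junk value are hypothetical.

```python
def read_pairs_with_interrupts(memory: bytearray, text_start: int, n_pairs: int,
                               push_depth: int = 6) -> bytes:
    """Read text via SP while an interrupt fires before every `pop bc`."""
    sp = text_start
    seen = bytearray()
    for _ in range(n_pairs):
        # Interrupt: return address and saved registers are pushed just below SP.
        for i in range(1, push_depth + 1):
            memory[sp - i] = 0xEE              # hypothetical junk left by the handler
        seen += memory[sp:sp + 2]              # pop bc: two bytes at SP, then SP += 2
        sp += 2
    return bytes(seen)


text = b"FAST MODE 2 TEXT"                     # even length: read in pairs
spare = 6                                      # spare bytes reserved before the text
memory = bytearray(spare) + bytearray(text)
result = read_pairs_with_interrupts(memory, spare, len(text) // 2, push_depth=spare)
print(result == text)                          # every character was read intact
print(bytes(memory[spare:]) == text)           # but the buffer itself is now damaged
```

The spare bytes must be at least as many as the deepest push the interrupt handler performs, otherwise an interrupt before the first `pop` would write below the reserved area.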

Ast

Inc E has to work in every case, as each char is 256-byte aligned.
I didn't say that Opqa's code is not impressive, I only said my print routine could be faster, that's all... I just want to help out in this topic, no more.  ;D


So when you do the count it's 53 us, no more!!!!


Here is the calculation:
ld a,(de):ld (hl),a:set/res:inc e ; 7 us
ld a,(de):ld (hl),a               ; 4 us


->(7x7)+4 = 53!

Prodatron

@Ast: Yes, that's the time for the core part. But for printing text strings you need to include all other stuff around it as well to get a realistic result.

@Opqa: Wow, I like this new solution very much! 68 or 66,5 NOPs [if using SP] per char, but static (it doesn't depend on the size of the text anymore)! (MaV's and my routine would take 65, or 67 NOPs [if addressing all 256 chars].)

Maybe it's time to extend the Wiki article. TBH I love these hardcore CPC/Z80 optimization topics and discussions, thanks a lot! :)


opqa

Well, here's the new idea made real. I've implemented the "simple" version, which doesn't use the stack. It's much better than all the previous routines posted in this thread: a little slower for long texts, but still very fast, and the visual behaviour is better.

It is a single routine that can be assembled to support either "short" texts of up to 256 characters or much longer ones. You choose between the two with the assembly variable LONGTEXT. The long-text variant has a small additional initial overhead for setting up the counters and an even smaller one for the outer loop counter (once every 256 chars, so the impact is minimal).

pmeier

#20
Is there a chance to get newfasttext working for MODE 1?

I'm just programming a very simple game, but now I want to draw a level faster. And this routine looks perfect...

UPDATE: I found cpct_drawStringM1_f in CPCtelera. Unfortunately it's a little bit too slow. (1s for the whole screen.)
You see, I try to do my homework ;-) Any comments appreciated.

pmeier

#21
Sorry to ask again, but I was not able to speed up cpct_drawStringM1_f() from CPCtelera. I tried hardcoding the foreground and background colors. This did not improve the speed.

And modifying newfasttext is beyond my skills. I studied http://cpctech.cpc-live.com/source/sixpix.html which does a mode 2 to mode 1 conversion.

Background: I'm writing a little BASIC game, that displays the MODE 1 text levels with cpct_drawStringM1_f(). It's already quite playable, but faster level switching would improve the fun a lot...

ronaldo

#22
Quote from: pmeier on 21:17, 04 May 18
Is there a chance to get newfasttext working for MODE 1?

I'm just programming a very simple game, but now I want to draw a level faster. And this routine looks perfect...

UPDATE: I found cpct_drawStringM1_f in CPCtelera. Unfortunately it's a little bit too slow. (1s for the whole screen.)
You see, I try to do my homework ;-) Any comments appreciated.
Drawing mode 1 text and mode 2 text is different, because the pixel encoding is totally different. cpct_drawStringM1_f is optimized for speed, and it's quite fast compared to other similar routines. If you want to draw text much faster, you probably need to switch to drawing sprites and create a custom font made of sprites. Drawing coloured text from the ROM character definitions requires converting them to pixel values, and that takes CPU time in any mode other than mode 2.
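The extra conversion work can be illustrated with a small Python sketch. It assumes the standard mode 1 byte layout, in which each byte holds 4 pixels with bit 0 of each pen in the high nibble and bit 1 in the low nibble. `mode2_to_mode1` is a hypothetical helper for illustration, not a CPCtelera function.

```python
def mode2_to_mode1(glyph_byte: int, fg: int = 1, bg: int = 0):
    """Expand 8 one-bit pixels into two mode 1 bytes of 4 two-bit pixels each."""
    out = []
    for half in range(2):
        byte = 0
        for i in range(4):                          # pixel i of this half, left to right
            lit = (glyph_byte >> (7 - (half * 4 + i))) & 1
            pen = fg if lit else bg
            byte |= (pen & 1) << (7 - i)            # pen bit 0 -> bits 7..4
            byte |= ((pen >> 1) & 1) << (3 - i)     # pen bit 1 -> bits 3..0
        out.append(byte)
    return tuple(out)


print([hex(b) for b in mode2_to_mode1(0b10100000, fg=1)])
print([hex(b) for b in mode2_to_mode1(0xFF, fg=3)])
```

Every ROM glyph byte turns into two screen bytes that depend on the chosen pens, which is the per-character work a mode 2 routine never has to do. Precomputing the expanded glyphs once for a fixed pen pair is one way to remove it from the drawing loop.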

Another thing I don't understand: are you using text drawing routines to draw maps on screen? Why don't you use sprites for a game? Why use text? Is it that you are programming your game in BASIC and calling the text drawing routines as RSX commands?

pmeier

#23
My idea was, just code a simple BASIC game, with the knowledge I already had when I was 12. But then I found these nice assembler routines, which could speed up screen drawing.

Of course you wonder why I just don't code everything in assembler, use sprites etc. But frankly that's beyond my skills. I'm glad so far that I could adapt the method cpct_drawStringM1_f to MAXAM and use fonts from RAM.

My levels have only two colors at the moment. But my hack didn't speed up the code. And of course the level has many spaces. So there is plenty of room for optimization...

Thank you very much for your answer. Maybe I should rework the whole concept. (I don't use RSX, just calls to the assembler routines.)
And of course I already noticed that I have to double the pixels and the examples I found are also well commented, but still a huge challenge for me...

ronaldo

#24
Okay, I understand what you are doing. If that is your idea, then I would advise you to do a full-BASIC game as a start. Create your game, engine, animations, etc. in BASIC and finish your game. After finishing one complete, fully functional game you will be better prepared for greater challenges, like adding assembler routines as RSX commands, programming in C or using sprite drawing routines. For instance, to draw screen sprites (like you are doing with characters), CPCtelera's drawTile functions are lightning-fast. Adapting them for later projects could be a great improvement. But I think going step by step is a better approach and more rewarding.


You can also look at other approaches like 8BP, which is an RSX game engine for BASIC games. It could be a nice option for you too. In any case, I always advise going step by step, enjoying and learning at each phase, and not trying to progress too fast :) .


By the way, I reviewed CPCtelera string drawing routines and I worked out new ways to improve and make them much faster. Your comment gave me a great idea. Thank you :)
