Fast text output in mode 2, ~60,5 nops per character

arnolde · 14:55, 07 June 21

OK, so... I had this idea for a new way of printing characters on the screen...

I mean, honestly, I don't think I'm really the first one to think of this, but it's not mentioned anywhere (at least not in the places I know...). CPCWiki says Prodatron's routine is the fastest, but technically this one is a tiny bit faster (121735 nops for the whole screen -> 60.5 nops per character).

The way it works is quite simple: beforehand, the font data must be rearranged so that every (aligned) memory page contains just one scanline of each character, one character after the other.
When displaying, the text is looped through 8 times, once for each scanline, and only the respective scanline of the text is written to the screen. That way, the slow new-line calculation has to occur only 8 times (in this case it's just +#30, i.e. jumping over the non-displayed part of screen memory) and the rest is just "inc" (or ldi, for that matter).
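The font layout described above can be modelled off-line in a few lines of Python. This is only a sketch under assumptions: the source font is taken to be 256 characters of 8 bytes each, stored character by character, and the names `rearrange_font`, `font` and `paged` are made up for illustration.

```python
def rearrange_font(font: bytes) -> bytes:
    """Turn a char-by-char font (8 bytes per char, 256 chars) into
    8 pages of 256 bytes: page n holds scanline n of every character."""
    if len(font) != 256 * 8:
        raise ValueError("expected 256 characters of 8 bytes each")
    paged = bytearray(8 * 256)
    for char in range(256):
        for line in range(8):
            # page index = scanline, offset inside the page = character code
            paged[line * 256 + char] = font[char * 8 + line]
    return bytes(paged)


# Dummy font for demonstration only: each byte encodes (char + line) & 0xFF
font = bytes((char + line) & 0xFF for char in range(256) for line in range(8))
paged = rearrange_font(font)
```

With this layout the high byte of the font pointer selects the scanline and the low byte is simply the character code, so no multiplication by 8 is needed while printing.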

Of course, it has a lot of drawbacks: the text must be prepared as exactly 2000 contiguous bytes in memory. All empty positions have to be filled with " " (#20), and control characters won't work (#00 and #ff will even crash the routine badly). The font has to be prepared; there's no reading font data directly from ROM. (And I imagine that all this is reason enough why no serious programmer ever considered this approach really valuable...)
But, apart from speed, there are a few other advantages as well: You don't have to delete the screen first, because all empty positions are "drawn" as well. The routine is quite small in size (and could be extremely reduced in exchange for a little speed). It supports custom fonts – as long as they are arranged the way the routine needs it – and: All of the 256 characters can be used (OK, except for #00 and #ff as mentioned...)

So, would the experienced programmers please tell me who has already thought of this method and where it's applied? As I said, I'm sure I'm not the first one...

andycadley · 15:05, 07 June 21

It's not a million miles from how a lot of games render background "tiles" to be honest (although sometimes they use a line by line approach to sacrifice some speed to avoid tearing as much).

Obviously the downside is it means the whole screen needs to be text (or graphics constructed from font entries), but there are probably plenty of places that's true. And doesn't play so well with hardware scrolling, which might be more of a downside for text heavy data, tbh.

Not sure why character codes of #00 or #FF need to be problematic though?

zhulien · 15:28, 07 June 21

To me that doesn't sound like the fastest way to output text, but rather a very specific scenario of wanting to fill an entire screen of text fast.

What if you only wanted to print 'Hello, World!' as fast as possible?

Likely the fastest text routine for printing just 'Hello, World!' is still faster than 121735 nops.

Similarly for writing to the screen: I think the fastest way to write text to a screen is to detach it from the physical screen and write to a logical text screen that is linear and uncomplicated. Only when you want to display that logical screen on the physical screen do you need to convert to our beloved screen layout as fast as possible. Imagine a Protext document (which is likely a linked list of lines) as one form of virtual screen, but a virtual screen could also be a 2000-character block of RAM. How fast can you write to that? If you have multiple tasks running at once, each writing to its own 2000-character logical screen, you could press ctrl+1 (for example) to display logical screen 1 on the physical screen, ctrl+2 to display logical screen 2, etc. You can have very fast writing to a logical screen if you are not looking at it.
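The logical-screen idea can be sketched as follows. This is an illustration only: the class name `LogicalScreen`, its `write` method and the two-screen setup are hypothetical, and the physical display step is left out.

```python
COLS, ROWS = 80, 25  # mode 2 text dimensions: 2000 character cells


class LogicalScreen:
    """A linear, space-filled text buffer that is cheap to write to."""

    def __init__(self) -> None:
        self.cells = bytearray(b" " * (COLS * ROWS))

    def write(self, row: int, col: int, text: str) -> None:
        data = text.encode("ascii")
        start = row * COLS + col
        if row < 0 or col < 0 or start + len(data) > len(self.cells):
            raise ValueError("text does not fit on the logical screen")
        # No screen-address arithmetic here: just a linear copy
        self.cells[start:start + len(data)] = data


# Two tasks, each with its own logical screen (hypothetical setup)
screens = [LogicalScreen() for _ in range(2)]
screens[0].write(0, 0, "Hello, World!")
screens[1].write(24, 70, "Task 2")

# Whichever screen is selected would be handed to the display routine
visible = screens[0].cells
```

A buffer like this is always exactly 2000 bytes with spaces in the unused cells, which happens to be the input format the full-screen routine from the first post expects.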

arnolde · 15:57, 07 June 21

Quote from: andycadley on 15:05, 07 June 21
Not sure why character codes of #00 or #FF need to be problematic though?

Because of the LDI command for copying:
If C were to be 0, B would be decremented and the wrong 2nd character would be displayed
And if L is #ff, H would be incremented and we'd get the data for the next scanlines
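The two side effects can be checked with a small Python model of the five-instruction kernel (`pop bc : ld l,c : ldi : ld l,b : ldi`). It is a sketch, not an emulator: only the registers that matter are tracked, and the font page `0x40` and the function name `copy_pair` are arbitrary choices for illustration.

```python
def copy_pair(first: int, second: int, font_page: int = 0x40):
    """Return the two font addresses actually read for a character pair,
    plus the value of H afterwards."""
    b, c = second, first              # pop bc: C = first char, B = second char
    h = font_page
    reads = []
    for use_b in (False, True):
        l = b if use_b else c         # ld l,c  /  ld l,b
        reads.append((h << 8) | l)    # ldi reads from (HL)
        # ldi side effects: HL is incremented, BC is decremented (16 bit)
        hl = (((h << 8) | l) + 1) & 0xFFFF
        h = hl >> 8
        bc = (((b << 8) | c) - 1) & 0xFFFF
        b, c = bc >> 8, bc & 0xFF
    return reads, h


normal = copy_pair(0x41, 0x42)       # both reads stay in page #40
zero_first = copy_pair(0x00, 0x42)   # B is decremented: second read hits #41
ff_first = copy_pair(0xFF, 0x42)     # H is incremented: second read hits page #41
```

In this model a #ff in the second position also leaves H one page too high, which would affect every following pair on that scanline.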

arnolde · 16:08, 07 June 21

Quote from: zhulien on 15:28, 07 June 21
To me that doesn't sound like the fastest way to output text, but rather a very specific scenario of wanting to fill an entire screen of text fast.

Of course. I'm afraid I didn't mention clearly enough that this is indeed for a very specific scenario, i.e. filling the whole CPC screen with characters in mode 2. In bragging about 121735 / 60.5 nops, I just wanted to compare with the speed figures of the other methods listed in the "Source Code" section of the Wiki.

But, as I stated myself, you are right: in most cases this is not very practical, because laying out a text to fit into 2000 bytes would probably need more computation time than using a char-by-char routine and processing carriage returns etc. on the fly.

The routine could of course be adapted to print just one line or a fixed number of characters at a specific position on the screen, and I guess it would not do badly in the speed competition either.

arnolde · 16:20, 07 June 21

Quote from: zhulien on 15:28, 07 June 21
What if you only wanted to print 'Hello, World!' as fast as possible?

Code:


hello_text: db "Hello, world! "

fast_print_mode2:
di                     ;b/c we mess with the stack
ld iy,hello_text
ld (spgarage),sp       ;backup sp
ld de,#c000            ;start of screen memory
ld h,font_table/#100   ;1st line of symbol data
  lineloop:
  ld sp,iy             ;every newline -> restart text from beginning
      repeat 7      
      pop bc       ;3  ;pop fetches two characters at once
      ld l,c       ;1  ;point HL to font data of 1st character
      ldi          ;5  ;transfer to screen
      ld l,b       ;1  ;same with the/
      ldi          ;5  ;/other character
      rend             
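  inc h                ;next line in font memory (is aligned)
  ld a,e:sub 14:ld e,a ;back to the start of the text (14 bytes were written)
  ld a,d:add 8:ld d,a  ;next scanline is #800 further on; d wraps from #f8 to 0
  jp nz,lineloop       ;and if d=0 we're done
ld sp,(spgarage)       ;restore sp
ei
ret

spgarage: dw 0         ;storage for the saved stack pointer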

~70 nops/char... not horribly slow either...
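The ~70 figure can be reproduced by adding up the nops. The counts for the inner loop come from the comments in the listing; the counts for the remaining instructions are assumed from the usual CPC timing tables (1 nop = 1 microsecond) and are not stated in the thread.

```python
# Inner loop, annotated in the listing: pop bc, ld l,c, ldi, ld l,b, ldi
PAIR = 3 + 1 + 5 + 1 + 5

# Per-scanline overhead (assumed timings)
LINE_OVERHEAD = (
    3            # ld sp,iy
    + 1          # inc h
    + 1 + 2 + 1  # ld a,e : sub 14 : ld e,a
    + 1 + 2 + 1  # ld a,d : add 8 : ld d,a
    + 3          # jp nz,lineloop
)

# One-off setup and teardown (assumed timings):
# di, ld iy,nn, ld (nn),sp, ld de,nn, ld h,n, ld sp,(nn), ei
SETUP = 1 + 4 + 6 + 3 + 2 + 6 + 1

CHARS = 14                                   # length of "Hello, world! "
per_line = (CHARS // 2) * PAIR + LINE_OVERHEAD
total = 8 * per_line + SETUP
per_char = total / CHARS
print(total, round(per_char, 1))
```

The inner loop alone costs 15 nops per pair, i.e. 60 nops per character over 8 scanlines; the rest is loop overhead, which weighs more heavily on a 14-character string than on a full screen.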

andycadley · 16:31, 07 June 21

Quote from: arnolde on 15:57, 07 June 21
Because of the LDI command for copying:
If C were to be 0, B would be decremented and the wrong 2nd character would be displayed
And if L is #ff, H would be incremented and we'd get the data for the next scanlines

Is LDI the fastest way here though? You don't really need 16-bit incrementing because you know it will only change the high byte every 256 characters. My gut says this can be unrolled differently to cope with that and get a better overall performance though I haven't looked hard enough at timings to confirm...

arnolde · 16:36, 07 June 21

Quote from: andycadley on 16:31, 07 June 21
Is LDI the fastest way here though?

Well, as far as I understand it, without LDI we'd need 4 nops for the data transfer and 2 for Inc E and L, that's 6 while LDI takes only 5 nops. But there might be another way?
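The comparison above as plain arithmetic. The 4 + 2 split is the one given in the post; mapping it to `ld a,(hl)`, `ld (de),a`, `inc e` and `inc l` is an assumption made for this sketch.

```python
LDI = 5                   # nops for one ldi
MANUAL = 2 + 2 + 1 + 1    # ld a,(hl) : ld (de),a : inc e : inc l (assumed split)

BYTES_PER_SCREEN = 2000 * 8   # 2000 characters, 8 scanlines each

extra_per_byte = MANUAL - LDI
extra_per_screen = extra_per_byte * BYTES_PER_SCREEN
extra_per_char = extra_per_screen / 2000
```

On these figures the manual copy would cost 8 nops more per character over a full screen.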

m_dr_m · 17:58, 07 June 21

You are right, many thought of that! Typically it's not worth it. An alternative way is to arrange the font in Gray order so you only have to RES/SET one bit to change line.
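What "Gray order" buys can be shown with the standard reflected Gray code. This is one reading of the remark, not a description of how Orgams actually does it: if the 8 lines are visited in Gray order, each step to the next line changes exactly one bit of the line number, so a single SET or RES on the relevant address bit is enough.

```python
def gray(n: int) -> int:
    """Standard reflected binary Gray code."""
    return n ^ (n >> 1)


# Order in which the 8 lines would be visited
order = [gray(i) for i in range(8)]

# For each step, work out which single bit changes and in which direction
steps = []
for current, following in zip(order, order[1:]):
    changed = current ^ following
    assert bin(changed).count("1") == 1      # exactly one bit differs
    bit = changed.bit_length() - 1
    steps.append(("SET" if following & (1 << bit) else "RES", bit))
```

Here `order` comes out as 0, 1, 3, 2, 6, 7, 5, 4, and `steps` lists the one-bit operation needed between neighbours.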

Orgams' routine is a bit slower than Prodatron's & al:
- 57 Nops for space.
- 75 Nops for other chars.
- So ~72 Nops on average for English text (and less for assembly).
This includes string management and properly looping at C7FF->C800 when screen has scrolled.

That's still very fast, as the 'type' command in the monitor would demonstrate.

I completely agree with @zhulien : your routine would only be faster in scenarios ... that might seldom occur.
A fast text-line clear routine is ~8 Nops per character.
If a string is 60 chars long rather than 80, Prodatron's would take 80*8 + 60*65 (4540) < 80*60.5 (4840).
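The same comparison worked through, using only the per-character figures quoted in the post (8 nops to clear, 65 nops to print, 60.5 nops for the full-line routine). The names are made up for this sketch.

```python
LINE_WIDTH = 80
CLEAR = 8          # nops per character position for a fast line clear
PER_CHAR = 65      # nops per printed character (figure used in the post)
FULL_LINE = 60.5   # nops per character for the full-screen routine


def clear_and_print(n_chars: int) -> int:
    """Cost of clearing a whole line, then printing n_chars characters."""
    return LINE_WIDTH * CLEAR + n_chars * PER_CHAR


full_line_cost = LINE_WIDTH * FULL_LINE
cost_60 = clear_and_print(60)

# Longest string for which clear-then-print still beats the full-line routine
break_even = max(n for n in range(LINE_WIDTH + 1)
                 if clear_and_print(n) < full_line_cost)
```

On these numbers, clearing and printing wins for any line of up to 64 characters; only nearly full lines favour the fixed 80-column approach.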

Now, if the line is already cleared, displaying a space becomes even faster: you just skip it, lowering the average per char even further.

Related: http://www.cpcwiki.eu/forum/programming/fast-mode-2-text-printing-routines/

GUNHED · 19:04, 07 June 21

60 NOPs is very nice, even with FutureOS the minimum is 45 NOPs per character.

arnolde · 19:22, 07 June 21

Quote from: m_dr_m on 17:58, 07 June 21
Related: http://www.cpcwiki.eu/forum/programming/fast-mode-2-text-printing-routines/

Thank you for this link, I was looking for a thread like that before posting this but I apparently used the wrong keywords... I will devour it later!

Quote
You are right, many thought of that! Typically it's not worth it.

Yeah, I'm starting to understand that. I played around a little with ways to prepare the layout, but even if I only leave the loop when finding #0D#0A (and fill the rest of the line with " " instead), it takes too much time and "my" display method is slower in the end.

arnolde · 19:53, 07 June 21

Quote from: m_dr_m on 17:58, 07 June 21
http://www.cpcwiki.eu/forum/programming/fast-mode-2-text-printing-routines/

OK. His main routine is literally the same as mine:

Code:

pop bc
ld l,c
ldi
ld l,b
ldi

But he also has a quite clever iteration calculation so he can display all lengths and not just a whole page.

m_dr_m · 05:04, 08 June 21

Quote from: GUNHED on 19:04, 07 June 21
60 NOPs is very nice, even with FutureOS the minimum is 45 NOPs per character.

Impressive. I wonder how you get this number. It seems either too big, too small or irrelevant

Now on a related note, I plan to make the ultimate hex editor (see https://www.cpcwiki.eu/forum/applications/monogams-to-behave-more-like-a-file-based-hex-editor/),
with a memory dump so fast your CPC will age slower than you.
I also plan to use a dedicated mini font (e.g. height either 6 or 7 lines rather than 8) for more real estate.

GUNHED · 12:42, 08 June 21

Quote from: m_dr_m on 05:04, 08 June 21
Impressive. I wonder how you get this number. It seems either too big, too small or irrelevant

All explained in the topic you posted before. In brief: smart control codes.

Quote from: m_dr_m on 05:04, 08 June 21
I also plan to use a dedicated mini font (e.g. height either 6 or 7 lines rather than 8) for more real estate.

Funny! With FutureTex I use 9 scanlines and the middle 7 of them are for character data (for most chars), just to improve legibility. You must definitely have eagle eyes.

McArti0 · 19:05, 06 August 23

When we use a stack, we must turn off Interrupts.
A not much slower method is:

LD A,(BC)
LD L,A
LDD
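A Python model of this three-instruction kernel shows why LDD fits: it decrements BC as well as HL and DE, so BC can double as the text pointer when the string is printed from its last character backwards. The memory addresses, the dummy font pattern and the function name `print_scanline` are assumptions for the sketch; the nop counts (2 + 1 + 5) are taken from the usual CPC timing tables.

```python
def print_scanline(mem, text_end, n_chars, screen_end, font_page):
    """Model n_chars repetitions of: ld a,(bc) : ld l,a : ldd"""
    bc, de, h = text_end, screen_end, font_page
    for _ in range(n_chars):
        a = mem[bc]                  # ld a,(bc): fetch character code
        hl = (h << 8) | a            # ld l,a: index into the font page
        mem[de] = mem[hl]            # ldd: copy one font byte to the screen...
        hl = (hl - 1) & 0xFFFF       # ...then HL, DE and BC all decrement
        de = (de - 1) & 0xFFFF
        bc = (bc - 1) & 0xFFFF
        h = hl >> 8
    return bc, de, h


mem = bytearray(65536)
for code in range(256):
    mem[0x9000 + code] = code ^ 0xFF        # dummy font page, one scanline
text = b"HELLO"
mem[0x4000:0x4000 + len(text)] = text       # string at an arbitrary address

bc, de, h = print_scanline(mem, 0x4004, len(text), 0xC004, 0x90)

NOPS_PER_BYTE = 2 + 1 + 5                   # ld a,(bc) : ld l,a : ldd
NOPS_PER_CHAR = 8 * NOPS_PER_BYTE           # 8 scanlines per character
```

That is 64 nops per character against 60 for the stack-based inner loop, without touching SP. In this model a character code of 0 would make the HL decrement borrow from H, so it is best avoided here too.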

Full procedure for a real string of up to 255 chars:

Code:

;;Call &9800,"Text",first char screen address
;;Call &9c5c - convert after SYMBOL AFTER 0:MEMORY &8FFF
;;Call &9c8c - convert from ROM to &9000 (MEMORY &8FFF needed)
org #9800
LD hl,ending
push hl

ld h,(IX+3)
ld l,(IX+2)
push hl
ld a,(HL) ;count chars text
ld l,a
ld h,0
add hl,hl
add hl,hl ;x4
ld b,h
ld c,l
ld hl,start
sbc hl,bc
ld (n_chars_start+1),hl

pop hl

inc hl
ld c,(HL)
inc hl
ld b,(HL)
;ld bc,starttext
ld l,a
ld h,0
add hl,bc
dec hl
ld b,h
ld c,l
;ld bc,endtext
push bc

;ld de,start_text_screen_adr
ld h,0
ld l,a
add hl,de

dec hl
push hl
;ld de,end_text_screen_adr

ld de,&800

push bc
set 3,h
push hl

push bc
add hl,de
push hl

push bc
set 3,h
push hl

push bc
add hl,de
push hl

push bc
set 3,h
push hl

push bc
add hl,de
push hl

set 3,h

ld ixl,d ; Set line count 8

ld d,h
ld e,l

ld hl,#9700 ; 8 page of 8 pages chars table

jp n_chars_start

.max255_char
repeat 255
 ld a,(bc)
 ld l,a
 ldd ;hl->de
rend

;0
.start

dec IXL ;counter line up 8 to 1 down

ret z

pop de
pop bc

dec h ; dec num of pages

.n_chars_start
jp start ;or start - 4x count chars

.ending
ret

.db8
db 0,0,0,0 ;Because 9C5C is a nice address

;org 9c5c
;;Call &9c5c - convert after SYMBOL AFTER 0:MEMORY &8FFF
.convertSymbolAfter0
LD HL,#9EFC
LD BC,#2000
LD DE,#9000
ld a,0 ;256
.convLoop
LDI
dec e
inc d
LDI
dec e
inc d
LDI
dec e
inc d
LDI
dec e
inc d
LDI
dec e
inc d
LDI
dec e
inc d
LDI
dec e
inc d
LDI
dec a
ret z
ld d,#90

jp convLoop

;;Call &9c8c - convert from ROM to &9000 (MEMORY &8FFF needed)
;org 9c8c
.convertROM
call &b906
ld hl,&3800
call convertSymbolAfter0+3
call &b909
ret
