Hi Axelay, thanks for a great reply! I have been heavily editing this post (as is characteristic of me) as I think more about your suggestions.
I've had a bit of a look at what I presume is the point you are doing the double buffering, and I think the problem is you have not waited for vsync with the swap to &4000 after your screen copy. […] just before your code swaps to the screen at &4000 […] the beam is not quite halfway through refreshing the monitor. More importantly, the CRTC is in the middle of displaying the current screen, and it will not reflect a change to the screen base address until it begins a new screen. So potentially, your code will go on to start modifying &c000 while the CRTC is still pointing to that area of memory for the current screen. I believe that is what is visible briefly at the bottom of the screen.
This is fantastic - thanks very much!

In fact, now that I think about it, that would only have started after I changed my vertical counter to run negatively over the scanlines instead of from top to bottom - so that I could use DEC C:JP NZ instead of DEC C:LD A,C:CP 110 - which was an infinitesimal optimisation at best.
Thus, the new rendering frame began at the bottom of the screen - just in time for the CRTC to see it, I presume, as per your insights. Actually, I might have been looping backwards even before I increased the screen size, but if so, I assume that the lowest line was too high for me to notice the problem.
Changing the loops back to incrementing from 0 to 109, a.k.a. rendering from top to bottom, has everything working perfectly again!

That's just the timing of the refresh being kind to me, though! For a moment I was concerned that I might have to do two flybacks or something, or that there'd be no way around it. But after what you said, I realised that it seems I don't need to do even one V-sync, if the CRTC won't apply the new address until the next frame - certainly, removing the wait does not seem to harm anything. So, that's a second good hint!
However, I wonder if something like that will re-emerge when I implement more complex algorithms that might change this rather precarious balance between my code and the (however real or virtual) electron gun...
If I wanted to be really obsessive about trivial optimisations, I suppose I could store my table of scanline addresses in reverse order and go back to decrementing... but insanity lies that way, I think, considering how minor the outer loop over the lines is. In fact, I might even see if I can take another slight hit to speed but gain some valuable memory by removing the table and just incrementing the address as I go along. At this rate, an extra 512 bytes might be a valuable commodity soon!
Edit: Yes, it seems that adding manually will take only one more NOP but remove the need for that table - which I have just enthusiastically deleted. Ah, I love finding new tricks! :V
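For anyone curious about what "adding manually" amounts to, here is a rough sketch in Python rather than Z80, just to check the arithmetic. It assumes the standard CPC layout (screen at &C000, 80 bytes per line, 200 lines), which is not my resized screen, so BYTES_PER_LINE and LINES are illustrative values only, and the function names are made up for this example. It compares the table formula against stepping from one scanline to the next.

```python
# Sketch only: standard CPC screen layout is assumed here, not my resized one.
SCREEN_BASE = 0xC000      # start of the 16K screen bank
BYTES_PER_LINE = 80       # assumption: standard width
LINES = 200               # assumption: standard height


def table_address(y):
    """Address of scanline y, as a precomputed table would store it."""
    return SCREEN_BASE + (y // 8) * BYTES_PER_LINE + (y % 8) * 0x800


def next_line(addr):
    """Step from one scanline to the one below it without a table."""
    addr += 0x800                          # next pixel row of the same character row
    if addr >= SCREEN_BASE + 0x4000:       # ran off the end of the 16K bank
        addr += BYTES_PER_LINE - 0x4000    # wrap to the top of the next character row
    return addr & 0xFFFF


def walk_addresses():
    """Collect every scanline address by repeated stepping."""
    addr = SCREEN_BASE
    out = []
    for _ in range(LINES):
        out.append(addr)
        addr = next_line(addr)
    return out


if __name__ == "__main__":
    assert walk_addresses() == [table_address(y) for y in range(LINES)]
    print("stepping matches the table for all", LINES, "lines")
```

The wrap-around correction is only needed once every eight lines, which is why the extra cost per scanline stays so small.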
Also, I hope you don't mind a comment on your screen copy. You have used a short LDI string, I guess because LDI strings are faster than LDIR? But your LDI string is too short if speed was the reason for using it. LDI is only 1 NOP faster than LDIR per byte, and with just 8 LDIs, you only have 8 NOPs saved in that loop compared to LDIR, but your loop structure with two counters is at least 8 NOPs by itself, so there is no speed gain with the LDI list that short. Although if I am not mistaken your copy completes with HL=&0000? If that's correct then a faster loop would be to replace both the A and IXH counter checks with LD A,H / OR A,L / JR NZ,Loop after your LDI string.
I can't complain about constructive comments and suggestions, really; delving into someone else's machine code, especially mine, is an effort that I think should usually be appreciated - and anyway, I think everyone here is too nice to steal anything!
I believe my calculations for this were a bit off before, but the unrolled variant should still be slightly faster than a simple LDIR, by about 15 ms (unless my concept of micro-time is a bit off).
Your suggestion about HL is very clever - at first it seemed like my 16-bit loop counter 'trick' could still win (thankfully the added overhead of using IXH was minimal due to its low value) if I increased the level of unrolling - but then I realised that my sums had been wrong (again...) - your version seems to be slightly faster, saving me about 6 ms over my x8 version! Which more than outweighs the puny 1 NOP per scanline additional cost of summing to the next scanline on the screen manually instead of using the table.
Here are my latest sums:
; simple
ldir ; ~&7800x6 = 184324 NOPs / 184 ms
; unroll x8
; thus outer counter must be &7800/&100/8 = 15
ldi ; 8x5 = 40 NOPs
dec a:jr nz ; 4 = 44 x 256
dec ixl:jr nz ; 5 * 15
; = 44x256x15 + 5*15
; = 169035 NOPs / 169 ms
; unroll x12 (16 does not divide &7800/&100 = 120 evenly)
; thus outer counter must be &7800/&100/12 = 10
ldi ; 12x5 = 60 NOPs
dec a:jr nz; 4 = 64 x 256
dec ixl: jr nz ; 5 * 10
; = 64x256x10 + 5*10
; = 163890 NOPs / 164 ms
; Axelay's suggestion
ldi ; 16x5 = 80 NOPs
ld a,h ; 1
or l ; 1
jr nz ; 3
; = 85 * 1920 = 163200 NOPs / ~163 ms
So, you're now a contributing author!
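In case anyone wants to re-check the sums (given my track record), here is a quick sketch in Python. It only uses the per-instruction NOP counts quoted in the listing above and takes 1 NOP as 1 microsecond; the function names are made up for this example, and loop setup is ignored.

```python
# Sketch only: re-derives the NOP totals from the listing above.
COPY_BYTES = 0x7800        # bytes copied per frame
LDI_NOPS = 5               # per LDI, as in the listing
LDIR_NOPS_PER_BYTE = 6     # per byte for LDIR, as in the listing


def two_counter_loop(unroll, inner_nops=4, outer_nops=5):
    """LDI x unroll with DEC A:JR NZ (256 passes) inside DEC IXL:JR NZ."""
    outer = COPY_BYTES // 0x100 // unroll
    assert outer * 0x100 * unroll == COPY_BYTES, "unroll must divide the copy evenly"
    return (unroll * LDI_NOPS + inner_nops) * 256 * outer + outer_nops * outer


def hl_test_loop(unroll, test_nops=5):
    """LDI x unroll followed by LD A,H / OR L / JR NZ (Axelay's version)."""
    passes = COPY_BYTES // unroll
    assert passes * unroll == COPY_BYTES, "unroll must divide the copy evenly"
    return (unroll * LDI_NOPS + test_nops) * passes


def nops_to_ms(nops):
    """1 NOP is taken as 1 microsecond."""
    return nops / 1000.0


results = {
    "ldir": COPY_BYTES * LDIR_NOPS_PER_BYTE,   # 184320; the ~184324 in my listing is approximate
    "unroll x8": two_counter_loop(8),
    "unroll x12": two_counter_loop(12),
    "hl test x16": hl_test_loop(16),
}

if __name__ == "__main__":
    for name, nops in results.items():
        print(f"{name:12s} {nops:7d} NOPs  ~{nops_to_ms(nops):.0f} ms")
```

The totals for the three unrolled variants come out the same as in my listing, so the roughly 6 ms gap between the x8 loop and the HL test looks right.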

(Incidentally, even though there's no point changing back, I wonder if these would have been enough of a gain to stop the flickering, haha)
Btw, as for unrolling, that's a pretty nifty CLS and memory clear at the beginning, right?

It felt even more epic when there was just one gigantic contiguous area of memory to clear! But that layout had the corollary that my screen and data were not adjacent, meaning that I had to do two LDIRs per frame - so, out of a single-time loop and one that happens every frame, it was obvious which one would have its way with my RAM layout.
Really appreciate the great comments. In particular, it might have taken me a while, perhaps forever, to realise how I'd messed up the screen refreshing, so that's brilliant. Thanks a lot!
Edit to add: I might as well attach the latest revision incorporating all the changes I've mentioned/rambled at length about: