This tiny piece of code is a slow software monochrome sprite blitter that includes masking.
What do you need to know?
In DE you should provide an offscreen pointer, that is, an address within a reserved 6144-byte buffer laid out like the VRAM pattern (character) generator table in SCREEN 2. If your sprite should be drawn at (128,90) and your offscreen buffer is called SCREEN, load DE with SCREEN+(90 DIV 8)*256+(90 MOD 8)+(128 DIV 8)*8, that is, SCREEN+11*256+2+128. The routine does no bit shifting, so the sprite always lands on an 8-pixel column boundary. By the way, SCREEN must be aligned to an 8-byte boundary, because the routine tests the low three bits of E to detect the end of a character cell.
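The address formula above can be checked on a PC with a small Python sketch. The function name `screen2_offset` is hypothetical; it returns only the offset that gets added to SCREEN, and it assumes the usual 256x192 SCREEN 2 bitmap.

```python
# Hypothetical helper: byte offset of pixel (x, y) inside a SCREEN 2 style
# pattern table (32 character cells per row, 8 bytes per cell).
def screen2_offset(x, y):
    if not (0 <= x < 256 and 0 <= y < 192):
        raise ValueError("pixel outside the 256x192 screen")
    # (y DIV 8)*256 selects the character row, (y MOD 8) the line inside
    # the cell, (x DIV 8)*8 the cell within the row.
    return (y // 8) * 256 + (y % 8) + (x // 8) * 8

print(screen2_offset(128, 90))  # 11*256 + 2 + 128 = 2946
```

The value to load into DE would then be SCREEN plus this offset.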
In HL you should provide a pointer to a masked sprite. The format is as follows:
offset 0 - size 1 -> "Y" - vertical size in pixels
offset 1 - size 1 -> "X" - horizontal size in 8-pixel blocks
offset 2 - size (X*Y*2) -> "DATA" - interleaved pairs of mask byte + sprite pattern byte, stored column by column (all Y rows of the first 8-pixel column, then the next column)
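As an illustration of this layout, here is a minimal Python sketch of a build-time packer. The name `pack_sprite` and the row-major input lists are assumptions made for the example; the column-by-column output order is inferred from the PUTSPRITE routine, which finishes all Y rows of one column before adding 8 to DE.

```python
def pack_sprite(mask_rows, data_rows):
    """Pack a sprite into the Y, X, DATA format read by PUTSPRITE.

    mask_rows / data_rows: lists of Y rows, each a list of X bytes
    (hypothetical input layout, one byte per 8 pixels).
    """
    height = len(data_rows)        # "Y": vertical size in pixels
    width = len(data_rows[0])      # "X": horizontal size in 8-pixel blocks
    out = bytearray([height, width])
    for col in range(width):       # columns are the outer loop
        for row in range(height):
            out.append(mask_rows[row][col])   # mask byte first
            out.append(data_rows[row][col])   # then the pattern byte
    return bytes(out)

# A 16x2 pixel sprite: 2 header bytes + 2*2*2 data bytes = 10 bytes.
packed = pack_sprite([[0xF0, 0x0F], [0xAA, 0x55]], [[1, 2], [3, 4]])
print(len(packed))  # 10
```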
Of course, you then need to copy the offscreen buffer to VRAM somehow. A full 6144-byte LDIRVM does the trick, but it is very slow.
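PUTSPRITE:
        ; DE=screen position; HL=sprite pointer
        ld b,[hl]       ; B = Y, height in pixel rows
        inc hl
        ld c,[hl]       ; C = X, width in 8-pixel columns
        inc hl
@@main:
        push bc         ; save height and remaining columns
        push de         ; save top of the current column
        ld c,8          ; C = 8: bytes per character cell
@@loop:
        ld a,[de]       ; read background byte
        and [hl]        ; apply mask
        inc hl
        or [hl]         ; merge sprite pattern
        inc hl
        ld [de],a       ; write back
        inc e           ; next pixel row in this cell
        ld a,e
        and 7
        jp nz,@@ok      ; still inside the same cell
        ld a,e          ; crossed the cell: go back 8 bytes...
        sub c
        ld e,a
        inc d           ; ...and forward 256, to the cell below
@@ok:
        djnz @@loop     ; repeat for every pixel row
        pop de          ; back to the top of the column
        ex de,hl
        add hl,bc       ; BC = 8 here (B = 0 after djnz): next column
        ex de,hl
        pop bc
        dec c
        jp nz,@@main    ; repeat for every column
        ret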
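To check the pointer arithmetic away from the MSX, the routine can be modelled in Python. This is only a sketch under stated assumptions: `put_sprite` is a hypothetical name, the screen buffer starts at offset 0 (so it is trivially 8-byte aligned), and the sprite sizes are non-zero.

```python
def put_sprite(screen, de, sprite):
    """Python model of PUTSPRITE: screen is a bytearray, de an offset into it."""
    height, width = sprite[0], sprite[1]    # B and C in the Z80 code
    hl = 2                                  # first mask byte
    for _ in range(width):                  # @@main: one 8-pixel column
        pos = de
        for _ in range(height):             # @@loop: one byte per pixel row
            screen[pos] = (screen[pos] & sprite[hl]) | sprite[hl + 1]
            hl += 2
            pos += 1
            if pos % 8 == 0:                # left the character cell:
                pos += 256 - 8              # jump to the cell directly below
        de += 8                             # next column to the right

screen = bytearray([0xFF] * 6144)
sprite = bytes([2, 1, 0x0F, 0xA0, 0x00, 0x3C])   # 8x2 pixels, mask+data pairs
put_sprite(screen, 7 + 1 * 8, sprite)            # pixel (8, 7), offset 15
print(hex(screen[15]), hex(screen[264]))         # 0xaf 0x3c
```

The second row lands at offset 264, the first line of the cell below, which is the same step the `sub c` / `inc d` pair performs.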
It is supposedly optimized for speed, but it is still very slow. Anyone willing to improve it? Of course, if masking is not required, it can be rewritten and optimized further to avoid so many slow stack operations.
I've included a tiny ROM with just two overlapping software sprites. Graphics taken from Joe Blade! Note that each sprite is 32x32 pixels at 1 bpp plus a 1 bpp mask (2 bpp in total). That makes 256 bytes of data per sprite, processed in 128 loop iterations of 3 memory reads and 1 memory write each: 384 reads and 128 writes, without counting opcode fetches or stack operations!
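A quick way to tally the data accesses of the inner loop as listed is the following Python sketch. The function name `access_counts` is hypothetical. It counts only the three reads (`ld a,[de]`, `and [hl]`, `or [hl]`) and the one write (`ld [de],a`) per iteration, and leaves out opcode fetches and stack traffic.

```python
def access_counts(height, width_blocks):
    """Return (sprite data bytes, memory reads, memory writes) for one blit."""
    iterations = height * width_blocks   # one pass of @@loop per screen byte
    sprite_bytes = iterations * 2        # mask byte + pattern byte
    reads = iterations * 3               # background, mask, pattern
    writes = iterations                  # merged byte written back
    return sprite_bytes, reads, writes

print(access_counts(32, 4))  # (256, 384, 128) for a 32x32 sprite
```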