Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!pacific.mps.ohio-state.edu!linac!att!att!cbnewsk!cbnewsj!dwex
From: dwex@cbnewsj.att.com (david.e.wexelblat)
Newsgroups: comp.sys.3b1
Subject: Re: Replacement for wind.o
Keywords: MGR
Message-ID: <1991May6.152805.10583@cbnewsj.att.com>
Date: 6 May 91 15:28:05 GMT
References: <1991May3.163220.24448@cbnewsj.att.com> <1991May4.062447.7923@yenta.alb.nm.us>
Organization: AT&T Bell Laboratories
Lines: 64

In article <1991May4.062447.7923@yenta.alb.nm.us> dt@yenta.alb.nm.us (David B. Thomas) writes:
	
	[stuff deleted]
> 
> 3. My new job has me writing bit blit routines in assembly languages all day
> long.  What's one more?  I'm going to code all of mgr's bitblits in 68010
> assembler and get this baby cookin'.
> 
> 					little david
> -- 
> Unix is not your mother.

Are you aware of the loop-mode instructions for the 68010?  They are
discussed on the last few pages of the 68000-68008-68010 book from
Motorola.  I did some testing, and for long copies (> ~100 bytes)
they are a whole lot faster.  Apparently the compiler doesn't
use them.  I wrote a memcpy()-type routine, and compiled it with
and without the optimizer, and it did not use these instructions.
The libc.a versions do use them, so either these were hand-coded
in assembler, were hand optimized, used a different compiler, or
I'm missing something.  The MGR bitblt could be sped up a lot
just by using these instructions.

The way they work (this is from memory; my book is at home) is as
follows.  Given a normal copy function:
	
	for (i=100; i > 0; i--)
		*dest++ = *src++;

the compiler outputs something like:

	mov.l	&100,%d0
	mov.l	dest,%a0
	mov.l	src,%a1
top:	mov.b	(%a1)+,(%a0)+
	sub.l	&1,%d0
	bgt	top

Convert this to the following (dbf falls through only when the counter
reaches -1, so load 99 to copy 100 bytes):

	mov.l	&99,%d0
	mov.l	dest,%a0
	mov.l	src,%a1
top:	mov.b	(%a1)+,(%a0)+
	dbf	%d0,top

and the 68010 reads this as loop mode (due to its prefetch), and
does not fetch the move or branch instructions again, saving 4
memory accesses per iteration (1 for mov.b, 1 for sub.l, and 2 for
bgt).  This is a big win.  Note that it only works for branches with
a negative displacement of 4 (i.e. a single one-word instruction
before the dbxx), which happens to be ideal for copies.
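
For anyone who wants to check the loop count before writing the
assembler, here is a rough C model of the mov.b/dbf pair.  It is only a
sketch: the name copy_dbf_style is made up for this post, and C cannot
show the prefetch behaviour, only the counting.  The two things it
illustrates are that dbf keeps branching until the counter reaches -1
(so N bytes need a starting count of N-1), and that dbf only looks at
the low 16 bits of the register, so one loop covers at most 65536 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from MGR or libc): models
 *     top: mov.b (%a1)+,(%a0)+
 *          dbf   %d0,top
 * dbf decrements the low word of %d0 and branches unless the result
 * is -1, so a starting count of n-1 copies n bytes. */
void copy_dbf_style(unsigned char *dest, const unsigned char *src, size_t n)
{
    assert(n == 0 || (dest != NULL && src != NULL));

    while (n > 0) {
        /* The counter is 16 bits wide, so split long copies into chunks. */
        size_t chunk = n > 65536 ? 65536 : n;
        uint16_t count = (uint16_t)(chunk - 1);    /* mov.l &chunk-1,%d0 */

        do {
            *dest++ = *src++;                      /* mov.b (%a1)+,(%a0)+ */
        } while (count-- != 0);                    /* dbf %d0,top */

        n -= chunk;
    }
}
```

Calling copy_dbf_style(dest, src, 100) runs the inner loop with a
starting count of 99, which is the value the assembler version should
load into %d0.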

Anyhow, I think this would make a huge improvement to MGR,
since it showed me approx 10 times the performance on a
quick 1000-byte-copy benchmark.  Check it out.
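
If you want to repeat the comparison, something along these lines would
do.  This is a sketch, not the program I actually ran: copy_bytewise,
copy_fn and time_copy are names invented here, and the repeat count is
whatever gives you a measurable time on your machine.  Pass it the
plain C loop and then your loop-mode routine and compare the two
numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical harness: any routine with a memcpy-like argument order. */
typedef void (*copy_fn)(unsigned char *, const unsigned char *, size_t);

/* Baseline: the plain counted byte loop from the C example. */
void copy_bytewise(unsigned char *dest, const unsigned char *src, size_t n)
{
    size_t i;

    for (i = n; i > 0; i--)
        *dest++ = *src++;
}

/* Run fn over an n-byte buffer reps times and return CPU seconds used. */
double time_copy(copy_fn fn, unsigned char *dest, const unsigned char *src,
                 size_t n, long reps)
{
    clock_t start;
    long r;

    assert(reps > 0);
    start = clock();
    for (r = 0; r < reps; r++)
        fn(dest, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Use a 1000-byte buffer to match the test described above, and make sure
the compiler has not turned the baseline loop into a library call, or
the comparison means nothing.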


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
David Wexelblat             | dwex@mtgzz.att.com    | I asked her her name.
AT&T Bell Laboratories      | ...!att!mtgzz!dwex    |   She said her name was
200 Laurel Ave - 4B-421     |                       |      'Maybe'
Middletown, NJ  07748       | (201) 957-5871        | --Damn Yankees