Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!uwm.edu!bionet!agate!ucbvax!bloom-beacon!eru!hagbard!sunic!mcsun!ukc!acorn!john
From: john@acorn.co.uk (John Bowler)
Newsgroups: comp.arch
Subject: Re: Bitfield and loop instructions--a good idea?
Summary: Use in bitblt
Message-ID: <6444@acorn.co.uk>
Date: 17 Apr 91 12:32:19 GMT
References: <1991Apr15.193425.3436@waikato.ac.nz> <2302@spim.mips.COM> <MEISSNER.91Apr16125239@curley.osf.org>
Organization: Acorn Computers Ltd, Cambridge, UK
Lines: 123

In article <MEISSNER.91Apr16125239@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
>In article <2302@spim.mips.COM> zalman@mips.com (Zalman Stern) writes:
>
>| Rlimi does left and right logical immediate shifts. (I think it corresponds
>| to what most shifter hardware looks like internally as well.) I hacked some
>| assembly for this machine and was amazed at how often rlimi came in handy.
>
>I wouldn't be surprised if a workstation running X, actually could do
>a fair bit of these instructions in the server bitblt's.  Of course if
>you don't have a direct display, but use an X terminal, you wouldn't
>see these.  However, it's probably harder to measure dynamic
>instruction frequency in such a case.....

The inner bitblt loop of an X server becomes very important under some
circumstances.  Particularly text scrolling and window movement (these both
have a significant effect on the user perception of performance).

The main problem with bit blt in X is reducing memory access.  Frequently
screen memory is not cached ('cause it tends not to be accessed in a
localised fashion), so what matters is avoiding accessing the same word
more than once.  The general scan line raster op is (I may get this
wrong...):-

	dst := dst & ~mask | (dst OP src) & mask

(mask for the plane mask and to deal with the start and end of the scan
line).  Given this and a machine architecture which favours word aligned
access (or, indeed, mandates it!) it is convenient to align the ``src''
value to the ``dst''.  Then the algorithm becomes:-

	load a word of dst
	load the next word of src
*	use previous src and newly loaded word to calculate a
		dst-aligned src word
	perform the calculation (above)
	loop

All the bit twiddling is in (*).  If the architecture *only* has one bit
shifts, resign and go to work for a company which uses a sensible
architecture.  If the architecture *only* has immediate constant shifts
write a raster-op compiler for the above code which will compile the
appropriate instructions for a given dst/src mis-alignment.  Do all the
optimisations you can (eg for co-aligned dst/src; as encountered in vertical
scrolling (xterm) - no shifts required!)

If the architecture has shifts by amounts held in registers, forget
raster-op compilers (the setup cost is too great), do the shifts in-line.
The best architecture allows the combination of logical shift operations
with logical arithmetic; you make ``src'' using (on a little endian
architecture):-

	src = prev >> shift;
	prev = *srcptr++;
	src |= prev << (BPW-shift);

(BPW = Bits Per Word).

Given that (BPW-shift) is ammenable to CSE by the compiler this is about
four instructions (well, it's three instructions on the ARM; for those who
understand the ARM assembler:-

	MOV	src, prev, LSR shift
	LDR	prev, [srcptr], #4
	ORR	src, src, prev, LSL BPWshift
)

Conclusions?  The general bit extract instruction is not necessary -
the bits are (by careful arrangement of the algorithm) always at one
end or the other of the word.  Logical shifts by variable amounts
are essential, being able to specify the amount in a register helps
(incidentally doing this doubles the execution time on the ARM -
four registers in the instruction rather than the normal three).
Being able to combine shifts with logical operations helps slightly.

The only way in which I can see a general bit extract helping is
if an instruction of the form:-

	extract 32 bits from this register *PAIR*

exists; then the above can be reduced to two instructions.  Is this
really worth it?

This loop is for a general raster op (ie the loop does all the possible X
raster ops, not just one of them) and for any source/destination alignment.
On an ARM the loop amounts to 17 instructions of which 14 are arithmetic
(logical), and 3 access memory.  The whole algorithm can be expressed in C
- not only are there no hardware ``tricks'' involved, there is no use of
inaccessible machine operations such as rotate.

Of these instructions only two involve bit shifts.  Hence a further
conclusion; even in code where bit level operations are (apparently)
important they are still relatively infrequent.  In the above loop (on the
ARM) the shift instructions account for under one seventh of the total time.
I'm not sure that any meaningful conclusions can be drawn from this, but,
for comparison, the memory access takes over one third of the time and
doing the raster op takes about one third.

When the source is being copied to the destination and the mask is
inactive (all ones) - the common case, the raster op is not necessary
and one memory access can be saved (loading dst).  In this case (again
on the ARM) 6 instructions are needed and almost one quarter of the
time goes in the shift operations.  One half goes in memory access
(the other quarter is in the two instructions which do the DBRA
equivalent - no delayed branches!).  This code is equivalent to
optimised ARM code for bcopy/memcpy; so the case is important in other
programs too.  Only in this case is the overhead of bit twiddling or
of loop handling starting to appear - and this is the most extreme
case which I can think of.

In practice (for memcpy) loop unwinding and special casing of the various
shifts starts to pay off.  But loop unwinding allows sequential memory
access for load/store which is a big win, potentially reducing memory access
time to a half of what it would otherwise be; forget the saving on loop
instructions!  Special casing the shifts halves the cost on the ARM - but
this is a matter of having the instuctions go fast, not of having wacky
instructions which do appropriate things.  The real irony is that these
*programming* techniques make the advantage of even the special purpose CISC
byte string movement instructions relatively small; even non-existent if
they are not actually implemented optimally (in particular if they do not
take advantage of sequential mode access to DRAM when other instructions
can).

John Bowler (jbowler@acorn.co.uk)