Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!uwm.edu!bionet!agate!ucbvax!bloom-beacon!eru!hagbard!sunic!mcsun!ukc!acorn!john From: john@acorn.co.uk (John Bowler) Newsgroups: comp.arch Subject: Re: Bitfield and loop instructions--a good idea? Summary: Use in bitblt Message-ID: <6444@acorn.co.uk> Date: 17 Apr 91 12:32:19 GMT References: <1991Apr15.193425.3436@waikato.ac.nz> <2302@spim.mips.COM> Organization: Acorn Computers Ltd, Cambridge, UK Lines: 123 In article meissner@osf.org (Michael Meissner) writes: >In article <2302@spim.mips.COM> zalman@mips.com (Zalman Stern) writes: > >| Rlimi does left and right logical immediate shifts. (I think it corresponds >| to what most shifter hardware looks like internally as well.) I hacked some >| assembly for this machine and was amazed at how often rlimi came in handy. > >I wouldn't be surprised if a workstation running X, actually could do >a fair bit of these instructions in the server bitblt's. Of course if >you don't have a direct display, but use an X terminal, you wouldn't >see these. However, it's probably harder to measure dynamic >instruction frequency in such a case..... The inner bitblt loop of an X server becomes very important under some circumstances. Particularly text scrolling and window movement (these both have a significant effect on the user perception of performance). The main problem with bit blt in X is reducing memory access. Frequently screen memory is not cached ('cause it tends not to be accessed in a localised fashion), so what matters is avoiding accessing the same word more than once. The general scan line raster op is (I may get this wrong...):- dst := dst & ~mask | (dst OP src) & mask (mask for the plane mask and to deal with the start and end of the scan line). Given this and a machine architecture which favours word aligned access (or, indeed, mandates it!) it is convenient to align the ``src'' value to the ``dst''. Then the algorithm becomes:- load a word of dst load the next word of src * use previous src and newly loaded word to calculate a dst-aligned src word perform the calculation (above) loop All the bit twiddling is in (*). If the architecture *only* has one bit shifts, resign and go to work for a company which uses a sensible architecture. If the architecture *only* has immediate constant shifts write a raster-op compiler for the above code which will compile the appropriate instructions for a given dst/src mis-alignment. Do all the optimisations you can (eg for co-aligned dst/src; as encountered in vertical scrolling (xterm) - no shifts required!) If the architecture has shifts by amounts held in registers, forget raster-op compilers (the setup cost is too great), do the shifts in-line. The best architecture allows the combination of logical shift operations with logical arithmetic; you make ``src'' using (on a little endian architecture):- src = prev >> shift; prev = *srcptr++; src |= prev << (BPW-shift); (BPW = Bits Per Word). Given that (BPW-shift) is ammenable to CSE by the compiler this is about four instructions (well, it's three instructions on the ARM; for those who understand the ARM assembler:- MOV src, prev, LSR shift LDR prev, [srcptr], #4 ORR src, src, prev, LSL BPWshift ) Conclusions? The general bit extract instruction is not necessary - the bits are (by careful arrangement of the algorithm) always at one end or the other of the word. Logical shifts by variable amounts are essential, being able to specify the amount in a register helps (incidentally doing this doubles the execution time on the ARM - four registers in the instruction rather than the normal three). Being able to combine shifts with logical operations helps slightly. The only way in which I can see a general bit extract helping is if an instruction of the form:- extract 32 bits from this register *PAIR* exists; then the above can be reduced to two instructions. Is this really worth it? This loop is for a general raster op (ie the loop does all the possible X raster ops, not just one of them) and for any source/destination alignment. On an ARM the loop amounts to 17 instructions of which 14 are arithmetic (logical), and 3 access memory. The whole algorithm can be expressed in C - not only are there no hardware ``tricks'' involved, there is no use of inaccessible machine operations such as rotate. Of these instructions only two involve bit shifts. Hence a further conclusion; even in code where bit level operations are (apparently) important they are still relatively infrequent. In the above loop (on the ARM) the shift instructions account for under one seventh of the total time. I'm not sure that any meaningful conclusions can be drawn from this, but, for comparison, the memory access takes over one third of the time and doing the raster op takes about one third. When the source is being copied to the destination and the mask is inactive (all ones) - the common case, the raster op is not necessary and one memory access can be saved (loading dst). In this case (again on the ARM) 6 instructions are needed and almost one quarter of the time goes in the shift operations. One half goes in memory access (the other quarter is in the two instructions which do the DBRA equivalent - no delayed branches!). This code is equivalent to optimised ARM code for bcopy/memcpy; so the case is important in other programs too. Only in this case is the overhead of bit twiddling or of loop handling starting to appear - and this is the most extreme case which I can think of. In practice (for memcpy) loop unwinding and special casing of the various shifts starts to pay off. But loop unwinding allows sequential memory access for load/store which is a big win, potentially reducing memory access time to a half of what it would otherwise be; forget the saving on loop instructions! Special casing the shifts halves the cost on the ARM - but this is a matter of having the instuctions go fast, not of having wacky instructions which do appropriate things. The real irony is that these *programming* techniques make the advantage of even the special purpose CISC byte string movement instructions relatively small; even non-existent if they are not actually implemented optimally (in particular if they do not take advantage of sequential mode access to DRAM when other instructions can). John Bowler (jbowler@acorn.co.uk)