Xref: utzoo comp.sys.atari.st:12866 comp.os.minix:4281
Path: utzoo!attcan!uunet!mcvax!tnoibbc!hp4nl!philmds!leo
From: leo@philmds.UUCP (Leo de Wit)
Newsgroups: comp.sys.atari.st,comp.os.minix
Subject: Re: improved long multiply for GCC -m68000
Message-ID: <889@philmds.UUCP>
Date: 11 Dec 88 18:39:56 GMT
References: <22486@watmath.waterloo.edu> <883@philmds.UUCP>
Reply-To: leo@philmds.UUCP (Leo de Wit)
Distribution: comp
Organization: Philips I&E DTS Eindhoven
Lines: 241

After doing some homework I can now present a long multiply that is
time efficient also in cases of small negative operands, and only needs
one or two unsigned hardware multiplies plus additional overhead.

Since the discussion also started here I post it to this newsgroup
(instead of submitting to comp.sources); besides I'm having trouble
getting stuff posted in that group (still waiting to see 'nlqps2' and
'life' appear in the binaries group, Steven!).  Take it ('s') or leave
it ('n').

I tested it for all combinations of operands that are positive or
negative powers of two, and furthermore for a number of random cases.
This revealed no errors (except of course the cases that overflowed).
Enjoy!
          Leo.

N.B. Assemble with GST Assembler, or modify to suit your favorite asm.
-------------- s t a r t   o f   m u l l i ( ) ----------------------
******************************************************************************
*                                                                            *
*    mulli.asm  version 1.0 of 12 December 1988   (C) L.J.M. de Wit 1988     *
*                                                                            *
* This software may be used and distributed freely if not used commercially  *
* and the originator (me) is mentioned in the source (just leave this 9 line *
* header intact).                                                            *
*                                                                            *
******************************************************************************
*
* NAME
*    (u)mulli - improved long (un)signed multiply
*
* SYNTAX
*    long mulli(a,b) long a, b;
*    unsigned long umulli(a,b) unsigned long a, b;
*
* DESCRIPTION
*    Mulli was designed after a proposal presented by Eric Gisin to provide a
*    fast long multiply (signed & unsigned); thanks, Eric!
*    Like in the original, since a non-sign extending multiply is the same for
*    signed and unsigned cases, one version suffices. The drawback is that
*    there is no overflow detection, because this depends on the case, e.g.
*
*        0x00000010 x 0x01000000 == 0x10000000
*
*    which is correct in the unsigned, but overflows in the signed case.
*    IN CASE OF OVERFLOW, NO ASSUMPTIONS SHOULD BE MADE UPON THE RESULT.
*
*    A long x long -> long multiply can be written as follows (p,q,r,s
*    are 16 bit numbers):
*
*                     High word    Low word
*    First operand:     q           p
*    Second operand:    s           r
*    --------------------------------------------- *
*    (q * s << 32) + (q * r + p * s << 16) + p * r
*
*    Since the result is a 32 bit number, this can be written:
*        ((LOW(q * r) + LOW(p*s)) << 16) + p * r
*    or
*        MED(q * r) + MED(p*s) + p * r.
*    where LOW() denotes the low order word (16 bits) of a 16 or 32 bit number,
*    and MED() is this low order word shifted 16 places (so a 32 bit number).
*    This demands effectively three (or less) multiplies (and some additional
*    stuff like shifts and extended adds).
*    It is clear that whenever the high word part of an input operand is 0,
*    as in the case of small positive numbers, some part of the result does not
*    need to be calculated and so one multiply less is required (the original
*    used this in the case both operands had their high parts 0, so that the
*    operation degrades to p * r).
*    For instance, if q == 0, the operation degrades to
*        MED(p*s) + p * r.
*    An other simplification can be made for small negative numbers, that is,
*    when the high part of an operand is 0xffff. For instance, when q == 0xffff
*    the part involving q can be written:
*        LOW(0xffff * r) << 16  ==
*        LOW(0x10000 * r - r) << 16 ==
*        LOW(-r) << 16 ==
*        MED(-r) ==
*        -MED(r)
*    So in this case we can also save a multiply.
*    As in the original, shifting by 16 is accomplished by swapping a register
*    and clearing the low order word, which is faster than doing a
*        asl.l #16,d0.
*    Since the result goes into a 32 bit longword, add is used instead of addx
*    for addition of partial results (the carry would go into the next longword
*    anyway).
*    The routine splits up into the combinations of the following cases:
*        a) the high word of an operand is 0 (0x0000).
*        b) the high word of an operand is -1 (0xffff).
*        c) the high word of an operand is neither 0 nor -1.
*    which amounts to 9 different cases (of which one will always overflow and
*    is not split up as a separate case).
*
* SPEED
*    As in the original, register d2 is used as an alternative calculation
*    register and its contents saved in a1. In the case of small x small
*    multiplication, there is no need for d2 and we save 12 ticks by not using
*    it. In the other cases we save, depending on the case, from 44 to 148
*    ticks (with respect to the original). This effectively means that all 
*    small x small multiplies with at least one negative operand are now twice
*    as fast.
*
*    Further speed increases could be gained by:
*    a) Passing the operands in registers. Lattice C uses this default
*        for its builtin multiply (CXM33 for signed and CXM22 for unsigned
*        multiply). This saves 2 x 16 == 32 ticks in the callee
*            (move.l 4(sp),d0 and move.l 8(sp),d1)
*        and 2 x 12 + 4 == 28 ticks in the caller
*            (move.l d1,-(sp) , move.l d2,-(sp) and addq.l #8,sp)
*        , thus gaining 60 ticks!
*    b) Having d2 as a scratch register. Lattice C does not need to save
*        the d2 register, since it is used as a scratch (temporary, calculation)
*        register. This means that all move.l a1,d2 and move.l d2,a1 sequences
*        can be removed. Since most cases require one of each (see code), this
*        saves 2 x 4 == 8 ticks on these cases.
*    c) Inlining.
*    d) Whatever shrewd idea you might have. I'm very interested!
*
* BUGS/SHORTCOMINGS
*    Since the code was created with speed as main motive, no objections should
*    be made against e.g. the code having multiple exit points (this saves us
*    another 12 ticks!) and having separate regions for parts that could easily
*    be shared (again saving branch instructions).
*
*    No overflow detection, and incorrect results in case of overflow.
*
*    Care has been taken to use meaningless labels 8-).
*
* NOTATION
*    The notation Dreg = (a,b) is used to mean Dreg.hi = a, Dreg.lo = b.
*

   section s.ccode

   xdef mulli
   xdef umulli

mulli
umulli
   move.l   4(sp),d0    ; q = d0.hi, p = d0.lo, d0 = (q,p)
   move.l   8(sp),d1    ; s = d1.hi, r = d1.lo, d1 = (s,r)
   swap     d0          ; d0 = (p,q)
   tst.w    d0          ;
   bne.s    alfa        ;
   swap     d0          ;
   swap     d1          ;
   tst.w    d1          ;
   bne.s    beta        ;
   swap     d1          ; d0 = (0,p), d1 = (0,r), q = 0, s = 0
   mulu.w   d1,d0       ; d0 = p * q
   rts                  ; case: q == 0 && s == 0

alfa
   move.l   d2,a1       ; save d2
   addq.w   #1,d0       ; d0 = (p,q+1), d1 = (s,r)
   bne.s    gamma       ;
   swap     d0          ; d0 = (0,p), q = -1
   move.w   d0,d2       ; 
   mulu.w   d1,d2       ; d2 = p * r
   swap     d1          ;
   tst.w    d1          ;
   bne.s    delta       ;
   move.l   d2,d0       ; d0 = p * r, d1 = (r, 0), s = 0
   sub.l    d1,d0       ; d0 = p * r - MED(r)
   move.l   a1,d2       ; restore d2
   rts                  ; case: q == -1 && s == 0

delta
   addq.w   #1,d1       ; d0 = (0,p), d1 = (r,s+1), d2 = p * r, q = -1
   bne.s    epsilon     ;
   swap     d0          ; d0 = (p,0), d1 = (r,0), s = 0
   sub.l    d0,d2       ; d2 = p * r - MED(p)
   sub.l    d1,d2       ; d2 = p * r - MED(p) - MED(r)
   move.l   d2,d0       ; d0 = p * r - MED(p) - MED(r)
   move.l   a1,d2       ; restore d2
   rts                  ; case: q == -1 && s == -1

epsilon
   subq.w   #1,d1       ; d0 = (0,p), d1 = (r,s), d2 = p * r
   mulu.w   d1,d0       ; d0 = p * s
   swap     d0          ;
   clr.w    d0          ; d0 = MED(p * s)
   clr.w    d1          ; d1 = MED(r)
   add.l    d0,d2       ; d2 = p * r + MED(p * s)
   sub.l    d1,d2       ; d2 = p * r + MED(p * s) - MED(r)
   move.l   d2,d0       ; d0 = p * r + MED(p * s) - MED(r)
   move.l   a1,d2       ; restore d2
   rts                  ; case: q == -1 && s == n

gamma
   subq.w   #1,d0       ; d0 = (p,q), d1 = (s,r)
   move.w   d0,d2       ;
   mulu.w   d1,d2       ; d2 = q * r
   swap     d0          ; d0 = (q,p)
   tst.l    d1          ; 
   bmi.s    zeta        ;
   mulu.w   d1,d0       ; d0 = p * r, s >= 0
   swap     d2          ;
   clr.w    d2          ; d2 = MED(q * r)
   add.l    d2,d0       ; d0 = p * r + MED(q * r)
   move.l   a1,d2       ; restore d2
   rts                  ; case: q == n && (s == 0 || s == n)
; note that MED(p,s) is ignored since if s != 0 there will be overflow anyway
; and thus s is assumed to be 0

zeta
   mulu.w   d0,d1       ; d0 = (q,p), d1 = p * r, d2 = q * r, s < 0
   sub.w    d0,d2       ; 
   swap     d2          ;
   clr.w    d2          ; d2 = MED(q * r) - MED(p)
   add.l    d2,d1       ; d1 = p * r + MED(q * r) - MED(p)
   move.l   d1,d0       ; d0 = p * r + MED(q * r) - MED(p)
   move.l   a1,d2       ; restore d2
   rts                  ; case: q == n && (s == -1 || s == n)
; note that MED(p,s) is ignored since if s != -1 there will be overflow anyway
; and thus s is assumed to be -1

beta
   addq.w   #1,d1       ; d0 = (0,p), d1 = (r,s+1), q = 0
   bne.s    theta       ;
   swap     d1          ; d1 = (0,r), s = -1
   mulu.w   d0,d1       ; d1 = p * r
   swap     d0          ; d0 = MED(p)
   sub.l    d0,d1       ; d1 = p * r - MED(p)
   move.l   d1,d0       ; d0 = p * r - MED(p)
   rts                  ; case: q == 0 && s == -1

theta
   subq.w   #1,d1       ; d0 = (0,p), d1 = (r,s), q = 0
   move.l   d2,a1       ; save d2
   move.w   d1,d2       ; 
   mulu.w   d0,d2       ; d2 = p * s
   swap     d1          ; d1 = (s,r)
   mulu.w   d1,d0       ; d0 = p * r
   swap     d2          ;
   clr.w    d2          ; d2 = MED(p * s)
   add.l    d2,d0       ; d0 = p * r + MED(p * s)
   move.l   a1,d2       ; restore d2
   rts                  ; case: q == 0 && s == n

   end