Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!kpc.com!ardent!mac
From: mac@gold.kpc.com (Mike McNamara)
Newsgroups: comp.arch
Subject: Re: Branchless conditionals (Was: More on Linpack pivoting)
Message-ID:
Date: 21 Jun 91 16:41:48 GMT
References: <396@validgh.com> <1991Jun13.234834.22970@neon.Stanford.EDU> <1991Jun14.134338.4673@linus.mitre.org> <1991Jun14.173510.22510@dg-rtp.dg.com>
Sender: uucp@kpc.com (UNIX-to-UNIX Copy)
Reply-To: mac@ardent.com (Michael McNamara)
Organization: Kubota Pacific Computer Incorporated, Santa Clara, CA
Lines: 71
In-Reply-To: glew@pdx007.intel.com's message of 16 Jun 91 22:15:03 GMT
Nntp-Posting-Host: gold

|> : ibig = absdx .le. dmax
|>   ^^^^^^^^^^^^^^^^^^^^^^^^
|>
|> I would be very interested in seeing the assembler code that gets
|> emitted for this line of Fortran. How can this statement get executed
|> WITHOUT a branch??

Firstly, the MIPS instruction set has SLT, SLTI, SLTU and SLTIU, which
set the destination register to 1 if the first operand is less than the
second; otherwise the destination is set to 0. The U suffix requests an
unsigned comparison; the I suffix indicates that the second operand is a
16-bit immediate.

Your code snippet does not contain declarations, but one is probably
correct in assuming that you are implicitly allowing absdx and dmax to
be typed as REALs, in which case the MIPS instructions would not work
(they are integer only).

Most vector machines (Ardent Titan, Convex, etc.) have a vector compare
instruction that sets or clears bits in a specified register based upon
the compare; further, other operations can be specified to operate
"under mask" of that register. This facilitates vectorization of loops
containing conditionals:

      DO I = 1,N
         IF ( X(I) .GT. Y(I) ) THEN
            X(I) = Y(I) * Z(I) + K
         ENDIF
      ENDDO

becomes

      sw       N,
      DVLDA    X,0(addr_X)
      DVLDB    Y,0(addr_Y)
      DVCGT    [M],X,Y
      DVLDA    Z,0(addr_Z)     ! could be under mask as well...
      DVMA,mt  X,X,Y,[K]
      DVST,mt  X,0(addr_X)

on an Ardent Titan.
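
For readers more at home in C than in Titan assembler, the compare-into-
a-flag idea and the under-mask loop above can be sketched roughly as
follows. This is only an illustration, not what any compiler emitted:
the names le_flag and masked_update are made up, K is taken to be a
double for simplicity, and whether a given C compiler turns the select
into branch-free code is up to that compiler.

```c
#include <stddef.h>

/* Hypothetical helper: the comparison itself yields 0 or 1, in the
 * spirit of "ibig = absdx .le. dmax" being a value, not a branch. */
static int le_flag(double a, double b)
{
    return a <= b;
}

/* Hypothetical helper: scalar emulation of the masked loop.
 * m plays the role of one bit of the mask register [M]. */
static void masked_update(double *x, const double *y, const double *z,
                          double k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int m = x[i] > y[i];          /* compare sets the mask bit       */
        double t = y[i] * z[i] + k;   /* multiply-add, computed for all  */
        x[i] = m ? t : x[i];          /* store only where the mask is 1  */
    }
}
```

With x = {5, 1}, y = {2, 3}, z = {10, 10} and k = 1, masked_update
leaves x = {21, 1}: only the first element passes the compare. Note that
this sketch computes the multiply-add for every element and selects
afterwards, whereas the ",mt" forms above suppress the operation itself
where the mask is clear.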

Ah, yes, Andy brings up the beloved Cydrome machine. I introduced the
intrinsic SELECT(EXP,A,B) to its Fortran compiler. *Amazing* speedups
were achieved, especially on loops with internal conditional code,
e.g., vector intrinsics. I was able to get a 17x improvement in ATAN2.
I find myself today still "thinking" in the predicated VLIW paradigm.

To be fair, the hardware model the Cydra provided facilitates the
software ideas in the control flow graph papers from IBM. Those of you
at the most recent ASPLOS saw how the Cydrome architecture implements
software pipelining in hardware; the predicated operations greatly
facilitate the control graph optimizations as well.

The one incompleteness of the Cydra was predicated exceptions. I.e.,
this instruction is PROMOTED: it has been moved out of its containing
block, so if it generates an exception, just set a bit in the
destination register. When a non-promoted instruction uses a value from
a register that has the exception bits set, THEN generate the fault...

Ah... musings...

-mac
--
+-----------+-----------------------------------------------------------------+
|mac@kpc.com| Increasing Software complexity lets us sell Mainframes as       |
|           | personal computers. Carry on, X windows/Postscript/emacs/CASE!! |
+-----------+-----------------------------------------------------------------+