Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!kpc.com!ardent!mac
From: mac@gold.kpc.com (Mike McNamara)
Newsgroups: comp.arch
Subject: Re: Branchless conditionals (Was: More on Linpack pivoting)
Message-ID: <MAC.91Jun21094148@gold.kpc.com>
Date: 21 Jun 91 16:41:48 GMT
References: <396@validgh.com> <1991Jun13.234834.22970@neon.Stanford.EDU>
	<1991Jun14.134338.4673@linus.mitre.org>
	<1991Jun14.173510.22510@dg-rtp.dg.com>
	<GLEW.91Jun16151503@pdx007.intel.com>
Sender: uucp@kpc.com (UNIX-to-UNIX Copy)
Reply-To: mac@ardent.com (Michael McNamara)
Organization: Kubota Pacific Computer Incorporated, Santa Clara, CA
Lines: 71
In-Reply-To: glew@pdx007.intel.com's message of 16 Jun 91 22:15:03 GMT
Nntp-Posting-Host: gold


|> :	    ibig  = absdx .le. dmax
|> 	    ^^^^^^^^^^^^^^^^^^^^^^^^
|> 
|> I would be very interested in seeing the assembler code that gets
|> emitted for this line of Fortran. How can this statement get executed
|> WITHOUT a branch??

	Firstly, the MIPS instruction set has SLT, SLTI, SLTU and
SLTIU, which set the destination register to 1 if the first operand
is less than the second, and to 0 otherwise.  The U suffix requests an
unsigned comparison; the I suffix indicates that the second operand is
a 16-bit immediate.
	Your code snippet does not contain declarations, but it is
probably safe to assume that absdx and dmax are implicitly typed as
REAL, in which case these MIPS instructions would not work (they are
integer only).
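
	For INTEGER operands, the SLT idea can be sketched in C roughly
as below.  This is an illustration, not compiler output: the function
names are made up for this sketch, and whether a given compiler really
emits SLT with no branch depends on the compiler and the target.

```c
#include <stdint.h>

/* Models MIPS SLT: result is 1 if a < b (signed compare), else 0. */
int32_t slt(int32_t a, int32_t b)
{
    return a < b;
}

/* Models MIPS SLTU: same idea, but with an unsigned compare. */
int32_t sltu(uint32_t a, uint32_t b)
{
    return a < b;
}

/*
 * ibig = absdx .le. dmax, assuming INTEGER operands.
 * a <= b is the same as !(b < a), so one SLT with the operands
 * swapped, followed by an XOR with 1, produces the 0/1 result.
 */
int32_t le_via_slt(int32_t absdx, int32_t dmax)
{
    return slt(dmax, absdx) ^ 1;
}
```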

	Most vector machines (Ardent Titan, Convex, etc.) have a
vector compare instruction that sets or clears bits in any specified
register based upon the compare; further, other operations can be
specified to operate "under mask" of that register.  This facilitates
vectorization of loops containing conditionals:

	DO I = 1,N
		IF ( X(I) .GT. Y(I) ) THEN
			X(I) = Y(I) * Z(I) + K
		ENDIF
	ENDDO

	becomes
		sw	N,<vlength>
		DVLDA	X,0(addr_X)
		DVLDB	Y,0(addr_Y)
		DVCGT   [M],X,Y
		DVLDA	Z,0(addr_Z)     ! could be under mask as well...
		DVMA,mt X,X,Y,[K]
		DVST,mt	X,0(addr_X)

	on an Ardent Titan.
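
	A scalar C model of what the masked sequence computes might
look like the following.  It is only a sketch: masked_update is a
hypothetical name, K is treated as a double scalar, and the
multiply-add is computed for every element with only the store
masked, whereas the Titan code also runs the multiply-add under mask.

```c
#include <stddef.h>

/*
 * Element-by-element model of: compare to build a mask, then
 * multiply-add and store only where the mask bit is set.
 * The variable m stands in for one bit of the mask register [M].
 */
void masked_update(double *x, const double *y, const double *z,
                   double k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int m = x[i] > y[i];          /* DVCGT: mask bit, 0 or 1       */
        double t = y[i] * z[i] + k;   /* DVMA: multiply-add            */
        x[i] = m ? t : x[i];          /* DVST,mt: store under the mask */
    }
}
```

	The loop body no longer contains control flow that depends on
the data, which is the property that lets the whole loop be issued as
vector operations.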

	Ah, yes, Andy brings up the beloved Cydrome machine.  I
introduced the intrinsic SELECT(EXP,A,B) to its Fortran compiler.
*Amazing* speedups were achieved, especially on loops with internal
conditional code, e.g., vector intrinsics.  I was able to get a 17x
improvement in ATAN2.
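
	The post does not spell out the semantics of SELECT, so the
following C sketch assumes it yields A when EXP is true and B
otherwise, with both arms already evaluated.  The names select_f64
and clamp_nonneg are hypothetical, and the code assumes a 64-bit
double.  It shows the general branch-free select idiom, not the
Cydrome implementation.

```c
#include <stdint.h>
#include <string.h>

/* Returns a if cond is nonzero, else b, with no branch on cond. */
double select_f64(int cond, double a, double b)
{
    uint64_t ua, ub, ur, mask;
    double r;

    memcpy(&ua, &a, sizeof ua);   /* reinterpret the bits of a */
    memcpy(&ub, &b, sizeof ub);   /* reinterpret the bits of b */

    /* All ones when cond is true, all zeros when it is false. */
    mask = (uint64_t)0 - (uint64_t)(cond != 0);

    ur = (ua & mask) | (ub & ~mask);
    memcpy(&r, &ur, sizeof r);
    return r;
}

/* Example use: IF (V .LT. 0) V = 0, written as a select. */
double clamp_nonneg(double v)
{
    return select_f64(v < 0.0, 0.0, v);
}
```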

	I find myself today still "thinking" in the predicated VLIW
paradigm.  To be fair, the hardware model the Cydra provided
facilitates the software ideas in the control flow graph papers from
IBM.  Those of you at the most recent ASPLOS saw how the Cydrome
architecture implements software pipelining in hardware; the
predicated operations greatly facilitate the control graph
optimizations as well.  The one incompleteness of the Cydra was
predicated exceptions.
	I.e., mark an instruction as PROMOTED: it has been moved out
of its containing block, so if it generates an exception, just set a
bit in the destination register.  When a non-promoted instruction uses
a value from a register that has the exception bits set, THEN generate
the fault...
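
	The deferred-exception idea can be modelled in C along these
lines.  Everything here is hypothetical: the Reg type, the function
names, and the choice to let the bit propagate through further
promoted operations are assumptions of the sketch, not a description
of any real machine.

```c
#include <limits.h>
#include <stdbool.h>

/* A register value plus an "exception pending" bit. */
typedef struct {
    long value;
    bool poisoned;
} Reg;

/*
 * Promoted divide: instead of faulting, it sets the exception bit
 * in its destination.  A poisoned input poisons the result.
 */
Reg div_promoted(Reg a, Reg b)
{
    Reg r = { 0, a.poisoned || b.poisoned };

    if (!r.poisoned) {
        if (b.value == 0 || (a.value == LONG_MIN && b.value == -1))
            r.poisoned = true;          /* would have trapped */
        else
            r.value = a.value / b.value;
    }
    return r;
}

/*
 * Non-promoted use: this is where the fault is finally raised.
 * The model reports it by returning false.
 */
bool use_nonpromoted(Reg r, long *out)
{
    if (r.poisoned)
        return false;
    *out = r.value;
    return true;
}

/*
 * IF (D .NE. 0) Q = N / D, with the divide hoisted above the test.
 * When D is 0 the poisoned result is never used, so nothing faults.
 */
long guarded_div(long n, long d)
{
    Reg q = div_promoted((Reg){ n, false }, (Reg){ d, false });
    long out = 0;

    if (d != 0)
        (void)use_nonpromoted(q, &out);
    return out;
}
```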

	Ah... musings...

	-mac 
	

--
+-----------+-----------------------------------------------------------------+
|mac@kpc.com| Increasing Software complexity lets us sell Mainframes as       |
|           | personal computers. Carry on, X windows/Postscript/emacs/CASE!! |
+-----------+-----------------------------------------------------------------+