Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!uwm.edu!linac!att!ucbvax!agate!forney.berkeley.edu!jbuck
From: jbuck@forney.berkeley.edu (Joe Buck)
Newsgroups: comp.arch
Subject: Re: More on Linpack pivoting: isamax and instruction set design
Message-ID: <1991Jun14.010226.11981@agate.berkeley.edu>
Date: 14 Jun 91 01:02:26 GMT
References: <396@validgh.com>
Sender: usenet@agate.berkeley.edu (USENET Administrator)
Reply-To: jbuck@forney.berkeley.edu (Joe Buck)
Organization: University of California, Berkeley
Lines: 46

In article <396@validgh.com>, dgh@validgh.com (David G. Hough on validgh) writes:
|> [ what architectural features speed this up? ]
|>
|>         do 30 i = 2,n
|>            if(abs(dx(i)).le.dmax) go to 30
|>            isamax = i
|>            dmax = abs(dx(i))
|>    30   continue

DSP chips are good for things like this.  The following routine
takes 10+3N cycles (60 nsec/cycle for the original C30):

        ldi     dx,ar0          ; ar0 points to the data
        ldi     @n,rc           ; vector length
        subi    1,rc            ; n-1 to get n loops
        ldf     -1.0,r1         ; set max abs value to -1
        rptb    loop            ; start zero overhead loop
;...........................
        absf    *ar0++,r0       ; r0 = absval of dx[i]
        cmpf    r0,r1           ; larger than max?
loop:   ldigt   rc,r2           ; if so, mark its position
;...........................
; rc is decremented once each time -- it's n-1 if the first term is
; the max, n-2 if the second, etc.  So n-rc would be the isamax
; output of a Fortran routine.
        ldi     @n,r0
        subi    rc,r0
; now r0 has isamax and r1 has dmax.  (extra instructions needed
; to do a C call interface).

Several elements contribute to speed: the zero-overhead loop, the
conditional load (ldigt), the absolute value instruction, and
(sorry, purists) the autoincrement addressing mode.

The RS/6000 already has, in many cases, zero-overhead loops.  I have
found, though, that conditional loads are a big win on heavily pipelined
machines where a branch would cause a large pipeline penalty.

--
Joe Buck
jbuck@galileo.berkeley.edu       {uunet,ucbvax}!galileo.berkeley.edu!jbuck
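
For anyone without a C30 manual to hand, the conditional-load idea from the loop above can be roughly sketched in C. This is only an illustration under stated assumptions: the function name isamax_select and the dmax_out parameter are made up for this sketch and are not from the post, and whether a given compiler turns the two ?: selects into conditional moves rather than branches is not guaranteed. The sketch keeps the running maximum updated alongside the index, and it counts upward instead of reading a down-counting repeat register.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper, not from the post: returns the 1-based index of
   the first element with the largest absolute value (0 if n == 0) and
   stores that absolute value through dmax_out when it is non-NULL. */
size_t isamax_select(const float *dx, size_t n, float *dmax_out)
{
    float dmax = -1.0f;   /* same sentinel as "ldf -1.0,r1" */
    size_t imax = 0;

    for (size_t i = 0; i < n; i++) {
        float a = fabsf(dx[i]);      /* plays the role of absf *ar0++,r0 */
        int gt = a > dmax;           /* the compare */
        imax = gt ? i + 1 : imax;    /* select in place of a branch (cf. ldigt) */
        dmax = gt ? a : dmax;        /* keep the running maximum current */
    }
    if (dmax_out)
        *dmax_out = dmax;
    return imax;
}
```

The strict comparison keeps the first of several equal maxima, which matches the .le. test in the quoted Fortran loop.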