Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sun-barr!newstop!exodus!rbbb.Eng.Sun.COM!chased
From: chased@rbbb.Eng.Sun.COM (David Chase)
Newsgroups: comp.arch
Subject: Re: what if the i860 had vector registers... ?
Message-ID: <13990@exodus.Eng.Sun.COM>
Date: 24 May 91 19:39:36 GMT
References: <1991May24.035756.5161@murdoch.acc.Virginia.EDU>
Sender: news@exodus.Eng.Sun.COM
Organization: Sun Microsystems, Mt. View, Ca.
Lines: 55

gl8f@astsun9.astro.Virginia.EDU (Greg Lindahl) writes:
>After reading a tech report on i860 performance (available from
>uvacs.cs.virginia.edu: pub/techreports/ipc...), I was left with one
>question:
...
>So, what if the i860 instead had a small data cache, and pipelined
>loads into vector registers? It is relatively easy to design a
>compiler that can do a good job with vector registers. And you would
>be able to do pipelined loads of data to be re-used. And you wouldn't
>end up starved for registers.

Can you be more explicit about what you mean by vector registers?  I
was about to say that using vector registers was about the same as
calling library routines, but then I remembered the matrix
multiplication example -- calling an "inner product" subroutine does
not get the same performance as simultaneously computing three inner
products, which requires that you know something about the algorithm
surrounding the inner product.
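
To make the inner-product point concrete, here is a rough sketch in
Python (not i860 code). It assumes the three inner products share one
operand vector, which is my reading of the matrix multiplication
example rather than something stated above; the name
three_inner_products and its arguments are hypothetical.

```python
def three_inner_products(a, b0, b1, b2):
    """Compute dot(a, b0), dot(a, b1), dot(a, b2) in a single pass."""
    s0 = s1 = s2 = 0.0
    for k in range(len(a)):
        ak = a[k]  # fetched once, then reused by all three sums
        s0 += ak * b0[k]
        s1 += ak * b1[k]
        s2 += ak * b2[k]
    return s0, s1, s2
```

A library routine that computes one inner product per call would fetch
each a[k] three times to produce the same three results.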

One constraint on elementary row operations on the i860 (implemented
in an elementary way -- i.e., without blocking the algorithm within
which the EROs are used) is that the FP unit is capable of
consuming/producing 64 bits of off-chip (single precision) operands
per cycle.  (Actually, the grouping rules preclude attaining this
rate; the best you can do in that style is about 80% of
speed-of-light.)  I'm using the model:

  offchip[i..n] := offchip[i..n] + factor * onchip[i..n]

(row major, i.e., C/Pascal, array layout.)  In a "naive"
implementation of Gaussian elimination (and QR, too, I believe) the
intent is to eliminate the column "i" from a collection of offchip
rows by repeatedly subtracting a multiple of the onchip row from them.
Each cycle, the i860 can complete one "off := off + F * on" operation,
which means that (on average) 32 bits must come on per cycle and 32
bits must be stored per cycle.  This is the "speed of light" for data
on the i860 -- unless you use a wider bus, or a faster clocking rate,
or something, vector registers won't get the data on and off chip any
faster.
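
A minimal sketch of that model in Python, only to pin down which data
moves; it is not tuned i860 code. The names row_op and
eliminate_column are hypothetical, indexing is 0-based, and there is
no pivoting, so onchip[i] is assumed to be nonzero.

```python
def row_op(offchip, onchip, factor, i):
    """offchip[i..n] := offchip[i..n] + factor * onchip[i..n], in place."""
    for j in range(i, len(offchip)):
        # One multiply-and-add per element: one operand is read from the
        # off-chip row and one result is written back to it.
        offchip[j] = offchip[j] + factor * onchip[j]


def eliminate_column(onchip, offchip_rows, i):
    """Zero column i of every off-chip row using the on-chip row."""
    for row in offchip_rows:
        # Subtracting a multiple of the on-chip row is the same as
        # adding it with a negated factor. Assumes onchip[i] != 0.
        factor = -row[i] / onchip[i]
        row_op(row, onchip, factor, i)
```

Every element of every off-chip row is read once and written once per
elimination step, while the on-chip row is reused for all of them.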

Note -- max pipelined load width is 64 bits, max cached load width is
128 bits.  An N-unrolled loop to do the elementary row op has N
pipelined loads and stores, plus N/4 cached loads, plus one branch
instruction.  In double op mode, you pair this with N mpy-and-adds and
N/4 + 1 fnops.  Thus, in N + N/4 + 1 cycles you perform N mpy-and-adds
at best, attaining a speed of N/(N + N/4 + 1) times C, where C is the
speed-of-light rate of one mpy-and-add per cycle -- for N == 16, this
is .76C.
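
The cycle count above is easy to tabulate for other unroll factors.
This is a sketch of the arithmetic only; the function name
unrolled_loop_efficiency is hypothetical, and N is assumed to be a
positive multiple of 4 so that N/4 is a whole number of cached loads.

```python
from fractions import Fraction


def unrolled_loop_efficiency(n):
    """Fraction of C attained by an n-unrolled loop, per the count above."""
    if n <= 0 or n % 4 != 0:
        raise ValueError("n must be a positive multiple of 4")
    pipelined = n       # pipelined loads and stores
    cached = n // 4     # cached loads
    branch = 1          # loop branch
    cycles = pipelined + cached + branch
    # n multiply-and-adds complete in that many cycles
    return Fraction(n, cycles)
```

For N == 16 this gives 16/21, the .76 figure. As N grows the ratio
approaches 4/5, which agrees with the "about 80%" limit mentioned
earlier.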

Note, too, that something as simple as a quad-word (128 bit) pipelined
load (i.e., double the width of the off-chip bus) solves the
speed-of-light problem without adding vector registers or operations.
This still leaves the compiler working hard, of course.

David Chase
Sun