Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sun-barr!newstop!exodus!rbbb.Eng.Sun.COM!chased
From: chased@rbbb.Eng.Sun.COM (David Chase)
Newsgroups: comp.arch
Subject: Re: what if the i860 had vector registers... ?
Message-ID: <13990@exodus.Eng.Sun.COM>
Date: 24 May 91 19:39:36 GMT
References: <1991May24.035756.5161@murdoch.acc.Virginia.EDU>
Sender: news@exodus.Eng.Sun.COM
Organization: Sun Microsystems, Mt. View, Ca.
Lines: 55

gl8f@astsun9.astro.Virginia.EDU (Greg Lindahl) writes:
>After reading a tech report on i860 performance (available from
>uvacs.cs.virginia.edu: pub/techreports/ipc...), I was left with one
>question: ...

>So, what if the i860 instead had a small data cache, and pipelined
>loads into vector registers? It is relatively easy to design a
>compiler that can do a good job with vector registers. And you would
>be able to do pipelined loads of data to be re-used. And you wouldn't
>end up starved for registers.

Can you be more explicit about what you mean by vector registers? I
was about to say that using vector registers was about the same as
calling library routines, but then I remembered the matrix
multiplication example -- calling an "inner product" subroutine does
not get the same performance as simultaneously computing three inner
products, which requires that you know something about the algorithm
surrounding the inner product.

One constraint on elementary row operations on the i860 (implemented
in an elementary way -- i.e., without blocking the algorithm within
which the EROs are used) is that the FP unit is capable of
consuming/producing 64 bits of off-chip (single precision) operands
per cycle. (Actually, the grouping rules preclude attaining this
rate; best you can do in that style is about 80% of speed-of-light.)
I'm using the model:

    offchip[i..n] := offchip[i..n] + factor * onchip[i..n]

(row major, i.e., C/Pascal, array layout.)

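
For concreteness, the model above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not i860 code: the name `row_op`, the use of plain lists, and 0-based indexing (so `i..n` becomes "index i through the end of the row") are all assumptions made for the sketch.

```python
def row_op(offchip, onchip, factor, i):
    """Elementary row operation, in place:
    offchip[k] := offchip[k] + factor * onchip[k] for k = i .. end of row.
    Each step reads one offchip element, reads one onchip element,
    does one multiply-and-add, and writes one offchip element back.
    """
    if len(offchip) != len(onchip):
        raise ValueError("rows must have the same length")
    for k in range(i, len(offchip)):
        offchip[k] = offchip[k] + factor * onchip[k]
    return offchip


# Eliminate column 0 of `row` using `pivot` (factor = -row[0] / pivot[0]).
pivot = [1.0, 2.0, 3.0]
row = [2.0, 4.0, 6.0]
row_op(row, pivot, -row[0] / pivot[0], 0)
```

Per element, one single-precision value has to come on chip and one has to go back off, which is where the 32-bits-in plus 32-bits-out figure per multiply-and-add comes from.
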
In a "naive" implementation of Gaussian elimination (and QR, too, I
believe) the intent is to eliminate the column "i" from a collection
of offchip rows by repeatedly subtracting a multiple of the onchip row
from them. Each cycle, the i860 can complete one "off := off + F * on"
operation, which means that (on average) 32 bits must come on per
cycle and 32 bits must be stored per cycle. This is the "speed of
light" for data on the i860 -- unless you use a wider bus, or a faster
clocking rate, or something, vector registers won't get the data on
and off chip any faster.

Note -- max pipelined load width is 64 bits, max cached load width is
128 bits. An N-unrolled loop to do the elementary row op has N
pipelined loads and stores, plus N/4 cached loads, plus one branch
instruction. In double op mode, you pair this with N mpy-and-adds and
N/4 + 1 fnops. Thus, in N + N/4 + 1 cycles you perform N mpy-and-adds
at best, attaining a speed of N/(N + N/4 + 1) times C -- for N == 16,
this is .76C.

Note, too, that something as simple as a quad-word (128 bit) pipelined
load (i.e., double the width of the off-chip bus) solves the
speed-of-light problem without adding vector registers or operations.
This still leaves the compiler working hard, of course.

David Chase
Sun
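
A quick check of the unrolled-loop arithmetic, as a Python sketch. The helper name `unrolled_efficiency` is hypothetical, and the sketch assumes N is a multiple of 4 so that N/4 cached loads comes out whole. It also reads "N pipelined loads and stores" as N in total, which fits 64-bit accesses that each move two single-precision elements.

```python
from fractions import Fraction


def unrolled_efficiency(n):
    """Fraction of the one-op-per-cycle rate C reached by an n-unrolled loop.

    Core instruction slots per iteration, following the post:
      n      pipelined loads and stores (total)
      n / 4  cached 128-bit loads
      1      branch
    paired in double op mode with n mpy-and-adds and n/4 + 1 fnops.
    """
    if n <= 0 or n % 4 != 0:
        raise ValueError("n must be a positive multiple of 4")
    cycles = n + n // 4 + 1
    return Fraction(n, cycles)


for n in (4, 8, 16, 32, 64):
    eff = unrolled_efficiency(n)
    print(f"N = {n:2d}: {eff} = {float(eff):.3f} C")
```

For N == 16 this gives 16/21, the .76C quoted above. As N grows the ratio climbs toward 4/5 without reaching it, which is the "about 80% of speed-of-light" ceiling for this style of loop.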