Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!sdd.hp.com!wuarchive!uunet!mcsun!hp4nl!charon!dik
From: dik@cwi.nl (Dik T. Winter)
Newsgroups: comp.arch
Subject: Re: RISC vs. CISC -- SPECmarks
Message-ID: <3485@charon.cwi.nl>
Date: 7 May 91 23:41:39 GMT
References: <819@cadlab.sublink.ORG> <1991May7.061500.7485@marlin.jcu.edu.au> <1991May7.150724.18806@midway.uchicago.edu>
Sender: news@cwi.nl
Organization: CWI, Amsterdam
Lines: 77

In article <1991May7.150724.18806@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
 >                    The pipeline only means that each pipe spits out
 > a float result each cycle (or two, in the case of a mul-add).  Super
 > scalar machines can also produce a result per cycle.
See below.
 > 
 > What I am confused about is why superscalar machines aren't seen as
 > clearly superseding vector architectures.  Like vector architectures,
 > they use instruction overlap to produce a result (or two) per cycle.
As noted, vector hardware is simpler to design than super-scalar.  That is
why we see super-vector.  The 205 could have up to 4 pipes, producing
4 results (8 floating-point ops) each cycle.  The same holds for the NEC
SX-3.  It will take some time before super-scalar can do that.  (And
consider that the clock of the SX-3 runs at 2.6 nsec = 385 MHz, and that it
can be configured as a multi-processor in the future.)

 > The difference is that the compiler or the programmer must
 > arrange things so that the proper overlap is possible, whereas with
 > the vector machines you just issue a single vector instruction, i.e.
 > the particular kind of instruction overlap is hard-wired into the
 > silicon (or GaAs).
Still silicon, and NEC thinks they are able to push silicon still further;
at least, as far as I know, they are not yet thinking about GaAs.

 >                     That would seem to make vector architectures
 > clearly less versatile than superscalar.
True enough.  But there is no reason that a super-vector machine could not
be super-scalar too, which would give it a good boost in performance for
the (still important) scalar part.  (I have seen only one program where,
after full vectorization, more than 80% of the time was still spent in
vector operations.)

 >                                A notable exception is none of 
 > them I have ever seen will automatically do things like strip
 > mining and unroll-and-jam for you.
The Alliant compiler for the i860 does.
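
For those who have not met the two transformations named above, here is a
minimal sketch in C of what each does to a loop.  The function names, the
STRIP value of 64 and the row-major matrix layout are my own assumptions
for illustration; this is not what any particular compiler emits.

```c
#include <stddef.h>

#define STRIP 64  /* assumed vector length; chosen only for illustration */

/* Strip mining: y[i] += a * x[i], cut into chunks of at most STRIP
   elements so that each chunk fits one (hypothetical) vector register. */
void saxpy_strip_mined(size_t n, double a, const double *x, double *y)
{
    for (size_t start = 0; start < n; start += STRIP) {
        size_t len = (n - start < STRIP) ? (n - start) : STRIP;
        /* On a vector machine this inner loop would be a single vector
           load, multiply-add and store of length len. */
        for (size_t i = 0; i < len; i++)
            y[start + i] += a * x[start + i];
    }
}

/* Unroll-and-jam: y = A * x for an m-by-n row-major matrix A.  The outer
   (row) loop is unrolled by 2 and the two inner loops are fused, so each
   x[j] is loaded once and used for two rows. */
void matvec_unroll_jam(size_t m, size_t n, const double *a,
                       const double *x, double *y)
{
    size_t i = 0;
    for (; i + 1 < m; i += 2) {
        double s0 = 0.0, s1 = 0.0;
        for (size_t j = 0; j < n; j++) {
            double xj = x[j];              /* one load, two uses */
            s0 += a[i * n + j] * xj;
            s1 += a[(i + 1) * n + j] * xj;
        }
        y[i] = s0;
        y[i + 1] = s1;
    }
    for (; i < m; i++) {                   /* leftover row when m is odd */
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += a[i * n + j] * x[j];
        y[i] = s;
    }
}
```

Both routines compute the same results as the plain loops; only the order
of the work changes.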
 > 
 > As far as I can see, insofar as current vector computers have
 > some advantages over superscalar, the performance differences have
 > more to do with memory bandwidth than processor architecture.  I'd
 > be happy to hear other comments on this, though.
As noted, memory bandwidth is not the only factor.  But it is of course
an important factor.  I do not know the bus width from memory to CPU
on the SX-3, but on the 205 it is lots of bits (the basic piece of information
going from memory to the CPU was a 'super-word' of 1024 bits).  Other
important factors are memory size (256 64-bit Mword or more) and disk I/O.
And of course: no cache, thank you.

What bothers me is that the super-scalar machines I know (i860 and RS6000)
depart from some basic (RISCy) principles to get their f-p performance.
The i860 operations are sufficiently strange that you need to know
everything about memory access times etc. to get good performance.  If
you do not know that, your performance will be mediocre at best.  I tried
it, I got reasonable performance, but now that I have more specific
knowledge about memory on the machine in question, I know that I ought to
have coded completely differently.  The RS6000 is simpler in some ways (no
visible pipelines as on the i860), but on the other hand more difficult:
you have to know exact timing information for the instructions to get the
pipeline going.  And I do not think this information will remain the same
with future models.

But the biggest problem with super-scalar machines when you want vector
performance is the limited number of registers: 32 fp registers on both
the i860 and the RS6000.  You need to issue your loads in advance and you
need to delay your stores to get performance.  On the i860 it is extremely
difficult to allocate your registers such that you have no interference.
On the RS6000 register renaming helps a bit (39 rather than 32 registers),
but there too full-speed loops require extremely careful allocation (and I
have the suspicion that register renaming only makes it less visible).
Compare that to the Cray (8 vector registers of 64 elements) and the SX-3
(for the SX-2 it was 32 vector registers of 256 elements, I expect about
the same on the SX-3).  I already feel a bit cramped on the Cray.
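
To make the register pressure concrete, here is a rough sketch in C of a
loop with loads issued one iteration ahead of the store.  The function name
and the one-iteration lookahead are my own assumptions for illustration.  A
C compiler is free to reschedule this, so it only shows the shape of the
code one ends up writing by hand in assembler.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i], with the loads for iteration i+1 issued
   before the store for iteration i.  Each step of lookahead keeps two
   more values (one x, one y) live in registers. */
void axpy_load_ahead(size_t n, double a, const double *x, double *y)
{
    if (n == 0)
        return;

    double x_next = x[0];              /* prologue: first loads */
    double y_next = y[0];

    for (size_t i = 0; i + 1 < n; i++) {
        double x_cur = x_next;
        double y_cur = y_next;
        x_next = x[i + 1];             /* loads for the next iteration */
        y_next = y[i + 1];
        y[i] = a * x_cur + y_cur;      /* delayed store for this one */
    }

    y[n - 1] = a * x_next + y_next;    /* epilogue: last store */
}
```

With a lookahead of one iteration this loop already holds four operand
values at a time.  Covering a longer memory latency means looking further
ahead, and the number of live values grows with it, which is where 32
registers start to run out.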

But rest assured.  The results will be more correct on your garden variety
micro.  F-p precision on those supers is nothing to write home about.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl