Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!think.com!sdd.hp.com!wuarchive!uunet!mcsun!hp4nl!charon!dik
From: dik@cwi.nl (Dik T. Winter)
Newsgroups: comp.arch
Subject: Re: RISC vs. CISC -- SPECmarks
Message-ID: <3485@charon.cwi.nl>
Date: 7 May 91 23:41:39 GMT
References: <819@cadlab.sublink.ORG> <1991May7.061500.7485@marlin.jcu.edu.au> <1991May7.150724.18806@midway.uchicago.edu>
Sender: news@cwi.nl
Organization: CWI, Amsterdam
Lines: 77

In article <1991May7.150724.18806@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
 > The pipeline only means that each pipe spits out
 > a float result each cycle (or two, in the case of a mul-add). Super
 > scalar machines can also produce a result per cycle.
See below.
 >
 > What I am confused about is why superscalar machines aren't seen as
 > clearly superseding vector architectures. Like vector architectures,
 > they use instruction overlap to produce a result (or two) per cycle.
As noted, vector hardware is simpler to design than super-scalar. That
is why we see super-vector machines. The 205 could have up to 4 pipes,
producing 4 results (8 floating-point ops) each cycle. The same holds for
the NEC SX-3. It will take some time before super-scalar machines can do
that. (And consider that the clock of the SX-3 runs at 2.6 nsec = 385 MHz,
and that it can be configured as a multiprocessor in the future.)

 > The difference is that the compiler or the programmer must
 > arrange things so that the proper overlap is possible, whereas with
 > the vector machines you just issue a single vector instruction, i.e.
 > the particular kind of instruction overlap is hard-wired into the
 > silicon (or GaAs).
It is still silicon, and NEC thinks they can push silicon further still;
as far as I know, they are not yet thinking about GaAs.

 > That would seem to make vector architectures
 > clearly less versatile than superscalar.
True enough.
But there is no reason that a super-vector machine could not be
super-scalar too, which would give it a good boost in performance for
the (still important) scalar part. (I have seen only one program where,
after full vectorization, more than 80% of the time was still spent in
vector operations.)

 > A notable exception is none of
 > them I have ever seen will automatically do things like strip
 > mining and unroll-and-jam for you.
The Alliant compiler for the i860 does.

 >
 > As far as I can see, insofar as current vector computers have
 > some advantages over superscalar, the performance differences have
 > more to do with memory bandwidth than processor architecture. I'd
 > be happy to hear other comments on this, though.
As noted, memory bandwidth is not the only factor, but it is of course
an important one. I do not know the bus width from memory to CPU on the
SX-3, but on the 205 it is lots of bits (the basic piece of information
going from memory to the CPU was a 'super-word' of 1024 bits). Other
important factors are memory size (256 64-bit Mword or more) and disk
I/O. And of course: no cache, thank you.

What bothers me is that the super-scalar machines I know (i860 and
RS6000) move away from some basic (RISCy) principles to get their f-p
performance. The i860 operations are sufficiently strange that you need
to know everything about memory access times etc. to get good
performance. If you do not know that, your performance will be mediocre
at best. I tried it and got reasonable performance, but now that I have
more specific knowledge about the memory on the machine in question, I
know that I ought to have coded it completely differently. The RS6000 is
simpler in some ways (no visible pipelines as on the i860), but more
difficult in others: you have to know exact timing information for the
instructions to get the pipeline going. And I do not think this
information will remain the same with future models.

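For readers who have not met the term, the strip mining mentioned in the
quoted text can be sketched in C roughly as follows. This is only an
illustration under stated assumptions, not what any particular compiler
emits: the function name saxpy_strip_mined and the constant STRIP are
hypothetical, and the strip length of 64 is chosen merely to match a
64-element vector register.

```c
#include <stddef.h>

/* Assumed strip length: one strip fills a 64-element vector register. */
#define STRIP 64

/*
 * Hypothetical example: y[i] += a * x[i] for i in [0, n),
 * with the long loop cut into strips of at most STRIP elements.
 */
void saxpy_strip_mined(size_t n, double a, const double *x, double *y)
{
    for (size_t start = 0; start < n; start += STRIP) {
        /* The last strip may be shorter than STRIP. */
        size_t len = (n - start < STRIP) ? (n - start) : STRIP;

        /* Inner loop: exactly one register-sized strip of work. */
        for (size_t i = 0; i < len; i++)
            y[start + i] += a * x[start + i];
    }
}
```

The outer loop walks over strips and the inner loop never exceeds the
register length, so each strip could in principle be handled by one
sequence of vector instructions. Unroll-and-jam is a separate
transformation and is not shown here.
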
But the biggest problem in getting vector performance out of
super-scalar machines is the limited number of registers: 32 f-p
registers on both the i860 and the RS6000. You need to issue your loads
in advance and delay your stores to get performance. On the i860 it is
extremely difficult to allocate your registers so that there is no
interference. On the RS6000 register renaming helps a bit (39 rather
than 32 registers), but there too full-speed loops require extremely
careful allocation (and I have the suspicion that register renaming only
makes it less visible). Compare that to the Cray (8 vector registers of
64 elements) and the SX-3 (for the SX-2 it was 32 vector registers of
256 elements; I expect about the same on the SX-3). I already feel a bit
cramped on the Cray.

But rest assured. The results will be more correct on your garden
variety micro. F-p precision on those supers is nothing to write home
about.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl