Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sample.eng.ohio-state.edu!purdue!mentor.cc.purdue.edu!pop.stat.purdue.edu!hrubin
From: hrubin@pop.stat.purdue.edu (Herman Rubin)
Newsgroups: comp.arch
Subject: Re: RISC vs. CISC -- SPECmarks
Summary: Vector instructions are not just for FP
Message-ID: <11812@mentor.cc.purdue.edu>
Date: 3 May 91 13:33:20 GMT
References: <TH_A6-F@xds13.ferranti.com> <11412@mentor.cc.purdue.edu> <MCCALPIN.91May2095930@pereland.cms.udel.edu>
Sender: news@mentor.cc.purdue.edu
Lines: 40

In article <MCCALPIN.91May2095930@pereland.cms.udel.edu>, mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
> >On 2 May 91 01:54:10 GMT, lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) said:
> 
> Hugh> PROPHECY: One of these days, a single-chip microprocessor will
> Hugh> have vector instructions, and then the advantages and
> Hugh> disadvantages of various architectural decisions will be
> Hugh> discovered all over again.
> 
> I don't see much benefit to explicit vector instructions compared to
> tight loops with zero cycle branches (like the RS/6000).  They sure
> can eat up a lot of silicon space, though....
> 
> The big problem is that the memory bandwidth required for vector FP is
> expensive and is not likely to contribute substantially to the non-FP
> performance.  Without adequate memory bandwidth, there is not really
> any need for vector instructions, since the cpu is idle (waiting for
> cache refills) for plenty of time to do loop control....

It seems there are more operations than in your philosophy, John.

Several years ago, I did a large editing job on a file of physical random
numbers (the source file had some undesirable fixed zeros) on the CYBER 205.
Now I doubt that the manufacturers had this type of operation in mind.  The
process itself was mainly done in 7 sets of vector instructions, segmented
only because of length, each set doing up to 2 reads and one write per pipe
per half cycle.  Thus the time, almost all of it vector time, was roughly
3.5 cycles per word of output, divided by the number of pipes.
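
As a rough check on that arithmetic, the following is a minimal Python sketch.
It assumes each of the 7 instruction sets costs exactly half a cycle per
element per pipe and ignores startup overhead; the function name and the pipe
counts tried are illustrative choices, not figures from the job itself.

```python
from fractions import Fraction

def vector_cycles_per_word(instruction_sets=7,
                           cycles_per_set=Fraction(1, 2),
                           pipes=1):
    """Cycles per output word when every instruction set streams one
    element per pipe per half cycle (startup cost ignored: assumption)."""
    return instruction_sets * cycles_per_set / pipes

# Pipe counts here are illustrative values only.
for pipes in (1, 2, 4):
    print(pipes, "pipe(s):", float(vector_cycles_per_word(pipes=pipes)),
          "cycles per word")
```

With one pipe this gives 3.5 cycles per word, and the figure scales down
in proportion to the number of pipes.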

Now not all of the loads/stores would have been needed on a machine like
the RS/6000.  Assuming that there were at least 6, and preferably 7, large
pages of cache available, the process could be done with 5 reads, 2 writes,
6 operations, and a nasty storage problem, which would have added about 2
operations per item.  The loop control operations would, of course, have to
be added on top of that.  Even though the vector machine did roughly 12 loads
and 7 stores per item, its time was certainly much less than scalar hardware
of comparable speed would have needed.
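
The per-item counts above can be tallied in a short Python sketch.  It only
adds up the figures quoted in the text; the loop control cost is left as a
parameter defaulting to 0 because no number is given for it, and the function
names are hypothetical.

```python
def scalar_ops_per_item(reads=5, writes=2, ops=6, storage_fixup=2,
                        loop_control=0):
    """Operations per item on the scalar machine.  loop_control stays 0
    here because the text gives no figure for it (assumption)."""
    return reads + writes + ops + storage_fixup + loop_control

def memory_refs(loads, stores):
    """Total memory references per item."""
    return loads + stores

scalar_total = scalar_ops_per_item()   # 5 + 2 + 6 + 2
scalar_mem = memory_refs(5, 2)         # scalar reads and writes
vector_mem = memory_refs(12, 7)        # vector loads and stores

print("scalar operations per item, before loop control:", scalar_total)
print("scalar memory references per item:", scalar_mem)
print("vector memory references per item:", vector_mem)
```

The vector version makes well over twice as many memory references per item,
which is the point of the comparison: the extra traffic did not prevent it
from being faster.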
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)