Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sample.eng.ohio-state.edu!purdue!mentor.cc.purdue.edu!pop.stat.purdue.edu!hrubin
From: hrubin@pop.stat.purdue.edu (Herman Rubin)
Newsgroups: comp.arch
Subject: Re: RISC vs. CISC -- SPECmarks
Summary: Vector instructions are not just for FP
Message-ID: <11812@mentor.cc.purdue.edu>
Date: 3 May 91 13:33:20 GMT
References: <11412@mentor.cc.purdue.edu>
Sender: news@mentor.cc.purdue.edu
Lines: 40

In article , mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
> >On 2 May 91 01:54:10 GMT, lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) said:
>
> Hugh> PROPHECY: One of these days, a single-chip microprocessor will
> Hugh> have vector instructions, and then the advantages and
> Hugh> disadvantages of various architectural decisions will be
> Hugh> discovered all over again.
>
> I don't see much benefit to explicit vector instructions compared to
> tight loops with zero cycle branches (like the RS/6000).  They sure
> can eat up a lot of silicon space, though....
>
> The big problem is that the memory bandwidth required for vector FP is
> expensive and is not likely to contribute substantially to the non-FP
> performance.  Without adequate memory bandwidth, there is not really
> any need for vector instructions, since the cpu is idle (waiting for
> cache refills) for plenty of time to do loop control....

It seems there are more operations than in your philosophy, John.

Several years ago, I did a large editing job on a file of physical
random numbers (the source file had some undesirable fixed zeros) on
the CYBER 205.  Now I doubt that the manufacturers had this type of
operation in mind.  The process itself was mainly done in 7 sets of
vector instructions, segmented only because of length, each doing up
to 2 reads and one write per pipe per half cycle.  Thus the time,
almost all vector time, was roughly 3.5 cycles, divided by the number
of pipes, per word output.

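The 3.5-cycle figure above follows from 7 vector passes at half a cycle per result per pipe. A minimal sketch of that arithmetic, in Python for convenience; the function names and the pipe counts in the loop are illustrative assumptions, not details of the original job:

```python
def vector_cycles_per_word(passes: int, cycles_per_result: float, pipes: int) -> float:
    """Cycles per output word when every word goes through `passes`
    vector instructions, each producing one result per pipe every
    `cycles_per_result` cycles, with the work spread over `pipes` pipes."""
    if pipes < 1:
        raise ValueError("need at least one pipe")
    return passes * cycles_per_result / pipes


def max_memory_ops_per_word(passes: int, reads: int = 2, writes: int = 1) -> int:
    """Upper bound on memory operations per word: each pass does
    *up to* `reads` reads and `writes` writes, so this is a ceiling."""
    return passes * (reads + writes)


if __name__ == "__main__":
    # Pipe counts are assumed for illustration; the post does not say
    # how many pipes the machine in question had.
    for pipes in (1, 2, 4):
        cycles = vector_cycles_per_word(7, 0.5, pipes)
        print(f"{pipes} pipe(s): {cycles} cycles per word")
    print("at most", max_memory_ops_per_word(7), "memory ops per word")
```

Startup overhead for each vector instruction is ignored here, which matches the "roughly" in the estimate only when the vectors are long.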
Now not all of the loads/stores would have been needed on a machine
like the RS/6000.  Assuming that there were at least 6, and preferably
7, large pages of cache available, the process could be done with 5
reads, 2 writes, 6 operations, and a nasty storage problem, which would
have added about 2 operations per item.  Of course, the loop control
operations would have to be added to that.  Even though the vector
machine did roughly 12 loads and 7 stores per item, the time was
certainly much less than it would have been on a scalar machine built
from hardware of comparable speed.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)