Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uflorida!novavax!midas!mkraiesk
From: mkraiesk@midas.UUCP (Mark Kraieski)
Newsgroups: comp.arch
Subject: Re: Vectorizing division in a do loop
Message-ID: <3745@midas.UUCP>
Date: 25 Aug 89 15:30:27 GMT
References: <112400003@uxa.cso.uiuc.edu>
Organization: Gould CSD, Fort Lauderdale, FL
Lines: 67

In article <112400003@uxa.cso.uiuc.edu>, gsg0384@uxa.cso.uiuc.edu says:
> Nf-ID: #N:uxa.cso.uiuc.edu:112400003:000:698
> Nf-From: uxa.cso.uiuc.edu!gsg0384 Aug 16 21:38:00 1989
>
> Hi,
>
> I've heard that vector machines are more expensive than multi-cpu parallel
> machines. I've got two questions about vector machines.

I was involved in the compiler work for Gould's NP1 vector computer and
would like to shed some light on these questions.

> 1. For compiler design, I think vector machine architecture is easier.
>    Is this true?

Note that not all code can be vectorized; in fact, vectorization is just a
subset of all parallel optimizations. A vector compiler must handle all the
normal scalar operations PLUS the vector operations. Also, unless the
vectorization is syntax driven, a vectorizer (such as those from PSR and
KAI) must be used to determine when it is safe to vectorize.

> 2. Our machine is Ardent Titan. Each cpu has 64-register length vector
>    registers. The problem is that this machine does not vectorize do loops
>    with division. How much harder is to implement division than the other
>    three operations, + - x? Is this a hardware limitation?

I am not familiar with Ardent's box, but of the 4 operations, division
takes the most time. Our original architecture used a reciprocal/multiply
instead of divide since we could chain the operations and do it faster.
My guess is that the gates required to do a fast divide were not available
on the Titan so they just punted. But note that a compiler could vectorize
the portion of the expression that doesn't do division for a potential
speed up.
For example, given the array operation

	A = (B+C+D) / E

we can make 2 loops, one vector and one scalar:

	A = B+C+D
	A = A/E

Depending on the vector loop overhead and the number of elements (and size
of cache if we want to get technical), the partially vectorized loop can
outperform an unvectorized version. Note that vector operations are not
faster than scalar - a multiply takes the same time in either case. But for
scalar code we have to execute multiple instructions to set us up for the
next operation, while vector code has just one instruction fetch and no
branches (if the array is less than the vector length).

> I'd rather have a vector machine than a multi-cpu parallel machine for
> my application. I just want more people in computer industries to pay
> more attention to vector machines.
>
> Hugh

I'd rather have a multi-cpu system where each cpu has vector capabilities,
so my fine-grain parallelism is done within a node but other
non-vectorizable code could still benefit from multiple processors.

                                | "Far better it is to dare mighty things, to
Mark E. Kraieski                | win glorious triumphs, even though checkered
Encore Computer Corp., MS#404   | by failure, than to take rank with those
Computer Performance Evaluation | poor spirits who neither enjoy much nor
6901 West Sunrise Blvd.         | suffer much, because they live in the gray
Ft. Lauderdale, FL 33313-4499   | twilight that knows not victory or defeat."
(305) 587-2900                  |                     -- Theodore Roosevelt
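
A minimal sketch in C of the loop split described in the post, for the
array operation A = (B+C+D) / E. It is only an illustration under stated
assumptions: the function names (divide_fused, divide_split,
divide_recip_multiply) are hypothetical, the arrays are assumed not to
overlap, and nothing here is taken from the NP1 or Titan compilers. The
third function sketches the reciprocal/multiply idea; in C the reciprocal is
still written as a divide, standing in for a hardware reciprocal
instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Unsplit form: one loop containing the divide. On a machine with no
   vector divide, the whole loop would stay scalar. */
void divide_fused(double *a, const double *b, const double *c,
                  const double *d, const double *e, size_t n)
{
    assert(n == 0 || (a && b && c && d && e));
    for (size_t i = 0; i < n; i++)
        a[i] = (b[i] + c[i] + d[i]) / e[i];
}

/* Split form: A = B+C+D, then A = A/E. */
void divide_split(double *a, const double *b, const double *c,
                  const double *d, const double *e, size_t n)
{
    assert(n == 0 || (a && b && c && d && e));

    /* Loop 1: additions only, so it is a candidate for vector adds. */
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i] + d[i];

    /* Loop 2: the divide, left as ordinary scalar code. */
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] / e[i];
}

/* Reciprocal/multiply variant (assumption: 1.0 / e[i] stands in for a
   hardware reciprocal instruction, followed by a multiply). */
void divide_recip_multiply(double *a, const double *b, const double *c,
                           const double *d, const double *e, size_t n)
{
    assert(n == 0 || (a && b && c && d && e));
    for (size_t i = 0; i < n; i++)
        a[i] = (b[i] + c[i] + d[i]) * (1.0 / e[i]);
}
```

All three compute the same quantity, but the split form passes over A twice,
which is where the loop overhead and cache effects mentioned in the post
come in. The reciprocal/multiply form rounds twice, so in general it can
differ from a true divide in the last bit.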