Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!ptimtc!nntp-server.caltech.edu!iago.caltech.edu!rmarq
From: rmarq@iago.caltech.edu (Marquardt, Ron R.)
Newsgroups: comp.arch
Subject: Re: Vector vs Cache/Superscalar
Message-ID: <1991May4.182729.20523@nntp-server.caltech.edu>
Date: 4 May 91 17:27:45 GMT
References: <1991May4.031835.7979@midway.uchicago.edu> <11875@mentor.cc.purdue.edu>
Sender: news@nntp-server.caltech.edu
Reply-To: rmarq@iago.caltech.edu
Organization: California Institute of Technology
Lines: 53
News-Software: VAX/VMS VNEWS 1.3-4x

In article <11875@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes...

[regarding previous comments contrasting the RS/6000 superscalar
architecture with the Stardent vector architecture]

>> (b) Tridiagonal solving. Comes up in lots of codes, and it is
>> a real vector-breaker. In fact, vector machines choke on all
>> sorts of recursion, whereas the superscalars just love them.
>> On the RS/6000, the tridiag code basically vanished, whereas on
>> the vector Stardent, it was a bottleneck.
>
>There are tricky ways of doing this efficiently on vector machines,
>especially flexible ones. This uses partitioning.
>

Our experience with these machines (we have two IBM RS/6000's, a 530 and a 540, and three of what are now Stardent 3000's) has shown that for matrix problems exceeding 200x200, tridiagonal or otherwise, the Stardents can be considerably faster than the RS/6000's. The relative performance difference increases with the size of the problem and can reach as great as a factor of 10 (for naively written code).

This problem can be traced to the small TLB (mapping 512K) on the IBM machines. Back when Stardent was Ardent, they had similar problems, which were more easily fixed since their TLB is external. The Stardent ETLBs now can map >128 Mb (and I believe it goes as high as the maximum allowed memory in the system, which escapes me).

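
To make the TLB argument a bit more concrete, here is a rough sketch in C. It is only an illustration under assumptions that are not in the measurements above: a 4 KB page size, 8-byte double-precision elements, and C's row-major layout (in Fortran the strided operand would be the other one). The helper names (pages_touched, matmul_naive, matmul_blocked) and the block size parameter are made up for this example, and the blocked loop is just one common way of recoding a multiply, not necessarily the recoding used for any timing quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for illustration only. */
#define PAGE_BYTES      4096u            /* assumed page size */
#define TLB_REACH_BYTES (512u * 1024u)   /* the 512K figure quoted above */

/* Count distinct pages hit by `count` reads spaced `stride_bytes` apart,
   starting at a page-aligned base. Addresses only increase, so comparing
   against the previous page number is enough. */
size_t pages_touched(size_t count, size_t stride_bytes)
{
    size_t pages = 0;
    size_t last_page = (size_t)-1;
    for (size_t i = 0; i < count; i++) {
        size_t page = (i * stride_bytes) / PAGE_BYTES;
        if (page != last_page) {
            pages++;
            last_page = page;
        }
    }
    return pages;
}

/* Traditional triple loop, C = A * B, all n x n and row-major.
   The inner loop walks B down a column, i.e. with a stride of n doubles. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Blocked (tiled) variant: works on bs x bs sub-blocks so that each pass
   touches only a few rows of B, and walks those rows with unit stride. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(bs > 0);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t imax = (ii + bs < n) ? ii + bs : n;
                size_t kmax = (kk + bs < n) ? kk + bs : n;
                size_t jmax = (jj + bs < n) ? jj + bs : n;
                for (size_t i = ii; i < imax; i++)
                    for (size_t k = kk; k < kmax; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < jmax; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

Under these assumptions, TLB_REACH_BYTES / PAGE_BYTES gives 128 page entries. For a 701x701 matrix of doubles the row stride is 5608 bytes, which is larger than a page, so pages_touched(701, 701 * sizeof(double)) reports 701 pages for a single column walk, against 2 pages for the same number of unit-stride reads. Every pass of the naive inner loop therefore needs far more translations than a 512K TLB can hold, while the blocked loop keeps each pass within a handful of pages.
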
For an admittedly simple benchmark, a traditionally coded matrix multiply, the 530 is actually SLOWER than a Sparc 1, and the 540 only marginally faster, for 701x701 [530: 568 sec., Sparc 1: 525 sec., and 540: 470 sec.]. Before people complain too loudly, it is very easy to recode a matrix multiply for better efficiency and get the 540 time down to 49 sec., BUT it is not necessarily true that for all matrix routines such a recoding is simple, or even possible.

Better compiler technology would certainly help, and IBM's latest beta compiler for the RS/6000's (exact version ID escapes me) promises a great deal of improvement. For arbitrarily complex and general problems, however, a larger TLB would certainly benefit the RS/6000 family.

By the way, although they are not superscalar, the latest HP workstations seem to be avoiding the TLB problems of the IBMs. Included in the TLB scheme for the Snakes is TLB mapping for four large blocks of memory, each 16 Mb in size. From a user's point of view, 64 Mb is far better than the 512 Kb mapped by the RS/6000's TLB.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Ron Marquardt                    E-mail: rmarq@iago.caltech.edu
Solid State Device Physics Group
California Institute of Technology