Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!ptimtc!nntp-server.caltech.edu!iago.caltech.edu!rmarq
From: rmarq@iago.caltech.edu (Marquardt, Ron R.)
Newsgroups: comp.arch
Subject: Re: Vector vs Cache/Superscalar
Message-ID: <1991May4.182729.20523@nntp-server.caltech.edu>
Date: 4 May 91 17:27:45 GMT
References: <1991May4.031835.7979@midway.uchicago.edu> <11875@mentor.cc.purdue.edu>
Sender: news@nntp-server.caltech.edu
Reply-To: rmarq@iago.caltech.edu
Organization: California Institute of Technology
Lines: 53
News-Software: VAX/VMS VNEWS 1.3-4x

In article <11875@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu
(Herman Rubin) writes...

[regarding previous comments contrasting the RS/6000 superscalar
 architecture with the Stardent vector architecture]

>> (b) Tridiagonal solving.  Comes up in lots of codes, and it is
>> a real vector-breaker.  In fact, vector machines choke on all
>> sorts of recursion, whereas the superscalars just love them.
>> On the RS/6000, the tridiag code basically vanished, whereas on
>> the vector Stardent, it was a bottleneck.
> 
>There are tricky ways of doing this efficiently on vector machines,
>especially flexible ones.  This uses partitioning.
> 

Our experience with these machines (we have two IBM RS/6000s, a
530 and a 540, and three of what are now Stardent 3000s) has
shown that for matrix problems larger than 200x200, tridiagonal
or otherwise, the Stardents can be considerably faster than
the RS/6000s.  The performance gap grows with the size of the
problem and can reach a factor of 10 (for naively written code).
The RS/6000 slowdown can be traced to the small TLB (mapping
512 KB) on the IBM machines.  Back when Stardent
was Ardent, they had similar problems, which were more easily
fixed since their TLB is external.  The Stardent ETLBs can now
map >128 MB (and I believe the limit goes as high as the maximum
memory allowed in the system, a figure which escapes me).
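
A back-of-the-envelope sketch of the arithmetic behind the TLB
argument above.  The 512K reach is the figure quoted in this post;
the 8-byte element size and 4 KB page size are assumptions made only
for illustration, and the helper names are hypothetical.

```python
# Assumptions (not stated in the post): 8-byte reals, 4 KB pages.
ELEM_BYTES = 8
PAGE_BYTES = 4096
TLB_REACH_BYTES = 512 * 1024  # the 512K mapping quoted in the post


def matrix_bytes(n, elem_bytes=ELEM_BYTES):
    """Storage for one dense n x n matrix, in bytes."""
    return n * n * elem_bytes


def pages_needed(nbytes, page_bytes=PAGE_BYTES):
    """Pages spanned by nbytes of contiguous, page-aligned storage."""
    return -(-nbytes // page_bytes)  # ceiling division


def fits_in_tlb(n, n_matrices=1, reach=TLB_REACH_BYTES):
    """True if n_matrices dense n x n matrices fit within the TLB reach."""
    return n_matrices * matrix_bytes(n) <= reach


if __name__ == "__main__":
    for n in (200, 256, 701):
        size = matrix_bytes(n)
        print(n, size, pages_needed(size), fits_in_tlb(n), fits_in_tlb(n, 3))
```

Under these assumptions a single 200x200 matrix still fits inside the
mapped 512K but the three matrices of a product do not, which is at
least roughly in line with the size at which the slowdown shows up.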

For an admittedly simple benchmark, a traditionally coded matrix
multiply, the 530 is actually SLOWER than a Sparc 1, and the 540
only marginally faster, for 701x701 [530: 568 sec., Sparc 1: 525 sec.,
and 540: 470 sec.].  Before people complain too loudly, it is very
easy to recode a matrix multiply for better efficiency and get the 540
time down to 49 sec., BUT it is not necessarily true that for all
matrix routines such a recoding is simple, or even possible.  Better
compiler technology would certainly help, and IBM's latest beta compiler
for the RS/6000s (exact version ID escapes me) promises a great
deal of improvement.  For arbitrarily complex and general problems,
however, a larger TLB would certainly benefit the RS/6000 family.
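
The post does not show the recoded multiply, so the following is only
a sketch of one common kind of recoding: the textbook triple loop next
to a blocked (tiled) version that works on sub-blocks small enough to
keep the touched pages within a small TLB.  The function names and the
block size of 32 are illustrative assumptions, and Python is used here
purely to show the loop structure, not the timings.

```python
def matmul_naive(a, b):
    """Textbook triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                # b is walked down a column: each step jumps a full row,
                # so for large matrices each step can land on a new page.
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, block=32):
    """Same product, computed tile by tile (block size is an assumption)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                # Work confined to one tile each of a, b and c.
                for i in range(ii, min(ii + block, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]
                        row_b = b[k]
                        # Innermost loop runs along a row: unit stride.
                        for j in range(jj, min(jj + block, p)):
                            row_c[j] += aik * row_b[j]
    return c
```

Both routines compute the same product; only the order in which memory
is touched differs, which is the part a small TLB is sensitive to.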

By the way, although they are not superscalar, the latest HP 
workstations seem to be avoiding the TLB problems of the IBMs.
Included in the TLB scheme for the Snakes is TLB mapping for
four large blocks of memory, each 16 MB in size.  From a user's point
of view, 64 MB is far better than the 512 KB mapped by the RS/6000's
TLB.


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Ron Marquardt				E-mail: rmarq@iago.caltech.edu
Solid State Device Physics Group
California Institute of Technology