Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site ames.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!bellcore!decvax!hplabs!ames!eugene
From: eugene@ames.UUCP (Eugene Miya)
Newsgroups: net.arch
Subject: Re: CRAY Question (also MIP/Mhz)
Message-ID: <1496@ames.UUCP>
Date: Thu, 1-May-86 12:08:53 EDT
Article-I.D.: ames.1496
Posted: Thu May  1 12:08:53 1986
Date-Received: Sat, 3-May-86 19:50:03 EDT
References: <905@harvard.UUCP>
Distribution: net
Organization: NASA-Ames Research Center, Mtn. View, CA
Lines: 47

> In one of Jack Dongarra's articles on LINPACK performance
> (Computer Architecture News, vol 11, no 5 (Dec 83)), he says that a
> CRAY 1-M executes the benchmark faster than a CRAY 1-S because the
> 1-M has slower memory.  I fail to see how it
> is even theoretically possible for slower memory to mean higher performance,
> and would appreciate someone who knows about CRAY's explaining this to me
> (Dongarra talks about a "missed chain-slot").
> 						Ehud Reiter

George Spix from Cray Research should be able to answer this (gas@lanl), but
I've not seen him lately, so I'll give it a shot.

First, a point of clarification.  Technically, there are no Cray-1Ms anymore.
We had one here, and it was redesignated a Cray X-MP/1.  This machine
is moving to UC Berkeley next month.  Second, you should
realize that X-MPs represent cleaned-up Cray-1's (not 1S).  They have
a faster cycle time (9.5 ns versus 12.5 ns), they have vector chaining
(I assume you know what chaining is; otherwise check an architecture
book, since your message did not sound like a specific request for a
description of chaining), and they have three paths between memory and
CPU rather than one.
The X-MP/1 has slower MOS memory rather than the bipolar memory which
comes with the 'top of the line' (read: current fastest) model Xs.  (The 2
is MOS, and it also has only one data path to a given quadrant of memory.)
Third, machines like these are not like micros and minis, in that
you really tune them for the slowness of memory (any memory): delay
loops are unacceptable.  You count clock periods and build architectural
features to compensate for them (e.g. chaining).  It takes four clocks
to get a word from memory into a CPU (assuming no bank contention).
I have also been told by a Cray site engineer here that the newer
MOS memories have a slightly different internal organization.
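
To make the clock counting concrete, here is a rough toy model in
Python.  It is only a sketch under stated assumptions, not Cray's
actual issue rules: it assumes a fixed "chain slot", meaning a dependent
vector instruction can chain only if it is ready to issue no later than
the clock at which the producer's first element arrives, and otherwise
must wait for the whole producing operation to finish.  The function
names, parameters, and the sample clock counts are all hypothetical;
only the 9.5 ns and 12.5 ns cycle times come from the text above.

```python
def cycle_ns_to_mhz(cycle_ns):
    """Convert a clock period in nanoseconds to a clock rate in MHz."""
    return 1000.0 / cycle_ns


def pair_clocks(n, first_result_at, consumer_issue_at, consumer_startup):
    """Clocks to finish a producer/consumer pair of length-n vector ops.

    Assumed (hypothetical) fixed chain-slot rule:
      - if the consumer can issue by the clock at which the producer's
        first element arrives, the two operations overlap (chain);
      - otherwise the slot is missed and the consumer starts only after
        the producer has delivered all n elements.
    """
    if consumer_issue_at <= first_result_at:
        # Chained: consumer streams elements as the producer delivers them.
        return first_result_at + consumer_startup + n
    # Missed chain slot: the two operations run back to back.
    return first_result_at + n + consumer_startup + n


if __name__ == "__main__":
    print(cycle_ns_to_mhz(12.5), cycle_ns_to_mhz(9.5))
    # Illustrative numbers only: consumer cannot issue before clock 10.
    print("fast memory:", pair_clocks(64, 8, 10, 6))
    print("slow memory:", pair_clocks(64, 11, 10, 6))
```

With these made-up numbers the "fast memory" case delivers its first
element at clock 8, before the consumer can issue, so the slot is missed
and the pair takes 142 clocks; the "slow memory" case delivers at clock
11, the consumer chains, and the pair takes 81.  Under such a rule a
slower memory can come out ahead.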

This is all why I pointed out (via mail) to the fellow at NC that the
MIPS/MHz thing has a von Neumann bottleneck problem.  Also, I
have measured these effects, and I posted the results
to the Net over a year ago, but in cleaning my old author_copy file
I decided to remove it (I included a graph in that posting).
As an aside, the X-MP also has a nice hardware box known as the Hardware
Performance Monitor, which does instruction counts unobtrusively
(another reason why a VAX is a poor machine to do performance
research on).

From the Rock of Ages Home for Retired Hackers:
--eugene miya
  NASA Ames Research Center
  com'on do you trust Reply commands with all these different mailers?
  {hplabs,ihnp4,dual,hao,decwrl,tektronix,allegra}!ames!aurora!eugene
  eugene@ames-aurora.ARPA