Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!rutgers!apple!bloom-beacon!usc!elroy!ames!ames.arc.nasa.gov!lamaster
From: lamaster@ames.arc.nasa.gov (Hugh LaMaster)
Newsgroups: comp.arch
Subject: Re: DMA on RISC-based systems
Message-ID: <26855@ames.arc.nasa.gov>
Date: 12 Jun 89 14:18:24 GMT
References: <26636@ames.arc.nasa.gov> <8327@killer.DALLAS.TX.US>
Sender: usenet@ames.arc.nasa.gov
Organization: NASA - Ames Research Center
Lines: 110

In article <8327@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes:

>in article <26636@ames.arc.nasa.gov>, lamaster@ames.arc.nasa.gov (Hugh LaMaster) says:

>> mainframes, I have seen single applications which *averaged* 3 MB/sec on
>> 4.5 MB/sec channels on 8 simultaneous data streams.

>Which particular mainframes? Sounds like something a Cray could do...

This exact performance figure is from a Cyber 205, but I have seen similar
performance on Crays (not quite as good *then*, but should be better now
because of faster disks - newer disks run at ~100 Mbits/sec transfer rate
as opposed to the older 36 Mbits/sec disks).

Also, I expect large IBM mainframes to do almost as well.  Although the disk
transfer rate is not as high, the disk controller to channel connection runs
at 4.5 MBytes/sec on some models.

(*Aside*)

These I/O rates are not particularly high by mainframe standards, just by
Mini/Micro standards.  There used to be a rule of thumb that for balance
a system should have a constant ratio of 1 MIPS / 1 MByte / 1 MByte/sec I/O.
The latter was slightly nebulous, but usually interpreted as channels
capable of it and disks capable of reading at that rate sustained.  
It was also considered a "good idea" if disk and channel utilization was 
less than 5% of raw aggregate capacity in order to guarantee that the
disk subsystem was not the bottleneck.  I actually did a study once and found
that one heavily used (i.e. many users) system here actually *averaged*
15 KB/sec/MIP.  This (mainframe) system was capable of at least
0.5 MB/sec/MIP I/O.  This 3% utilization helped make the CPU the bottleneck.
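
The 3% figure is simply measured I/O divided by available I/O, both per MIP.
A minimal sketch of that arithmetic in Python, assuming 1 MB = 1000 KB (the
function name is made up for illustration):

```python
def io_utilization(measured_kb_per_sec_per_mip, capacity_mb_per_sec_per_mip):
    """Fraction of raw I/O capacity actually used, per MIP."""
    # Assumption: 1 MB = 1000 KB; using 1024 changes the result only slightly.
    capacity_kb = capacity_mb_per_sec_per_mip * 1000.0
    return measured_kb_per_sec_per_mip / capacity_kb

# 15 KB/sec/MIP measured against 0.5 MB/sec/MIP of capacity.
utilization = io_utilization(15.0, 0.5)
print(f"{utilization:.0%}")
```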

Disk I/O is the usual bottleneck on mini/micro systems.  This is not 
necessarily a "problem", it is just a system design and configuration tradeoff.  
(*end Aside*)

On a Cray, if you have an SSD, your I/O rate can run a *lot* faster than
the above disk rates.


>very little overhead there at all (don't have to cope with memory
>protection, can DMA straight into the user's data space without

Yes, this is part of the reason such rates can be sustained.  These rates
were always with data copied directly into user memory.  I note that there
is a way to do this on some Unix systems:  a facility to map virtual
memory to files.  Then "paging" can potentially move the data directly into
memory without copying.  This is the case where virtual memory actually helps.
Most of the time it doesn't matter one way or the other for this problem.
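
As an illustration of the file-mapping idea, here is a minimal sketch using
Python's standard mmap module.  The scratch file and its contents are
made up for the example, and whether the pages really arrive without an
intermediate copy depends on the operating system, so treat this as a sketch
of the interface rather than a statement about any particular Unix.

```python
import mmap
import os
import tempfile

# Create a small scratch file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"A" * 4096 + b"B" * 4096)
    path = f.name

with open(path, "rb") as f:
    # Map the whole file read-only.  The data is brought in by the paging
    # mechanism when the mapping is touched, not by an explicit read()
    # into a separate user buffer.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        first = m[0:1]        # touches the start of the file
        last = m[8191:8192]   # touches the end of the file
        size = len(m)

os.unlink(path)
```

The program never calls read(); it just indexes the mapping as if it were
memory.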

>worrying about how "real" memory maps into the user's "virtual"
>memory, etc.).

Anyway, the Cyber 205 is a virtual memory machine.  VM has nothing to do with it
specifically.  The cost of copying large blocks of data is much less on a
Cray or Cyber 205/ETA machine because block data copies are done at vector
rate, and there is enough memory bandwidth available to sustain such rates.

Crays have memory protection, and the Operating System still has to figure
out what real memory addresses user memory buffers are in.  It takes a few
microseconds to do this either way, virtual or not.  These operations were
actually faster on the Cyber 205 than on the Cray X-MP/48, for various reasons.
The cost of an I/O operation has generally been in figuring out where
the data is on disk and initiating and sustaining the
transfer.  The Cyber 205 did this quickly because the hardware had *very*
capable controllers which did all the cylinder/track/sector mapping, and
presented a simple blockserver interface to the operating system.
(The 205 did not have the complicated "channel program" problem that IBMs
have because this overhead was all done in the controllers.)

>Sounds to me like another speed reason for Crays to not have virtual
>memory :-) (for the old veterans of past comp.arch discussions). Have

It sounds like a reason for systems to support fast I/O to me :-)

1) parallel I/O paths to memory (aka "channels")
2) fast disks
3) low overhead to do a raw disk operation
4) lots of memory bandwidth

5) operating systems which support multiple asynchronous I/O requests

6) operating systems which support transfer of data directly into user
	memory without being buffered elsewhere
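
Item 5 in the list above can be sketched from user level.  The following
Python example issues several read requests before waiting on any of them,
using a thread pool as a stand-in for real asynchronous I/O.  The block size,
the scratch file, and the read_block helper are all assumptions for
illustration.  The reads still go through the operating system's buffers, so
this shows only multiple outstanding requests, not items 3 or 6.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 64 * 1024   # bytes per request (arbitrary for this sketch)
STREAMS = 8         # number of requests kept outstanding at once

def read_block(path, index):
    """Read one block at the given block index as an independent request."""
    with open(path, "rb") as f:
        f.seek(index * BLOCK)
        return index, f.read(BLOCK)

# Scratch file: block i is filled with the byte value i.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(STREAMS):
        f.write(bytes([i]) * BLOCK)
    path = f.name

# Submit every request first, then collect the results.
with ThreadPoolExecutor(max_workers=STREAMS) as pool:
    futures = [pool.submit(read_block, path, i) for i in range(STREAMS)]
    results = dict(fut.result() for fut in futures)

os.unlink(path)
```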

**********************************************************************

I have an actual number to present here:

I have seen a significant number of applications which can only do about
20 floating point operations per word of I/O, unless the entire problem
can be memory contained.  The memory required for the entire problem
is in the range of 1 Million Words for every 1 to 10 MFLOPS.  So, a single
job running at ~100 MFLOPS may need about 800 MBytes, *or* the ability
to do I/O at a rate of 40 MBytes/sec.  
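
The arithmetic behind those two numbers, as a minimal Python sketch.  It
assumes 8-byte (64-bit) words, which is consistent with 32 MW = 256 MBytes;
the function names are made up for illustration.

```python
BYTES_PER_WORD = 8  # assumption: 64-bit words

def io_rate_mbytes_per_sec(mflops, flops_per_word=20):
    """I/O rate needed when the problem does not fit in memory."""
    mwords_per_sec = mflops / flops_per_word
    return mwords_per_sec * BYTES_PER_WORD

def memory_range_mbytes(mflops):
    """Memory needed to hold the whole problem: 1 MW per 10 to 1 MFLOPS."""
    low = mflops / 10 * BYTES_PER_WORD    # 1 MW for every 10 MFLOPS
    high = mflops / 1 * BYTES_PER_WORD    # 1 MW for every 1 MFLOPS
    return low, high

io_needed = io_rate_mbytes_per_sec(100)   # 40.0 MBytes/sec
mem_low, mem_high = memory_range_mbytes(100)  # 80.0 to 800.0 MBytes
```

The 800 MBytes quoted above is the upper end of the range; the lower end of
the same rule of thumb comes to 80 MBytes.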

The single job referred to earlier was running at about 200 MFLOPS on a 
Cyber 205 and needed about 50 MBytes/sec of I/O (it didn't get it - it
only got ~24 MBytes/sec).  I do not remember exactly how much memory was needed,
but it was significantly more than 32 MW (256 MBytes).
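
For what it is worth, the ~24 MBytes/sec matches 8 streams at an average of
3 MBytes/sec each.  A short Python sketch of that bookkeeping, again assuming
8-byte words; the variable names are only for illustration:

```python
BYTES_PER_WORD = 8          # assumption: 64-bit words

streams = 8
mbytes_per_stream = 3.0     # average rate per stream
delivered = streams * mbytes_per_stream   # aggregate MBytes/sec delivered

needed = 50.0               # MBytes/sec the job wanted
mflops = 200.0              # rate the job was running at

fraction_delivered = delivered / needed
# FLOPs per word of I/O implied by 200 MFLOPS at 50 MBytes/sec.
implied_flops_per_word = mflops / (needed / BYTES_PER_WORD)
```

So the job got a little under half the I/O it wanted, and its implied ratio
of about 32 FLOPs per word is in the same neighborhood as the 20 quoted above.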

You have to look at the requirements of the entire problem before you
can say what your system requirements are.

**********************************************************************

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117