Path: utzoo!attcan!uunet!lll-winken!lll-tis!helios.ee.lbl.gov!pasteur!ucbvax!bloom-beacon!oberon!cit-vax!mangler
From: mangler@cit-vax.Caltech.Edu (Don Speck)
Newsgroups: comp.unix.wizards
Subject: Re: Vax 11/780 performance vs Sun 4/280 performance
Keywords: readahead, striping, file mapping
Message-ID: <6963@cit-vax.Caltech.Edu>
Date: 16 Jun 88 06:32:08 GMT
References: <22957@bu-cs.BU.EDU> <14968@brl-adm.ARPA> <601@modular.UUCP> <23288@bu-cs.BU.EDU> <7980@alice.UUCP> <23326@bu-cs.BU.EDU>
Organization: California Institute of Technology
Lines: 71

In article <23326@bu-cs.BU.EDU>, bzs@bu-cs.BU.EDU (Barry Shein) writes:
> I think the proper question is sort/merging a disk farm and doing 1000
> transactions/sec or more while keeping 8 or 12 tapes turning at or
> near their rated 200 ips, not pushing bits thru a single channel

The hard part of this is getting enough disk throughput to feed even
one of those 200-ips tape drives.  The rest is replication.
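
For a rough sense of the target, here is a back-of-envelope sketch.  The
6250 bpi recording density is my assumption (the quoted article gives only
the tape speed), and inter-record gaps are ignored, so treat the result as
an upper bound on what one drive consumes while streaming.

```python
# Rough streaming rate of one tape drive; inter-record gaps are ignored.
TAPE_SPEED_IPS = 200   # from the quoted article
DENSITY_BPI = 6250     # ASSUMPTION: 6250 bytes per inch, not stated in the thread


def tape_rate_bytes_per_sec(speed_ips, density_bpi):
    """Bytes per second passing the head while the tape is at speed."""
    return speed_ips * density_bpi


rate = tape_rate_bytes_per_sec(TAPE_SPEED_IPS, DENSITY_BPI)
print(f"{rate / 1e6:.2f} MB/s per drive")
```

Under that assumption a single drive wants a sustained 1.25 MB/s from disk.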

Channels sound like essentially moving the disk driver into an I/O
processor, with lists of channel control blocks being analogous to
lists of struct buf's.  This makes it feasible to do more optimizations,
even real-time stuff like scatter-gather, chaining, and rotational
scheduling.

Barry mentions the UDA-50 as being similar.  But its processor is an
8085, and DMA speed is only 0.8 MB/s, making it much slower than a dumb
controller.  And the driver ends up spending as much time constructing
the channel control blocks as it would spend tending a dumb controller
like the Emulex SC7003.  The Xylogics 450, Xylogics 472, and DEC TS11
are like this too.  I find them all disappointingly slow.

I suspect the real reason for channel processors is to reduce interrupts,
which are so costly on big CPU's.  It makes sense for terminals; people
have made I/O processors that talk to Unix in clists (KMC-11's, etc)
which cuts the total interrupt rate by a large fraction.  But I don't
think it's necessary, or necessarily desirable, to inflict this on disks
& tapes, and certainly not unless the channel processor can talk in
struct buf's.

For all the optimizations that these I/O processors are supposed to do,
Unix rarely gives them the chance.  Unless more than two requests are
outstanding at once, a controller that finishes one has only one left
to choose from.  Unix has minimal readahead, so two is as many requests
as a single process can generate.  Raw I/O is even worse.
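
The effect of queue depth can be sketched with a toy model.  Everything
here is hypothetical: `total_seek`, the 800-cylinder disk, the random
request stream, and the greedy nearest-cylinder policy stand in for
whatever a real controller does.  `depth` is the number of requests waiting
at the moment the controller chooses, so depth 1 corresponds to the
two-outstanding case (one in service, one waiting).

```python
import random


def total_seek(cylinders, depth):
    """Total head travel when the controller picks the nearest of at most
    `depth` waiting requests; the queue is refilled in arrival order."""
    pending = list(cylinders)
    queue, head, travel = [], 0, 0
    while pending or queue:
        # Top the queue up to `depth` entries, oldest arrivals first.
        while pending and len(queue) < depth:
            queue.append(pending.pop(0))
        # Greedy choice: the waiting request closest to the head.
        nxt = min(queue, key=lambda c: abs(c - head))
        queue.remove(nxt)
        travel += abs(nxt - head)
        head = nxt
    return travel


random.seed(1988)
requests = [random.randrange(800) for _ in range(200)]  # hypothetical workload
for depth in (1, 2, 8):
    print(depth, total_seek(requests, depth))
```

With depth 1 the "choice" is just arrival order, which is the point: the
scheduler has nothing to optimize until the queue gets deeper.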

Asynchronous reads would be the obvious way to get enough requests in
the queue to optimize, but that seems unlikely to happen.  Rather,
explicit read commands are giving way to memory-mapped files (in Mach
and SunOS 4.0) where readahead becomes synonymous with prepaging.  It
remains to be seen whether much attention is put into this.
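
As an illustration of what reading through a mapping looks like from user
code, here is a minimal sketch using Python's `mmap` module.  The helper
name `sum_mapped` and the 8192-byte step are my own choices, and the
`madvise` hint is only issued where the platform provides it; whether the
kernel actually prepages in response is up to the kernel.

```python
import mmap
import os
import tempfile


def sum_mapped(path, step=8192):
    """Sum the bytes of a non-empty file through a read-only mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Hint sequential access where available (an advisory only).
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            total = 0
            for i in range(0, len(m), step):
                total += sum(m[i:i + step])  # page faults bring the data in
            return total


# Small demonstration on a scratch file.
fd, scratch = tempfile.mkstemp()
os.write(fd, bytes(range(256)) * 100)
os.close(fd)
demo_total = sum_mapped(scratch)
os.remove(scratch)
print(demo_total)
```

There is no explicit read call in the loop; any readahead has to come from
the paging system noticing (or being told about) the access pattern.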

Barry credits the asynchronous nature of I/O on mainframe OS's to the
access methods, like RMS on VMS.  People avoid those when they want
speed (imagine using dbm to do sequential reads).  For instance, the
VMS "copy" command bypasses RMS when copying disk-to-disk, with the
curious result that it's faster to copy to a disk than to the null
device, because the null device is record-oriented, requiring RMS.

As DMR demonstrates, parallel-transfer disks are great for big files.
They're horrendously expensive though, and it's hard enough to find
controllers that keep up with even 3 MB/s, much less 10 MB/s.  But
they can be simulated with ordinary disks by striping across multiple
controllers, *if* the disks rotate as one.  Does anyone know of a
cost-effective disk that can phase-lock its spindle motor to that of a
second disk, or perhaps to the AC line?  With direct-drive
electronically-controlled motors becoming common, this should be
possible.  The Eagle
has such a motor, but no provision for external sync.  I recall stories
of Cray's using phase-locked disks to advantage.
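
The address arithmetic for striping is simple; a minimal sketch follows.
`stripe_map` and its `stripe_blocks` parameter are hypothetical names, and
plain round-robin placement is an assumption about layout, not a
description of any particular driver.

```python
def stripe_map(logical_block, n_disks, stripe_blocks=1):
    """Map a logical block to (disk, block_on_disk) under round-robin
    striping with `stripe_blocks` consecutive blocks per disk per stripe."""
    stripe, offset = divmod(logical_block, stripe_blocks)
    disk = stripe % n_disks    # which spindle holds this stripe unit
    row = stripe // n_disks    # how many full rows precede it
    return disk, row * stripe_blocks + offset


for lb in range(8):
    print(lb, stripe_map(lb, n_disks=4))
```

Consecutive logical blocks land on different spindles, so a large
sequential transfer keeps all of them busy at once.  The mapping says
nothing about rotational position, which is why the spindles need to turn
in step for this to approach a true parallel-transfer disk.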

Of course, to get the most from high transfer rates, you need large
blocksizes; DMR's example looked like about one revolution.  Hence
the extent-based file allocation of mainframe OS's, etc.  Perhaps
it's time to pester Berkeley to double MAXBSIZE to 16384 bytes?
It would use 0.3% of memory for additional kernel page tables on a
VAX, but proportionately less on machines with larger page sizes.
8192 is practically the *minimum* blocksize on Suns, these days.
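
To see why blocksize matters, consider a simple model in which every
request pays a fixed overhead before data moves.  The 2 MB/s media rate and
2 ms per-request overhead below are assumed round numbers for illustration,
not measurements of any drive or controller.

```python
def effective_rate(block_bytes, media_rate, overhead_s):
    """Bytes/s delivered when each block costs a fixed overhead plus its
    transfer time at the raw media rate."""
    return block_bytes / (overhead_s + block_bytes / media_rate)


MEDIA_RATE = 2.0e6   # ASSUMPTION: 2 MB/s off the platter
OVERHEAD_S = 0.002   # ASSUMPTION: 2 ms per request (interrupt, driver, rotation)

for size in (4096, 8192, 16384, 32768):
    print(size, round(effective_rate(size, MEDIA_RATE, OVERHEAD_S)))
```

In this model the fixed cost is amortized over more bytes as the block
grows, so the delivered rate climbs toward the media rate without ever
reaching it.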

The one point that nobody mentioned is that you don't want the CPU
copying the data around between kernel and user address spaces when
there's a lot!  (Maybe it was just too obvious).

Don Speck   speck@vlsi.caltech.edu  {amdahl,ames!elroy}!cit-vax!speck