Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!samsung!think!ames!dftsrv!mimsy!chris
From: chris@mimsy.umd.edu (Chris Torek)
Newsgroups: comp.unix.ultrix
Subject: Re: Any disk de-fragmenters out there?
Message-ID: <21063@mimsy.umd.edu>
Date: 2 Dec 89 07:48:52 GMT
References: <2095@compugen.> <7862@bunny.GTE.COM>
Distribution: comp
Organization: U of Maryland, Dept. of Computer Science, Coll. Pk., MD 20742
Lines: 91

In article <7862@bunny.GTE.COM> krs0@GTE.COM (Rod Stephens) writes:
>I was at a DECUS seminar where someone asked how disk fragmentation
>affected performance on Unix systems. The answer was that the file
>system works best if things are NOT contiguous. (This started a series
>of jokes about disk fragmenters ;) Unfortunately that's all I know.
>Does anyone know what this guy meant?

He was probably talking about matching rotational and processing
latency.

The average Unix program reads files using stdio.  Stdio reads files
one block at a time, where a block is the underlying file system block
size (typically 4K, 8K, 16K, 32K, or 64K, although sizes greater than
8K do not occur on current VAX Unix systems).  While the current block
is being read, the next block is brought in to the buffer cache via
breada().  (This read-ahead applies to at most the first 12 blocks of
the file, since only direct block entries participate in it.)  The two
read requests are passed to the disk device driver as two separate
requests, and most disk devices can handle only one request at a time
[assumption #1].
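
The read-ahead decision described above can be modeled in a few lines
of Python.  This is a toy model, not the kernel's breada(): the name
blocks_to_request and its arguments are made up for illustration, and
the cutoff at 12 blocks simply follows the statement above that only
direct block entries take part.

```python
NDADDR = 12  # direct block pointers per inode (the "12" in the text above)

def blocks_to_request(lbn, file_blocks):
    """Return the logical block numbers queued when block `lbn` is read.

    Hypothetical helper: models "read this block, and read ahead one
    block if the next one is still a direct block of the file".
    """
    if not 0 <= lbn < file_blocks:
        raise ValueError("block number outside file")
    requests = [lbn]                      # the block the caller wants
    nxt = lbn + 1
    if nxt < file_blocks and nxt < NDADDR:
        requests.append(nxt)              # read-ahead, queued separately
    return requests
```

For a 20-block file, blocks_to_request(0, 20) gives [0, 1], while
blocks_to_request(11, 20) gives just [11].  Each entry in the list
stands for a separate request handed to the driver.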

Thus, the first request (for the desired block) is sent to the disk,
and the second is placed on a queue.  When the first one finishes
transferring into main memory, the device interrupts the CPU, which has
to figure out what has happened, notice the next read request, and pass
that on to the device.  During that time, the disk may have passed the
point where it can read the next sector immediately [assumption #2].
If the second block of the file is contiguous with the first, the disk
head will already be over the middle of that block, and the CPU will
have to wait about one complete revolution of the disk (typically
1/60th of a second) for the second block to come around again
[assumption (hidden this time) #3].  By then the application reading
the file is likely to have requested the second block already, so the
application has to wait.  Once it has had to wait, the next read
request proceeds much as the previous one did, and the application ends
up waiting for each disk block, despite the read-ahead.
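
To put rough numbers on the contiguous case, here is a sketch in
Python.  It assumes a drive turning once every 1/60th of a second with
48 sectors per track (the Eagle figure used below), and the interrupt
latency is a made-up value; the function names are illustrative only.

```python
ROTATION_US = 1_000_000 / 60   # one revolution, in microseconds
SECTORS_PER_TRACK = 48         # assumed geometry

def sectors_passed(latency_us):
    """Sectors that go by under the head during `latency_us`."""
    return latency_us * SECTORS_PER_TRACK / ROTATION_US

def wait_for_block_start_us(latency_us):
    """Contiguous layout: the next block starts right where the last ended.

    If the driver is late by any amount, the head must come all the way
    around to the start of the block again.
    """
    late = latency_us % ROTATION_US
    return 0.0 if late == 0 else ROTATION_US - late

if __name__ == "__main__":
    latency = 500.0  # hypothetical interrupt-to-request latency, in us
    print("sectors missed: %.2f" % sectors_passed(latency))
    print("rotational wait: %.0f us" % wait_for_block_start_us(latency))
```

With the assumed 500 us latency, between one and two sectors have
already gone by, and the driver waits roughly 16 ms for the start of
the block to return.  The exact latency hardly matters: any nonzero
delay costs nearly the whole revolution.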

If the blocks are separated by a gap, however, the analysis changes as
follows.  The first block works more or less as before, but this time
the second block is not already under the disk head, so it comes in as
soon as the head reaches that block.  The CPU has only to wait for the
current block to pass under the head, and for the next block to
transfer into memory.  On a `typical' Fujitsu Eagle (48 sectors/track)
with 8K byte blocks (16 sectors each, at 512 bytes per sector) and a
gap of one block, this is essentially 2/3 (32/48) the time for a full
rotation.  On a larger disk (with more sectors per track) the ratio
is better.  Now the application need not wait as long, or---if it is
sufficiently slow---perhaps not even at all.  If it does not wait at
all, the next read request (issued at the same time the read-ahead
block is returned to the application) might have to wait for nearly a
full rotation again (if we have bad luck and the application uses the
first read-ahead block just as the disk head is over the third file
block) or might not.
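
The two layouts can be compared with a small Python model.  This is a
simplification under stated assumptions: 48 sectors per track, 16
sectors per 8K block (512-byte sectors), everything on one track, and
the driver's delay expressed in whole sectors.  The function names are
invented for the example.

```python
SECTORS_PER_TRACK = 48   # Fujitsu Eagle figure from the text
SECTORS_PER_BLOCK = 16   # 8K block, assuming 512-byte sectors

def sectors_until_loaded(gap, delay):
    """Sectors of rotation from the end of one block's transfer until the
    next block of the file is completely in memory.

    gap:   sectors between the two blocks on the track (0 = contiguous)
    delay: sectors that pass before the driver issues the next request
    """
    if delay <= gap:
        start = gap                      # caught the block first time round
    else:
        missed = delay - gap
        turns = -(-missed // SECTORS_PER_TRACK)   # ceiling division
        start = gap + turns * SECTORS_PER_TRACK   # wait for it to return
    return start + SECTORS_PER_BLOCK

def as_fraction_of_rotation(sectors):
    """Express a sector count as a fraction of one revolution."""
    return sectors / SECTORS_PER_TRACK
```

With a delay of one sector, a one-block gap (gap=16) gives 32 sectors,
the 32/48 figure above, while the contiguous layout (gap=0) gives 64
sectors: a full rotation plus the transfer itself.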

(If the application is REALLY slow, random block placement works just
as well as anything else, since the read-ahead always has plenty of
time to move the disk head to the next block, so we can ignore this
case.)

Now, the disk sometimes also has to switch heads and/or cylinders.
This can take time.  Typical head switch delays are 0 (disks with servo
tracks) or ~0.2ms (embedded sync) or ridiculous (some DEC RA series,
embedded sync plus stupidity); typical track-to-track seek times are
~1ms.  On head or cylinder switch, one also wants the next block not to
be contiguous, no matter what the application's speed may be, since the
system can try to match the rotational delay with the switch delay.
The BSD FFS does not do a good job of this, so I am not going to say
much more here.
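
The idea of matching rotational delay to switch delay can at least be
sketched.  The Python below is not what the BSD FFS does; it only
computes how many sectors would have to be skipped so that the head or
cylinder switch finishes before the next block arrives.  The function
name and the default geometry (48 sectors per track, 60 revolutions per
second) are assumptions for illustration.

```python
import math

def skew_sectors(switch_ms, sectors_per_track=48, rev_per_sec=60):
    """Smallest whole number of sectors that takes at least `switch_ms`
    to pass under the head."""
    # Time for a single sector to pass under the head, in milliseconds.
    sector_ms = 1000.0 / (rev_per_sec * sectors_per_track)
    return math.ceil(switch_ms / sector_ms)
```

On this geometry a sector takes about 0.35 ms to go by, so a ~0.2ms
head switch calls for a skew of 1 sector and a ~1ms track-to-track seek
for 3 sectors.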

Here are the assumptions marked above:

	#1: disk does only one thing at a time.

True for VAX except DEC RA disks on DEC MSCP controllers.  Many of
these controllers are so horribly slow (UDA50 => 8085 w/ 16 or 32 K of
cache) that this does not matter anyway.  Others (HSC => LSI-11,
presumably with decent amounts of RAM) can do a better job (which is
not to say that they necessarily do so).

	#2: CPU+device is slow enough to let sector headers pass
	    by during interrupt handling.

True on many machines, but becomes false as the machines get faster or
the drivers get better.  Once the machine is fast enough, contiguous
placement becomes good.  (VAXen are not getting faster very fast.)

	#3: disk+controller does not do read-ahead.

True on most (all?) VAX systems, false on modern SCSI systems.  A good
read-ahead cache on the disk changes everything.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris