Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Re: Sun bogosities, including MMU thrashing
Message-ID: <PCG.91Jan28111718@odin.cs.aber.ac.uk>
Date: 28 Jan 91 11:17:18 GMT
References: <1991Jan10.214122.9506@news.arc.nasa.gov> <amos.663857722@shum>
	<PCG.91Jan18142616@teachk.cs.aber.ac.uk>
	<1991Jan25.185333.607@quick.com>
Sender: aro@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 98
Nntp-Posting-Host: odin
In-reply-to: srg@quick.com's message of 25 Jan 91 18:53:33 GMT

On 25 Jan 91 18:53:33 GMT, srg@quick.com (Spencer Garrett) said:

srg> In article <PCG.91Jan18142616@teachk.cs.aber.ac.uk>,
srg> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

	[ ... the BSD FFS has block and fragment sizes from
	1KB to 8KB in 1KB increments, but pages have a fixed size
	of 8KB, and the buffer cache is VM mapped in 4BSD/SunOS ... ]

pcg> This greatly complicates life, especially as the kernel, up to SunOS 3,
pcg> has a cache of blocks. Having variable block sizes means having variable
pcg> buffer sizes (which complicates life quite a bit, as you must then put
pcg> the buffer cache in virtual memory). Each buffer is described by a
pcg> buffer header, and now this not only contains its address but also the
pcg> block size.

srg> You've made an incorrect assumption here, and the rest of your story
srg> about the buffer cache bogosities therefore falls wide of the mark

I cannot see where I made an incorrect assumption. Maybe I have not been
clear.  Maybe you just have not understood what I wrote. Others did,
though.

srg> The buffer cache is a *disk* cache, not a *file* cache.

The well known fact that the UNIX cache is a disk cache has no effect
whatsoever on my argument. Incidentally, as far as I know no USG/BSD
derived Unix kernel uses a file cache, in the sense in which OS/360 and
others used to. Amoeba has some kind of a file cache, but its FS design
is very unconventional, taking to an extreme the idea of having the
first block of a file stored in the inode (it stores the *entire* file
in the inode, in a sense, as all files are made up of a single block,
and blocks are variable sized, and block sizes coincide with file
sizes). But Amoeba is not Unix, even if it can emulate its API quite
closely. Another system I know of that does file caching is CMU's
Virtue, but again in a quite different spirit and setting from Unix.

srg> When the filesystem code gets a transfer request (read or write) it
srg> translates the file offset into an absolute sector number and
srg> *then* goes to the disk cache to find that sector (for efficiency's
srg> sake it actually does ranges of sectors at a time, but the idea is
srg> the same).

No, it does not. All *filesystem* IO is done in block (System V) or
fragment (BSD) units; only the driver and the cache manager actually
see sector numbers. The filesystem code is carefully written using
macros that abstract away from the physical sector number.
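
To make the layering concrete, here is a minimal sketch in the spirit of
those macros. It is not the kernel source: the struct and the toy_* names
are hypothetical stand-ins, and the shift values in the usage note are
assumptions (8KB blocks, 1KB fragments, 512 byte sectors). The point is
only that file offsets are turned into block and fragment units, and that
a sector number appears only in the last conversion, at the boundary with
the cache manager and the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the superblock fields involved. */
struct toy_fs {
    uint32_t bshift;     /* log2(block size), e.g. 13 for 8KB blocks      */
    uint32_t fshift;     /* log2(fragment size), e.g. 10 for 1KB frags    */
    uint32_t sect_shift; /* log2(sector size), e.g. 9 for 512 byte sectors */
};

/* Logical block number of the block holding byte offset `off` of a file. */
static uint32_t toy_lblkno(const struct toy_fs *fs, uint64_t off)
{
    return (uint32_t)(off >> fs->bshift);
}

/* Byte offset of `off` within its logical block. */
static uint32_t toy_blkoff(const struct toy_fs *fs, uint64_t off)
{
    return (uint32_t)(off & ((UINT64_C(1) << fs->bshift) - 1));
}

/* Fragment address to sector number: the only place sectors show up. */
static uint64_t toy_fsbtodb(const struct toy_fs *fs, uint64_t frag)
{
    assert(fs->fshift >= fs->sect_shift);
    return frag << (fs->fshift - fs->sect_shift);
}
```

With bshift 13, fshift 10 and sect_shift 9, byte offset 20000 of a file
falls in logical block 2 at offset 3616, and fragment address 5
corresponds to sector 10.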


I may well have been hallucinating for the past dozen years, but reading
the kernel code, the Unix IO system paper, the BSD FFS paper, and the
4.3 book, gives me a completely different story from yours:

srg> Lots of files are smaller than 8k, and one disk buffer may well
srg> hold several of them.

I am quite sure this does not happen under 4BSD (or under *any* other
UNIX implementation derived from USG or BSD sources) and under SunOS in
particular, which on most machines simply does not have buffers smaller
than 8KB because they are VM mapped, VM pages are 8KB long, and buffers
are made up of *multiple* pages. Please reread more carefully the above
mentioned papers and books. I will do the same, just in case.

You may be confused in that it does happen that the same 8KB disk block
may indeed become split into 8 1KB fragments *on the disk*, each holding
a different file.

But in the buffer cache each of the fragments will be allocated a
different buffer slot, which has a minimum size of one page, which is
8KB on many Suns (or 4KB on many others). Since, as you seem to
correctly remember, *most* files are under 8KB, this means that, as I
have remarked, internal fragmentation is countered by fragments on disk,
but not in the buffer cache. This is particularly bad as the percentage
of files that are well under 8KB is greater for the dynamic case (files
actually accessed) than for the static case (files stored on disk).
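
The arithmetic behind this can be sketched as follows. The function names
are made up for illustration, and the only assumption is the one stated
above, that a buffer slot is a whole number of pages and at least one
page long.

```c
#include <assert.h>
#include <stddef.h>

/* Memory taken by a buffer slot when slots are made of whole pages. */
static size_t slot_bytes(size_t data_bytes, size_t page_bytes)
{
    assert(page_bytes > 0);
    size_t pages = (data_bytes + page_bytes - 1) / page_bytes;
    if (pages == 0)
        pages = 1;               /* a slot is never smaller than one page */
    return pages * page_bytes;
}

/* Bytes of the slot that hold no file data (internal fragmentation). */
static size_t wasted_bytes(size_t data_bytes, size_t page_bytes)
{
    return slot_bytes(data_bytes, page_bytes) - data_bytes;
}
```

A 1KB fragment cached with 8KB pages gives wasted_bytes(1024, 8192) ==
7168, that is 7KB out of 8KB or 87.5% of the slot; with 1KB pages the
same fragment wastes nothing.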

It may be thought that a disk block of 8KB containing 8 1KB fragments
could be mapped eight times as 8 different buffer slots; in practice
this cannot happen because:

1) each fragment would be at a different offset in the buffer slot,
complicating matters even more;

2) it would be hard to make the file grow, as this would require
taking care to unmap the old block and remap the new one, and copying,
and so on.
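
Point 1 can be illustrated with a small sketch. The helper below is
hypothetical, not taken from any kernel; it only computes where a
fragment sits inside the disk block that contains it, which is where its
data would start if the whole block were mapped as its buffer slot.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of fragment `frag` inside the disk block containing it.
 * `frags_per_block` and `frag_bytes` are illustrative parameters. */
static size_t frag_offset_in_block(size_t frag, size_t frags_per_block,
                                   size_t frag_bytes)
{
    assert(frags_per_block > 0);
    return (frag % frags_per_block) * frag_bytes;
}
```

With 8 fragments of 1KB per block, fragment 8 starts at offset 0 but
fragment 11 starts at offset 3072, so seven of the eight buffers sharing
that page would have their data starting somewhere other than the
beginning of the slot.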

The BSD FFS/buffer cache design was meant for architectures where the VM
page size was about the same size as a fragment, not as a block, so
growing the last (direct?) block of a file from 1KB to 8KB could be done
just by mapping new pages in the same buffer slot, without any copying
at all. The BSD assumption was not bad -- the smaller the page size the
better, with dynamic clustering for spreading IO latencies and
overheads.
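
As a rough sketch of the "grow by mapping" idea, under the assumption
that a buffer slot is a run of whole pages: the number of pages that
must be added to a slot when its contents grow is just a difference of
two roundings. The names below are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Number of whole pages needed to hold `bytes` bytes. */
static size_t pages_for(size_t bytes, size_t page_bytes)
{
    assert(page_bytes > 0);
    return (bytes + page_bytes - 1) / page_bytes;
}

/* Extra pages to map into an existing slot when its contents grow
 * from `old_bytes` to `new_bytes`; existing pages are left in place. */
static size_t pages_to_map(size_t old_bytes, size_t new_bytes,
                           size_t page_bytes)
{
    assert(new_bytes >= old_bytes);
    return pages_for(new_bytes, page_bytes) - pages_for(old_bytes, page_bytes);
}
```

Growing from 1KB to 8KB with 1KB pages maps 7 more pages behind the one
already there, and nothing is copied. With 8KB pages the answer is 0,
but only because the slot has been a full 8KB page all along, which is
exactly the waste described earlier.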
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk