Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Re: Sun bogosities, including MMU thrashing
Message-ID:
Date: 28 Jan 91 11:17:18 GMT
References: <1991Jan10.214122.9506@news.arc.nasa.gov> <1991Jan25.185333.607@quick.com>
Sender: aro@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 98
Nntp-Posting-Host: odin
In-reply-to: srg@quick.com's message of 25 Jan 91 18:53:33 GMT

On 25 Jan 91 18:53:33 GMT, srg@quick.com (Spencer Garrett) said:

srg> In article ,
srg> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

  [ ... the BSD FFS has block and fragment sizes from 1KB to 8KB in
  1KB increments, but pages are fixed size to 8KB, and the buffer
  cache is VM mapped in 4BSD/SunOS ... ]

pcg> This greatly complicates life, especially as the kernel, up to SunOS 3,
pcg> has a cache of blocks. Having variable block sizes means having variable
pcg> buffer sizes (which complicates life quite a bit, as you must then put
pcg> the buffer cache in virtual memory). Each buffer is described by a
pcg> buffer header, and now this not only contains its address but also the
pcg> block size.

srg> You've made an incorrect assumption here, and the rest of your story
srg> about the buffer cache bogosities therefore falls wide of the mark

I cannot see where I made an incorrect assumption. Maybe I have not
been clear. Maybe you just have not understood what I wrote. Others
did, though.

srg> The buffer cache is a *disk* cache, not a *file* cache.

The well known fact that the UNIX cache is a disk cache has no effect
whatsoever on my argument.

Incidentally, as far as I know no USG/BSD derived Unix kernel uses a
file cache, in the sense in which OS/360 and others used to.
Amoeba has some kind of a file cache, but its FS design is very
unconventional, taking to an extreme the idea of having the first
block of a file stored in the inode (it stores the *entire* file in
the inode, in a sense, as all files are made up of a single block, and
blocks are variable sized, and block sizes coincide with file sizes).
But Amoeba is not Unix, even if it can emulate its API quite closely.
Another system I know of that does file caching is CMU's Virtue, but
again in a quite different spirit and setting from Unix.

srg> When the filesystem code gets a transfer request (read or write) it
srg> translates the file offset into an absolute sector number and
srg> *then* goes to the disk cache to find that sector (for efficiency's
srg> sake it actually does ranges of sectors at a time, but the idea is
srg> the same).

No, it does not. All *filesystem* IO is done in block (System V) or
fragment (BSD) units; only the driver and the cache manager actually
see sector numbers. The filesystem code is carefully written using
macros that abstract away from the physical sector number.

I may well have been hallucinating for the past dozen years, but
reading the kernel code, the Unix IO system paper, the BSD FFS paper,
and the 4.3 book, gives me a completely different story from yours:

srg> Lots of files are smaller than 8k, and one disk buffer may well
srg> hold several of them.

I am quite sure this does not happen under 4BSD (or under *any* other
UNIX implementation derived from USG or BSD sources), and under SunOS
in particular, which on most machines simply does not have buffers
smaller than 8KB, because they are VM mapped, VM pages are 8KB long,
and buffers are made up of *multiple* pages.

Please reread more carefully the above mentioned papers and books. I
will do the same, just in case.

You may be confused in that it does happen that the same 8KB disk
block may indeed become split into 8 1KB fragments *on the disk*, each
holding a different file.
But in the buffer cache each of the fragments will be allocated a
different buffer slot, which has a minimum size of one page, which is
8KB on many Suns (or 4KB on many others).

Since, as you seem to correctly remember, *most* files are under 8KB,
this means that, as I have remarked, internal fragmentation is
countered by fragments on disk, but not in the buffer cache. This is
particularly bad as the percentage of files that are well under 8KB is
greater for the dynamic case than for the static case.

It may be thought that a disk block of 8KB containing 8 1KB fragments
could be mapped eight times as 8 different buffer slots; in practice
this cannot happen because:

  1) each fragment would be at a different offset in the buffer slot,
     complicating matters even more;

  2) it would be hard to make the file grow, as this would require
     taking care to unmap the old block and remap the new one, and
     copying, and so on.

The BSD FFS/buffer cache design was meant for architectures where the
VM page size was about the same size as a fragment, not as a block, so
growing the last (direct?) block of a file from 1KB to 8KB could be
done just by mapping new pages in the same buffer slot, without any
copying at all.

The BSD assumption was not bad -- the smaller the page size the
better, with dynamic clustering for spreading IO latencies and
overheads.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
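
A rough sketch, in C, of the arithmetic behind the fragmentation
argument above. The function names and the two size constants are
illustrative assumptions, not taken from any kernel source. The model
is deliberately simplified: a file is assumed to occupy whole 1KB
fragments on disk and whole 8KB pages in a page-mapped buffer cache.

```c
#include <stddef.h>

/* Illustrative sizes only (assumed, not from any kernel header). */
#define FRAG_BYTES 1024u  /* FFS fragment size */
#define PAGE_BYTES 8192u  /* VM page size, as on many Suns */

/* Round n up to the next multiple of unit (unit must be nonzero). */
size_t round_up(size_t n, size_t unit)
{
    return (n + unit - 1) / unit * unit;
}

/* Bytes a file of the given size occupies on disk: whole fragments. */
size_t disk_bytes(size_t file_size)
{
    return round_up(file_size, FRAG_BYTES);
}

/* Bytes the same file occupies in a page-mapped cache: whole pages. */
size_t cache_bytes(size_t file_size)
{
    return round_up(file_size, PAGE_BYTES);
}

/* Cache memory lost to internal fragmentation for one file. */
size_t cache_waste(size_t file_size)
{
    return cache_bytes(file_size) - file_size;
}
```

Under these assumptions, eight 1KB files fit in a single 8KB block on
disk (8 * disk_bytes(1024) is 8192), while in the buffer cache they
take 8 * cache_bytes(1024), that is 65536 bytes, of which 7168 bytes
per file are wasted.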