Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!samsung!uunet!mcsun!ukc!edcastle!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Sun bogosities, including MMU thrashing
Message-ID: <PCG.91Jan18142616@teachk.cs.aber.ac.uk>
Date: 18 Jan 91 14:26:16 GMT
References: <1991Jan10.214122.9506@news.arc.nasa.gov> <amos.663857722@shum>
	<5257@auspex.auspex.com> <3956@skye.ed.ac.uk>
Sender: cho@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 277
Nntp-Posting-Host: teachk
In-reply-to: richard@aiai.ed.ac.uk's message of 16 Jan 91 19:28:27 GMT
X-Was-Subject: Re: ~8-job "knee" in response curves on Suns

On 16 Jan 91 19:28:27 GMT, richard@aiai.ed.ac.uk (Richard Tobin) said:

richard> I've several times heard about the "knee" in the performance of
richard> Suns as the number of concurrent processes increases, so I wrote
richard> a simple benchmark. [ ... ] Here are some results. [ ... ]

The reason for the appallingly bad results is simply the appallingly bad
design of virtually *all* the MMUs and schedulers of SunOS since the
very beginning. Note that I said *design*, not implementation -- the Sun
people seem able to do fairly clever hardware, after having come up with
appallingly bad architecture and software (hackers!).

I have been observing catastrophic design mistakes in Sun (and DEC, ...)
hardware or software for a while; since the machines I am most familiar
with are Suns (and 386s), here is a summary I wrote long ago of their
"problems", including the MMU ones.

Briefly, the MMU problem is that for some unfathomable reason (I know
which one, by the way, but I don't agree it is defensible) Sun decided
originally to cache entire page tables in the MMU instead of page table
*entries*, and then to match an LRU page table replacement policy with a
FIFO scheduler, thus *guaranteeing* thrashing as soon as the number of
active page tables (note: the *total* number, not the *most frequently
active* ones) exceeds the MMU capacity.

Here are my notes, in a crescendo of astonishing bogosity, so that the
MMU notes are near the end.

I also wrote some notes on the wonders of NFS and of using Ethernet as
an IO bus and Ethernet adapters as bus access chips; this will have to
wait a bit for a posting, after the flames for this posting have sizzled
me :-).

========================================================================

I have decided to tell you some funny myths and other tales about Sun
systems. As such, some of them are quite subtle; they are by no means
limited to Sun systems, or to BSD based systems. I could tell you
equally incredible stories about System V (what? it still does expansion
swaps?) or DEC (what? the *pager* has a memory leak?).  Most of this
discussion is historical; the latest SunOS release, 4.1.1, and the
various Sun 4 models may have entirely different bogosities. Somehow I
don't feel inclined to believe that Sun will ever cease to amaze me with
ever new and incredible ones.

Beware: some of the material herein is speculative, but I hope at least
engagingly credible :-).

		BACKGROUND ON FILESYSTEMS

The crew that did 4BSD observed that the performance of the V7 Unix
filesystem is poor when demanding applications are run on modern machines
and discs. The problems are poor locality of i-nodes, poor locality of disc
blocks, poor layout of disc blocks, and insufficient clustering of transfers.

A poor way of addressing some of these problems is to first double the
block size from one 512 byte sector to two (like in 2BSD), which is
however not bad (even if criticized by the Unix authors with compelling
arguments), and then to raise the block size to 16 sectors, like in
SunOS. Surely this does improve clustering of blocks (you force 16
sectors of a file at a time to be contiguous), and it also improves the
clustering of transfers (when you read the first sector of a block you
also get the next 15).

It is very stupid because it is a poor approximation to the right
answer, which is to do *dynamic* and static clustering of runs of blocks
(actually the BSD guys did implement very complicated forms of static
clustering). It also has some big disadvantages, hinted at by
Thompson&Ritchie in the V7 papers (they advised against doubling the
block size from 1 to 2 sectors with terse and cogent reasoning).

The most obvious of these disadvantages is that Unix files on average
are quite a bit smaller than 8KB, and that having a uniform 8KB block
size would waste in internal fragmentation at least 50-60% of the disc
space on a typical Unix filesystem.
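
As a rough sketch of the arithmetic behind that estimate, the fragment
below computes the fraction of allocated space lost to internal
fragmentation when every file is rounded up to whole 8KB blocks. The
list of file sizes is invented purely for illustration; it is not
measured data, and the function names are hypothetical.

```python
BLOCK = 8 * 1024  # uniform block size in bytes, as discussed in the text


def allocated(size, block=BLOCK):
    """Bytes of disc consumed by a file of `size` bytes when space is
    handed out only in whole blocks."""
    return -(-size // block) * block  # ceiling division, then scale


def wasted_fraction(sizes, block=BLOCK):
    """Fraction of the allocated space that holds no file data."""
    used = sum(sizes)
    alloc = sum(allocated(s, block) for s in sizes)
    return (alloc - used) / alloc


# Hypothetical file sizes in bytes (an assumption, not a measurement).
sample = [300, 1200, 2500, 4000, 700, 9000, 1500, 20000]
print(f"wasted: {wasted_fraction(sample):.0%}")
```

The smaller the typical file is relative to the block, the closer the
wasted fraction gets to 100%.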

The BSD people therefore decided, complicating matters greatly, and in
the kernel, where things should be as simple as possible, to have
various block sizes, from 1KB to 8KB. Block sizes from 1KB to 7KB were
called "fragments".  A file would be made up of a number (usually zero!)
of 8KB blocks and then a single tail fragment.
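
A minimal sketch of that layout rule, assuming 8KB blocks and 1KB
fragments as described above. It ignores indirect blocks and any other
restrictions the real filesystem places on where fragments may be used;
the function name is hypothetical.

```python
BLOCK = 8 * 1024   # full block size in bytes
FRAG = 1024        # fragment granularity in bytes


def layout(size):
    """Return (full_blocks, tail_fragment_kb) for a file of `size` bytes.

    The tail fragment is 0 (no tail) or between 1 and 7 KB; a tail that
    would need 8 fragments is simply another full block.
    """
    full, rest = divmod(size, BLOCK)
    tail_kb = -(-rest // FRAG)        # round the remainder up to 1KB units
    if tail_kb == BLOCK // FRAG:      # 8 fragments == one full block
        full, tail_kb = full + 1, 0
    return full, tail_kb


print(layout(2500))   # a small file: no full blocks, one tail fragment
print(layout(9000))   # one full block plus a tail fragment
```

For a small file the first element is zero, which is the "usually zero!"
case remarked on above.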

This greatly complicates life, especially as the kernel, up to SunOS 3,
has a cache of blocks. Having variable block sizes means having variable
buffer sizes (which complicates life quite a bit, as you must then put
the buffer cache in virtual memory). Each buffer is described by a
buffer header, and now this not only contains its address but also the
block size.

The number of buffer headers and the number of KBytes allocated to the
buffer pool are fixed at boot time. One should have therefore a ratio of
buffer headers to buffer cache size that approximates the average
expected size of a block. Now, you recall from the above discussion that
fragments were introduced precisely because it was expected that a
*large* proportion of files would be smaller than a full block. It can
therefore be expected that a *large* proportion of buffers would contain
an entire file much smaller than a full block; I would even venture to
suggest, in the absence of hard data, that probably the average buffer
size is around 2KB.

    In particular, directories are frequently accessed, and they had better
    be small. The only frequently accessed files of significant size are
    going to be probably only executables, and BSD executable fetching
    bypasses the buffer cache, thanks to demand paging from the filesystem
    itself.

		THE SunOS 3 BUFFER CACHE	

Well, the default for SunOS 3 is to allocate a buffer header every 8KB
of buffer cache. This means that the assumption is made that all buffers
will contain full blocks, or in other words, that internal fragmentation
should be avoided on discs but not in memory.

In practice, it is my expectation that the typical SunOS 3 buffer cache
will be about 70-80% unused, because the supply of buffer headers will
be exhausted well before that of buffer cache space. Notice that not
only do all available statistics seem to indicate that most files are
well under 8KB, but also a buffer header is under 100 bytes, while a
buffer cache slot is 1KB long; I don't know about you, but I would more
readily overallocate and potentially waste 100 byte resources than 1000
byte ones...
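
The 70-80% figure follows from simple arithmetic, sketched here under
the assumptions already stated: one buffer header per 8KB of cache and
an average buffer of around 2KB (the latter is the guess made earlier,
not hard data). The cache size of 400KB and the function name are
hypothetical.

```python
def cache_utilization(cache_kb, avg_buffer_kb, kb_per_header=8):
    """Fraction of the buffer cache that can actually hold data before
    the supply of buffer headers runs out."""
    headers = cache_kb // kb_per_header
    # Each header describes one buffer, so at most this much is usable.
    usable_kb = min(cache_kb, headers * avg_buffer_kb)
    return usable_kb / cache_kb


util = cache_utilization(cache_kb=400, avg_buffer_kb=2)
print(f"used: {util:.0%}, unused: {1 - util:.0%}")
```

With 2KB average buffers only a quarter of the cache is reachable; only
when every buffer is a full 8KB block does the default ratio use all of
it.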

Notice that the *only* way to override this ridiculous default is to
*patch the kernel binary*, an operation that most think is high wizardry
and would not even dare consider.

    Actually I strongly suspect that even this will not work. The ability to
    have variable length buffers depends on the ability to map small pages;
    since most Suns have an 8KB VM page, probably fragment buffers are not
    supported at all, and therefore raising the number of buffer headers
    will be pretty useless. This of course is not very nice in itself...

Notice also that Unix is often heavvvvily IO bound, and an effective and
large buffer cache is tremendously effective, especially if the users
tend to edit and compile repeatedly a number of small files, as they
typically do.

Notice also that as another default only 10% of memory is reserved for
the buffer cache, which is, for your typical timesharing or program
development usage pattern, way too small.

It is no wonder that Sun recommends adding more memory to solve
performance problems...

		THE SunOS 4 BUFFERING SCHEME

SunOS 4 no longer uses the buffer cache. All files are accessed via the
virtual memory technology. This should help tremendously in reducing
system overhead, as for most files after the initial 'open()' no system
call is required until the 'close()'. It also means that file pages
compete with process pages for main memory, as there is no longer a
buffer cache, and all of free memory is dedicated to the most
(supposedly) recently used pages, be they process local or memory
mapped files.

    For some reason there is still an 'nbuf' variable in the kernel to set
    the number of buffer headers. It would be interesting to know what role
    buffer headers have in the new architecture. I have this suspicion that
    'nbuf' is just a *limit*.

Unfortunately, the Sun virtual memory technology has an 8KB page size.
Sun justifies this *extravagantly large* page size with the idea that
nowadays memory is cheaper.

This means that under SunOS 4 you effectively no longer have variable
sized buffers to counter the problem of internal fragmentation.  All of
central memory uses only 8KB pages as the allocation unit. We can thus
contemplate the ridiculous situation that internal fragmentation is a
concern for disc space wastage, but not for memory space.

    Something tells me that as soon as Sun looks again at the problem, they
    will remove the code that handles short blocks (fragments) from the
    filesystem as well, because now disc storage is cheap as well :->.

If memory is cheaper, it would be my naive idea to use it to *reduce*
internal fragmentation, by having a small page size; a small page size
implies a larger page table, but memory is supposedly cheap, isn't it?
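
To put rough numbers on that trade-off, here is a small sketch. It
assumes a flat (single level) page table and the usual rule of thumb
that the last page of each mapped object is on average half empty;
neither assumption describes the real Sun hardware, and the 16MB
address space is just an example.

```python
def page_table_entries(address_space_bytes, page_bytes):
    """Entries needed by a flat page table covering the address space."""
    return -(-address_space_bytes // page_bytes)  # ceiling division


def expected_tail_waste(page_bytes):
    """Average bytes lost per mapped object, assuming its last page is
    on average half empty (a rule of thumb, not a measurement)."""
    return page_bytes // 2


SPACE = 16 * 1024 * 1024  # hypothetical 16MB address space
for page in (1024, 4096, 8192):
    print(page, page_table_entries(SPACE, page), expected_tail_waste(page))
```

Going from 8KB to 1KB pages multiplies the number of entries by eight
and divides the expected waste per mapped object by eight.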

		THE Sun VIRTUAL MEMORY CACHE

Actually Sun have made another wonderful design decision. Most virtual
memory technologies support multiple page tables, and cache entries
(supposedly the most frequently used ones) from such tables (mostly in
inane ways, some in more intelligent ones). Sun have chosen to cache not
just selected *entries*, but the most frequently used *tables* in their
entirety.  A typical Sun VM cache has 8 slots, and each slot contains
the page table for a process.

When there are more than 8 active processes, a context switch may cause
a page table to be written back to memory, and another to be loaded into
that slot. This costs *a lot*. In practice, this limits severely the
size of a page table, and thus of the address space of a process, and
also puts a lower limit on the size of a page. Sun MMU technology works
well only for address spaces that are small and densely allocated,
not large and sparse as more modern programming technologies (notably
threading, but also memory mapped files...) would suggest.

The Sun 4 SPARC MMU instead does not cache entire page tables, but
contiguous subsets of these, called 'pmeg's. Each pmeg more or less maps
a region of the address space, such as text, data, bss, stack, or shared
segment. The idea of caching contiguous submaps is not as bad as that of
caching entire maps; in particular it allows the use of smaller pages,
like 4096 bytes, which is still fairly large (32-64 pages per pmeg), but
not too much so. The problem seems to be that SunOS does not share the
pmegs that map shared segments, so that even if a lot of processes are
executing the same executable image or mapping the same shared library
(both extremely likely events) a pmeg for each region will be consumed
by each process.

Unfortunately there are not that many pmeg cache slots around; in a
typical implementation there are about 128 pmeg cache slots, for a
total of 4096-8192 pages or 16-32 megabytes. This seems enough, until
you realize that each process consumes at least 4 pmegs and potentially
many more, as they are not shared. The total working set of the current
machine load may well be under 16-32 megabytes, but there will often not
be enough pmegs to cover it, because usually a large fraction of those
16-32 megabytes will be shared, thus requiring multiple pmegs.
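
The figures above can be checked with a few lines of arithmetic. The
constants are the ones quoted in the text (about 128 slots, 4096 byte
pages, 32-64 pages per pmeg, at least 4 pmegs per process); the
function names are hypothetical.

```python
PMEG_SLOTS = 128     # "about 128" cache slots, per the text
PAGE_BYTES = 4096    # page size quoted for the Sun 4


def mappable_megabytes(pages_per_pmeg):
    """Memory that can be mapped at once if every slot is in use."""
    return PMEG_SLOTS * pages_per_pmeg * PAGE_BYTES // (1024 * 1024)


def max_mapped_processes(pmegs_per_process=4):
    """Processes that fit in the pmeg cache when pmegs are not shared."""
    return PMEG_SLOTS // pmegs_per_process


print(mappable_megabytes(32), mappable_megabytes(64))
print(max_mapped_processes())      # at the minimum of 4 pmegs each
print(max_mapped_processes(10))    # a process with more regions
```

At the minimum of 4 pmegs each, the slots are gone after 32 processes,
however little memory those processes actually touch.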

Interestingly if SunOS runs out of pmegs it will steal them from an
existing process; it may well happen that a resident process has all its
pmegs stolen. If this happens, its pages will be marked unmapped AND
THEREFORE free and swappable, and it will often be swapped out even if
there is no memory shortage.

There is another interesting aspect of both the Sun 3 and Sun 4
MMU schemes; as far as I remember, the virtual memory cache slots are
managed with some approximation of LRU, i.e.  essentially LIFO, while
the scheduler dispatches processes with prioritized round robin, i.e.
essentially FIFO.  Unfortunately a FIFO access pattern to a LIFO cache
guarantees a cache miss on *every* access if the FIFO is longer than the
LIFO, such as if there are more than 8 active processes. This guarantees
a collapse in throughput. Note that the load average may well be under
8, because the load average counts essentially CPU bound active
processes, while there may well be IO bound active processes.
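
The effect is easy to reproduce in a toy simulation. The sketch below
models the MMU as an exact LRU cache of 8 contexts and the scheduler as
plain round robin; both are simplifications (the real replacement
policy is only described above as "some approximation of LRU", and the
real scheduler is prioritized), and all names are hypothetical.

```python
from collections import OrderedDict


def lru_misses(accesses, capacity):
    """Count misses for a sequence of accesses to an exact LRU cache."""
    cache = OrderedDict()          # oldest entry first, newest last
    misses = 0
    for key in accesses:
        if key in cache:
            cache.move_to_end(key)         # hit: mark as most recent
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return misses


def round_robin(nprocs, rounds):
    """Context switch sequence of a plain round robin scheduler."""
    return [p for _ in range(rounds) for p in range(nprocs)]


SLOTS = 8
for nprocs in (8, 9):
    seq = round_robin(nprocs, rounds=100)
    print(nprocs, "processes:", lru_misses(seq, SLOTS), "of", len(seq))
```

With 8 processes only the 8 initial loads miss; with 9, every one of the
900 context switches misses, because the slot evicted each time is
exactly the one the scheduler will want next.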

		THE SunOS 4 OPTIMIZED ACCESS TO FILES

This brings us to another subject. With SunOS 4 files are memory mapped,
i.e. file access is integrated with virtual memory. Virtual memory is
usually managed with some (often poor...)  approximation of LRU, because
virtual memory accesses tend to be clustered in time and space in some
way.

Access to files by contrast is often sequential; in particular,
especially under the BSD filesystem, sequential access is very much
favoured, so Unix applications tend to use sequential access even when
other file structures and access patterns could be used (copying a file
in its entirety is often preferred to updating it in place).

In particular, the original V7 filesystem did provide read ahead and
write behind, and the large block sizes introduced by BSD essentially
provide more of the same.

Unfortunately a FIFO (sequential file) access pattern tends to go
against the grain of a LIFO (LRU approximating) virtual memory policy.
In particular, when reading a file sequentially, the most recently
accessed block is the one least likely to be used again in the near
future, while the virtual memory subsystem assumes exactly the opposite.
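
A toy model of the damage, again assuming exact LRU rather than
whatever approximation a real pager uses: a small "hot" set of pages is
resident, a file is then read once sequentially through the same
memory, and the hot pages are touched again. Memory size, hot set size
and file length are hypothetical numbers.

```python
from collections import OrderedDict


class LRU:
    """Exact LRU page pool of fixed capacity (a simplification)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def touch(self, page):
        """Reference a page; return True on a hit, False on a fault."""
        hit = page in self.pages
        if hit:
            self.pages.move_to_end(page)
        else:
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict the oldest page
            self.pages[page] = True
        return hit


def hot_hits_after_scan(capacity, hot, scan_len):
    """Hits on the hot pages after one sequential pass over a file."""
    mem = LRU(capacity)
    for p in range(hot):
        mem.touch(("hot", p))
    for b in range(scan_len):
        mem.touch(("file", b))     # each file page is used exactly once
    return sum(mem.touch(("hot", p)) for p in range(hot))


print(hot_hits_after_scan(capacity=64, hot=16, scan_len=32))
print(hot_hits_after_scan(capacity=64, hot=16, scan_len=64))
```

A short file leaves the hot pages alone; a file as long as memory
pushes out all of them in favour of pages that will never be read
again.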

BSD and SunOS, by the way, do provide a system call to advise the paging
subsystem that pages recently referenced will not be reused shortly, but
major applications don't use it at all (e.g.  not 'cp'). For example,
the 'stdio' library could profitably use it on all those files that are
not opened for read/write, and in particular those that are opened for
write or append only, as available statistics show that this happens
frequently and such a file is usually accessed strictly sequentially.
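
For illustration only, here is what such advice looks like with a
present-day interface. The text does not name the call, so this sketch
assumes the madvise() style hint MADV_SEQUENTIAL as exposed by Python's
mmap module (Python 3.8 or later, and only on platforms that define the
constant); it is a stand-in, not the SunOS interface. The function name
is hypothetical.

```python
import mmap
import os
import tempfile


def sequential_read(path):
    """Read a non-empty file once, front to back, through a memory
    mapping, telling the pager (where possible) that access is
    sequential. Returns the number of bytes read."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # The hint is optional: skip it where it is unavailable.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            total = 0
            for off in range(0, len(m), mmap.PAGESIZE):
                total += len(m[off:off + mmap.PAGESIZE])
            return total


# Small self-contained demonstration on a temporary file.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"x" * 100000)
os.close(fd)
print(sequential_read(demo_path))
os.remove(demo_path)
```

The hint changes only how the pager treats the pages, never what is
read, so a library could issue it on sequentially accessed files
without affecting callers.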

There used to be a system call to suppress not just virtual memory keep
behind, but also to ask for fault-in ahead, i.e. to knowingly circumvent
the 'on-demand' principle of virtual memory management.  It has
apparently disappeared, replaced by ever larger pages, which give some
form of read ahead, but are a big lose for small files and random access
(as Thompson & Ritchie observed long ago).
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk