Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!samsung!uunet!mcsun!ukc!edcastle!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Sun bogosities, including MMU thrashing
Message-ID:
Date: 18 Jan 91 14:26:16 GMT
References: <1991Jan10.214122.9506@news.arc.nasa.gov> <5257@auspex.auspex.com> <3956@skye.ed.ac.uk>
Sender: cho@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 277
Nntp-Posting-Host: teachk
In-reply-to: richard@aiai.ed.ac.uk's message of 16 Jan 91 19:28:27 GMT
X-Was-Subject: Re: ~8-job "knee" in response curves on Suns

On 16 Jan 91 19:28:27 GMT, richard@aiai.ed.ac.uk (Richard Tobin) said:

richard> I've several times heard about the "knee" in the performance of
richard> Suns as the number of concurrent processes increases, so I wrote
richard> a simple benchmark. [ ... ] Here are some results. [ ... ]

The reason for the appallingly bad results is simply the appallingly bad design of virtually *all* the MMUs and schedulers of SunOS since the very beginning. Note that I said *design*, not implementation -- the Sun people seem able to do fairly clever hardware, after having come up with appallingly bad architecture and software (hackers!).

I have been observing catastrophic design mistakes in Sun (and DEC, ...) hardware or software for a while; since the machines I am most familiar with are Suns (and 386s), here is a summary I wrote long ago of their "problems", including the MMU ones.

Briefly, the MMU problem is that for some unfathomable reason (I know which one, by the way, but I don't agree it is defensible) Sun decided originally to cache entire page tables in the MMU instead of page table *entries*, and then to match an LRU page table replacement policy with a FIFO scheduler, thus *guaranteeing* thrashing as soon as the number of active page tables (note: the *total* number, not the *most frequently active* ones) exceeds the MMU capacity.
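The LRU-versus-round-robin claim is easy to check with a toy model. The following Python sketch is only an illustration under stated assumptions: it treats the MMU as a set of `num_slots` whole-page-table slots with strict LRU replacement, and the scheduler as a plain round robin over `num_processes` processes. The function name and parameters are made up for this example, and real hardware only approximates LRU.

```python
from collections import OrderedDict

def count_misses(num_slots, num_processes, rounds):
    """Count slot misses for round-robin dispatch over LRU-managed slots.

    Hypothetical model: one slot holds one process's whole page table.
    """
    slots = OrderedDict()  # keys are process ids, least recently used first
    misses = 0
    for _ in range(rounds):
        for pid in range(num_processes):  # round-robin (FIFO) dispatch order
            if pid in slots:
                slots.move_to_end(pid)  # hit: now the most recently used
            else:
                misses += 1
                if len(slots) >= num_slots:
                    slots.popitem(last=False)  # evict the least recently used
                slots[pid] = True
    return misses

for n in (8, 9):
    print(n, "processes:", count_misses(8, n, 100), "misses in", n * 100, "dispatches")
```

With 8 slots and 8 processes only the first round misses (8 misses in 800 dispatches); add a ninth process and every one of the 900 dispatches misses, because the process about to run is always the one whose table was evicted longest ago.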
Here are my notes, in a crescendo of astonishing bogosity, so that the MMU notes are near the end. I also wrote some notes on the wonders of NFS and of using Ethernet as an IO bus and Ethernet adapters as bus access chips; this will have to wait a bit for a posting, after the flames for this posting have sizzled me :-).

========================================================================

I have decided to tell you some funny myths and other tales about Sun systems. As such, some of them are quite subtle; by no means are they limited to Sun systems, or to BSD based systems. I could tell you equally incredible stories about System V (what? it still does expansion swaps?) or DEC (what? the *pager* has a memory leak?).

Most of this discussion is historical; the latest SunOS release, 4.1.1, and the various Sun 4 models may have entirely different bogosities. Somehow I don't feel inclined to believe that Sun will ever cease to amaze me with ever new and incredible ones.

Beware: some of the material herein is speculative, but I hope at least engagingly credible :-).

BACKGROUND ON FILESYSTEMS

The crew that did 4BSD observed that the performance of the V7 Unix filesystem is poor when demanding applications are run on modern machines and discs. The problems are poor locality of i-nodes, poor locality of disc blocks, poor layout of disc blocks, and insufficient clustering of transfers.

A poor way of addressing some of these problems is first to double the block size from one 512 byte sector to two (like in 2BSD), which is however not bad (even if criticized by the Unix authors with compelling arguments), and then to raise the block size to 16 sectors, like in SunOS. Surely this does improve clustering of blocks (you force 16 sectors of a file at a time to be contiguous), and it also improves the clustering of transfers (when you read the first sector of a block you also get the next 15).
It is very stupid because it is a poor approximation to the right answer, which is to do *dynamic* and static clustering of runs of blocks (actually the BSD guys did implement very complicated forms of static clustering). It also has some big disadvantages, hinted at by Thompson & Ritchie in the V7 papers (they advised against doubling the block size from 1 to 2 sectors with terse and cogent reasoning).

The most obvious of these disadvantages is that Unix files on average are quite a bit smaller than 8KB, and that having a uniform 8KB block size would waste at least 50-60% of the disc space on a typical Unix filesystem in internal fragmentation.

The BSD people therefore decided, complicating matters greatly, and in the kernel, where things should be as simple as possible, to have various block sizes, from 1KB to 8KB. Block sizes from 1KB to 7KB were called "fragments". A file would be made up of a number (usually zero!) of 8KB blocks and then a single tail fragment.

This greatly complicates life, especially as the kernel, up to SunOS 3, has a cache of blocks. Having variable block sizes means having variable buffer sizes (which complicates life quite a bit, as you must then put the buffer cache in virtual memory). Each buffer is described by a buffer header, which now contains not only its address but also the block size.

The number of buffer headers and the number of KBytes allocated to the buffer pool are fixed at boot time. One should therefore have a ratio of buffer headers to buffer cache size that approximates the average expected size of a block.

Now, you recall from the above discussion that fragments were introduced precisely because it was expected that a *large* proportion of files would be smaller than a full block.
It can therefore be expected that a *large* proportion of buffers would contain an entire file quite a bit smaller than a full block; I would even venture to suggest, in the absence of hard data, that the average buffer size is probably around 2KB. In particular, directories are frequently accessed, and they had better be small. The only frequently accessed files of significant size are probably going to be executables, and BSD executable fetching bypasses the buffer cache, thanks to demand paging from the filesystem itself.

THE SunOS 3 BUFFER CACHE

Well, the default for SunOS 3 is to allocate a buffer header for every 8KB of buffer cache. This means that the assumption is made that all buffers will contain full blocks, or in other words, that internal fragmentation should be avoided on discs but not in memory.

In practice, it is my expectation that the typical SunOS 3 buffer cache will be about 70-80% unused, because the supply of buffer headers will be exhausted well before that of buffer cache space.

Notice that not only do all available statistics seem to point out that most files are well under 8KB, but also a buffer header is under 100 bytes, while a buffer cache slot is 1KB long; I don't know about you, but I would more easily overallocate and potentially waste 100 byte resources than 1000 byte ones...

Notice that the *only* way to override this ridiculous default is to *patch the kernel binary*, an operation that most think is high wizardry and would not even dare consider.

Actually I strongly suspect that even this will not work. The ability to have variable length buffers depends on the ability to map small pages; since most Suns have an 8KB VM page, fragment buffers are probably not supported at all, and therefore raising the number of buffer headers will be pretty useless. This of course is not very nice in itself...
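Going back to the estimate that the cache sits 70-80% unused: it follows from simple arithmetic once you accept the 2KB average buffer size guessed at above. Here is a minimal Python sketch of that arithmetic; the 400KB cache size is a made-up figure for illustration, the 2KB average is the guess from these notes rather than a measurement, and the function name is hypothetical.

```python
def cache_utilization(cache_kb, kb_per_header, avg_buffer_kb):
    """Fraction of the buffer cache that can actually hold data.

    Each buffer needs one header, so once the headers run out the
    remaining cache space cannot be used.
    """
    headers = cache_kb // kb_per_header  # headers are fixed at boot time
    used_kb = min(cache_kb, headers * avg_buffer_kb)
    return used_kb / cache_kb

# One header per 8KB (the SunOS 3 default described above), 2KB average buffer.
default = cache_utilization(cache_kb=400, kb_per_header=8, avg_buffer_kb=2)
# One header per 2KB, i.e. a ratio matched to the assumed average buffer size.
matched = cache_utilization(cache_kb=400, kb_per_header=2, avg_buffer_kb=2)
print("default ratio:", default, "matched ratio:", matched)
```

Under these assumptions the default ratio leaves a quarter of the cache usable, i.e. 75% unused, which is where the 70-80% figure comes from. The matched ratio only helps, of course, if fragment-sized buffers can be mapped at all.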
Notice also that Unix is often heavvvvily IO bound, and a large buffer cache is tremendously effective, especially if the users tend to edit and compile repeatedly a number of small files, as they typically do.

Notice also that as another default only 10% of memory is reserved for the buffer cache, which is, for your typical timesharing or program development usage pattern, way too small.

It is no wonder that Sun recommends adding more memory to solve performance problems...

THE SunOS 4 BUFFERING SCHEME

SunOS 4 no longer uses the buffer cache. All files are accessed via the virtual memory technology. This should help tremendously in reducing system overhead, as for most files after the initial 'open()' no system call is required until the 'close()'.

It also means that file pages compete with process pages for main memory, as there is no longer a buffer cache, and all of free memory is dedicated to the most (supposedly) recently used pages, be they process local or memory mapped files.

For some reason there is still an 'nbuf' variable in the kernel to set the number of buffer headers. It would be interesting to know what role buffer headers have in the new architecture. I have this suspicion that 'nbuf' is just a *limit*.

Unfortunately, the Sun virtual memory technology has an 8KB page size. Sun justifies this *extravagantly large* page size with the idea that nowadays memory is cheaper.

This means that under SunOS 4 you effectively no longer have variable sized buffers to counter the problem of internal fragmentation. All of central memory uses only 8KB pages as the allocation unit.

We can thus contemplate the ridiculous situation that internal fragmentation is a concern for disc space wastage, but not for memory space. Something tells me that as soon as Sun looks again at the problem, they will remove the code that handles short blocks (fragments) from the filesystem as well, because now disc storage is cheap as well :->.
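To put rough numbers on what an 8KB allocation unit costs when most cached files are small, here is a small Python sketch. The file sizes are invented for illustration (they are not measurements from any real system), and the function name is hypothetical; the only point is to compare the same set of files under an 8KB unit and a 1KB unit.

```python
def wasted_fraction(file_sizes, unit):
    """Fraction of allocated memory lost to internal fragmentation
    when every file is rounded up to a whole number of units."""
    allocated = sum(-(-size // unit) * unit for size in file_sizes)  # ceiling division
    used = sum(file_sizes)
    return (allocated - used) / allocated

# Hypothetical mix of mostly small files, sizes in bytes.
hypothetical_sizes = [300, 1200, 2500, 700, 9000, 4000]

for unit in (8192, 1024):
    print(unit, "byte unit:", round(wasted_fraction(hypothetical_sizes, unit), 3))
```

For this made-up mix the 8KB unit wastes several times as much memory as the 1KB unit; the exact ratio depends entirely on the size distribution you assume.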
If memory is cheaper, it would be my naive idea to use it to *reduce* internal fragmentation, by having a small page size; a small page size implies a larger page table, but memory is supposedly cheap, isn't it?

THE Sun VIRTUAL MEMORY CACHE

Actually Sun have made another wonderful design decision. Most virtual memory technologies support multiple page tables, and cache entries (supposedly the most frequently used ones) from such tables (mostly in inane ways, some in more intelligent ones).

Sun have chosen to cache not just selected *entries*, but the most frequently used *tables* in their entirety. A typical Sun VM cache has 8 slots, and each slot contains the page table for a process. When there are more than 8 active processes, a context switch may cause a page table to be written back to memory, and another to be loaded into that slot. This costs *a lot*.

In practice, this severely limits the size of a page table, and thus of the address space of a process, and also puts a lower limit on the size of a page. Sun MMU technology works well only for address spaces that are small and densely allocated, not large and sparse as more modern programming technologies (notably threading, but also memory mapped files...) would suggest.

The Sun 4 SPARC MMU instead does not cache entire page tables, but contiguous subsets of these, called 'pmeg's. Each pmeg more or less maps a region of the address space, such as text, data, bss, stack, or shared segment.

The idea of caching contiguous submaps is not as bad as that of caching entire maps; in particular it allows the use of smaller pages, like 4096 bytes, which is still fairly large (32-64 pages per pmeg), but not too much so.

The problem seems to be that SunOS does not share the pmegs that map shared segments, so that even if a lot of processes are executing the same executable image or mapping the same shared library (both extremely likely events) a pmeg for each region will be consumed by each process.
Unfortunately there are not that many pmeg cache slots around; in a typical implementation there are about 128 pmeg cache slots, for a total of 4096-8192 pages or 16-32 megabytes.

This seems enough, until you realize that each process consumes at least 4 pmegs and potentially many more, as they are not shared. The total working set of the current machine load may well be under 16-32 megabytes, but there will often not be enough pmegs to cover it, because usually a large fraction of those 16-32 megabytes will be shared, thus requiring multiple pmegs.

Interestingly, if SunOS runs out of pmegs it will steal them from an existing process; it may well happen that a resident process has all its pmegs stolen. If this happens, its pages will be marked unmapped AND THEREFORE free and swappable, and it will often be swapped even if there is no memory shortage.

There is another interesting aspect of both the Sun 3 and Sun 4 MMU schemes; as far as I remember, the virtual memory cache slots are managed with some approximation of LRU, i.e. essentially LIFO, while the scheduler dispatches processes with prioritized round robin, i.e. essentially FIFO.

Unfortunately a FIFO access pattern to a LIFO cache guarantees a cache miss on *every* access if the FIFO is longer than the LIFO, such as if there are more than 8 active processes. This guarantees a collapse in throughput. Note that the load average may well be under 8, because the load average counts essentially CPU bound active processes, while there may well be IO bound active processes.

THE SunOS 4 OPTIMIZED ACCESS TO FILES

This brings us to another subject. With SunOS 4 files are memory mapped, i.e. file access is integrated with virtual memory. Virtual memory is usually managed with some (often poor...) approximation of LRU, because virtual memory accesses tend to be clustered in time and space in some way.
Access to files by contrast is often sequential; in particular, especially under the BSD filesystem, sequential access is very much favoured, so Unix applications tend to use sequential access even when other file structures and access patterns could be used (copying a file in its entirety is often preferred to updating it in place). In particular, the original V7 filesystem did provide read ahead and write behind, and the large block sizes introduced by BSD essentially provide more of the same.

Unfortunately a FIFO (sequential file) access pattern tends to go against the grain of a LIFO (LRU approximating) virtual memory policy. In particular, when reading a file sequentially, the most recently accessed block is the one least likely to be used again in the near future, while the virtual memory subsystem assumes exactly the opposite.

BSD and SunOS, by the way, do provide a system call to advise the paging subsystem that pages recently referenced will not be reused shortly, but major applications don't use it at all (e.g. not 'cp'). For example, the 'stdio' library could profitably use it on all those files that are not opened for read/write, and in particular those that are opened for write or append only, as available statistics show that this happens frequently and such a file is usually accessed strictly sequentially.

There used to be a system call not just to suppress virtual memory keep behind, but also to ask for fault-in ahead, i.e. to knowingly circumvent the 'on-demand' principle of virtual memory management. It has apparently disappeared, replaced by ever larger pages, which give some form of read ahead, but are a big lose for small files and random access (as Thompson & Ritchie observed long ago).

--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk