Path: utzoo!news-server.csri.toronto.edu!cs.utexas.edu!uunet!mcsun!ukc!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi)
Newsgroups: comp.arch
Subject: Re: Translating 64-bit addresses
Message-ID: <PCG.91Mar9205121@aberdb.cs.aber.ac.uk>
Date: 9 Mar 91 20:51:21 GMT
References: <6590@hplabsz.HP.COM> <12030@pt.cs.cmu.edu> <6626@hplabsz.HP.COM>
	<PCG.91Feb28183457@odin.cs.aber.ac.uk> <Z.U93.7@xds13.ferranti.com>
	<PCG.91Mar5193541@aberdb.cs.aber.ac.uk> <MDY9ETC@xds13.ferranti.com>
Sender: pcg@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 168
Nntp-Posting-Host: aberdb
In-reply-to: peter@ficc.ferranti.com's message of 7 Mar 91 13:31:29 GMT

On 7 Mar 91 13:31:29 GMT, peter@ficc.ferranti.com (Peter da Silva) said:

peter> In article <PCG.91Mar5193541@aberdb.cs.aber.ac.uk>
peter> pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) writes:

pcg> I would however still maintain that even with conventional multiple
pcg> address space architectures shared memory is not necessary, as
pcg> sending segments back and forth (remapping) gives much the same
pcg> bandwidth.

peter> I don't think you can really make a good case for this.

Tell that to the people who ported Mach and 4.3BSD to the PC/RT... :-)

My estimate is that remapping can be done on demand (lazy remapping),
not on every context switch, and it does not cost more than a page fault
(which is admittedly an extremely expensive operation, but one about
which people don't complain). Also in many cases lazy remapping costs
nothing, given the vagaries of scheduling. Suppose a segment is shared
between processes 1 and 3; if 1 is deactivated, 2 is activated, and 1
is reactivated, no remapping need take place, because 3 has not
accessed it in this sequence.

If 3 had accessed it, the OS would have taken a fault on the attempted
access, found out that the segment was mapped to process 1, unmapped it
from 1, and remapped it onto 3. I correct myself: far less expensive
than a page fault. Probably also less expensive than a process
reschedule, even in a properly designed kernel. You lose a lot only if
you have a very large number of shared segments, which are shared among
a lot but not all processes, and which are all being accessed in every
time slice given to each process that shares them. A very, very, very
unlikely scenario, and one in which after all the cost is proportional
to use, not worse.
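
A minimal sketch of the lazy remapping argument above, as a toy simulation
rather than kernel code. The names LazySegment and run are hypothetical, and
the model assumes that a process attached to the segment touches it in every
time slice it gets, and that the very first mapping also counts as a fault.

```python
class LazySegment:
    """A nominally shared segment that is mapped into one process at a time."""

    def __init__(self, sharers):
        self.sharers = set(sharers)   # processes that have the segment attached
        self.mapped_to = None         # process that currently holds the mapping
        self.remap_faults = 0

    def access(self, pid):
        if pid not in self.sharers:
            raise PermissionError(f"process {pid} is not attached")
        if self.mapped_to != pid:
            # Remap fault: unmap from the previous holder, map onto pid.
            self.remap_faults += 1
            self.mapped_to = pid


def run(schedule, sharers=(1, 3)):
    """Give one time slice per schedule entry; sharers touch the segment."""
    seg = LazySegment(sharers)
    for pid in schedule:
        if pid in seg.sharers:
            seg.access(pid)
    return seg.remap_faults
```

With the schedule 1, 2, 1 the only fault is the initial mapping onto process
1; with 1, 3, 1 every slice faults, which is the "cost proportional to use"
case.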

Incidentally, avoiding the scenario above is why I think that sharing
single pages as opposed to sharing segments is a bad idea: if each
process sharing a segment of address space touches more than one page in it,
a remap fault occurs on each page. I think (and some statistics seem to
support my hunch) that this multiple page access in the same shared
segment is a far more frequent phenomenon than multiple shared segment
access.
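
The page versus segment point can be put in numbers with a small counting
sketch. The function remap_faults and its parameters are hypothetical, and
the model assumes every process in the schedule shares the segment and
touches the same number of its pages in each time slice.

```python
def remap_faults(schedule, pages_touched, per_page):
    """Count remap faults for one shared segment.

    schedule      -- sequence of process ids, one per time slice
    pages_touched -- pages of the segment each process touches per slice
    per_page      -- True: mapping tracked per page; False: per segment
    """
    units = range(pages_touched) if per_page else [0]
    owner = {}      # unit -> process the unit is currently mapped to
    faults = 0
    for pid in schedule:
        for unit in units:
            if owner.get(unit) != pid:
                # The unit is mapped elsewhere (or nowhere): remap it.
                faults += 1
                owner[unit] = pid
    return faults
```

Under these assumptions, two processes alternating over four slices and
touching four pages each take one fault per slice with segment granularity,
and one fault per page per slice with page granularity.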

peter> Consider the 80286, where pretty much all memory access for large
peter> programs is done by remapping segments.

Are you sure? I think that in all OSes that run on the 286, except maybe
for iRMX/286, segments are not unmapped and remapped, but stay always
mapped, and can be and are shared.

peter> Loading a segment register is an expensive operation,

Around 20-30 cycles if memory serves me right. Compared to a context
switch it is insignificant. And in any case the 286 MMU does support
shared segments directly, so there is no need to do segment remapping to
simulate shared memory.

This said, your comments about the 286 MMU are irrelevant to a
discussion on the acceptability of the cost of simulating shared segments
by remapping them, on demand or at a context switch, in each process that
has them nominally attached. This discussion is important only when
comparing reverse map MMUs with straight map MMUs, and only when the
reverse map MMU does not support (unlike mine) shared segments, and only
when shared segments are deemed useful.

Yet in your discussions of the 286 MMU there are some common fallacies
and myths, and they merit some comment.

Note first that it is only because of a design misconception (not
quite a mistake) of the 286 designers that loading a segment
register is so expensive. The problem is that the shadow segment
registers are not like TLBs, in that they are reloaded every time, even
if the shadowed segment register *value* has not changed.

This could have easily been avoided by simply comparing the old and new
segment register values. It was not, only because conceivably the
segment descriptors could have been altered even if the value of
the segment register had not in fact changed, and the 286 has no distinct
"flush shadow segment registers" instruction.

I guess that the designers assumed that in their "Pascal"/"Algol" model
of process execution each segment register was dedicated to a specific
function (code, stack, global, var parameters) and was not supposed to
be reloaded often, so there was no need to treat the shadows as caches.
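
A minimal sketch of what treating the shadow registers as a cache would
mean: compare the old and new selector values and reload only on a change,
with an explicit flush for the case where a descriptor was altered behind
the register's back. This is a toy cost model, not the 286's actual
behaviour; RELOAD_COST takes the middle of the rough 20-30 cycle estimate
given earlier, COMPARE_COST is an assumption, and flush() stands in for an
instruction the 286 does not have.

```python
RELOAD_COST = 25    # assumed cycles to reload a shadow register
COMPARE_COST = 1    # assumed cycles to compare old and new selector values


class SegmentRegister:
    def __init__(self, cached):
        self.cached = cached    # True: skip the reload if the value is unchanged
        self.selector = None
        self.cycles = 0

    def load(self, selector):
        if self.cached:
            self.cycles += COMPARE_COST
            if selector == self.selector:
                return          # shadow copy is assumed to be still valid
        self.selector = selector
        self.cycles += RELOAD_COST

    def flush(self):
        """Hypothetical 'flush shadow register': forces the next load to reload."""
        self.selector = None


def cost(selectors, cached):
    """Total cycles spent loading the given sequence of selector values."""
    reg = SegmentRegister(cached)
    for s in selectors:
        reg.load(s)
    return reg.cycles
```

In this model the saving comes entirely from repeated loads of the same
value; a program that alternates between two selectors on every load gains
nothing and pays the compare on top.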

peter> and is to a large extent the cause of the abysmal behaviour of
peter> large programs on that architecture. For an extreme case, the
peter> sieve slows down by a factor of 11 once the array size gets over
peter> 64K.

This is probably only because the HUGE model gets used, which implies
funny code to simulate 32 bit address arithmetic (the HUGE model is so
expensive because of the mistake of putting the ring number in the middle
of a pointer instead of in the most significant bits). On less extreme
examples, or if you code the sieve for the LARGE model, the slowdown is
around 20-50%, even for extremely pointer intensive operations.
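
To make the pointer format problem concrete: a 286 selector holds a 13 bit
descriptor index in its top bits, then a table indicator bit, then the 2 bit
privilege level, so in a 32 bit selector:offset pointer the ring number sits
between the index and the offset. A carry out of the offset therefore cannot
simply ripple upwards. The sketch below assumes (this is an assumption about
how a HUGE model object is laid out, not something stated above) that the
object is tiled over consecutive descriptors, so the next 64K chunk is
reached by adding 8 to the selector; huge_add and flat_add are hypothetical
names.

```python
SELECTOR_STEP = 8   # index field starts at bit 3; bits 0-2 hold TI and RPL


def huge_add(selector, offset, delta):
    """Add a byte delta to a selector:offset pointer, HUGE-model style.

    Assumes consecutive descriptors tile the object in 64K chunks.
    """
    carry, new_offset = divmod(offset + delta, 0x10000)
    return selector + carry * SELECTOR_STEP, new_offset


def flat_add(pointer, delta):
    """What a plain 32 bit add does to the same pointer (selector in high word)."""
    pointer = (pointer + delta) & 0xFFFFFFFF
    return pointer >> 16, pointer & 0xFFFF
```

Starting from selector 0x0047 (index 8, TI 1, ring 3) and offset 0xFFF0,
adding 0x20 with huge_add moves to selector 0x004F with the low three bits
intact, while the plain add yields 0x0048, which silently changes the table
and ring bits. The extra split-and-scale step on every pointer operation is
the "funny code".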

Your figure of 11 is plainly ridiculous and warped by the machinations
of the HUGE model; after all, 32 bit pointer dereferences are only about
3 times slower than 16 bit pointer dereferences, so even a program
that consisted *only* of them would be only 3 times slower.
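
The bound can be written down as simple arithmetic. In the sketch below,
deref_fraction is the share of the baseline (16 bit pointer) running time
spent in pointer dereferences and deref_ratio is how much slower each one
becomes; both names are hypothetical, and the model assumes nothing else in
the program changes.

```python
def slowdown(deref_fraction, deref_ratio=3.0):
    """Overall slowdown when only pointer dereferences get slower.

    deref_fraction -- share of baseline time spent dereferencing (0.0 to 1.0)
    deref_ratio    -- cost of a 32 bit dereference relative to a 16 bit one
    """
    if not 0.0 <= deref_fraction <= 1.0:
        raise ValueError("deref_fraction must be between 0 and 1")
    # Time not spent dereferencing is unchanged; the rest is scaled by the ratio.
    return (1.0 - deref_fraction) + deref_fraction * deref_ratio
```

With a ratio of 3 the result can never exceed 3, whatever the fraction, and
under this model a slowdown of about 30% corresponds to roughly 15% of the
baseline time going into dereferences.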

Note again that this point about 32 bit pointer arithmetic on a 286 has
*nothing* to do with the cost of simulating shared memory by remapping
when the MMU does not support it directly.

peter> My own experience with real codes under Xenix 286 bears this out.

Maybe. *My* experience of recompiling large numbers of Unix nonfloat
utilities on a 286 tells me that the average slowdown is around 30%. A
10 MHz 286 is about the equivalent of a PDP-11/73 (1 "MIPS") in
the small model or of a VAX-11/750 (0.7 "MIPS") in the large model, to
all practical (nonfloat Unix applications :->) purposes.

peter> Think of the 80286 as an extreme case of what you're proposing.

I seem to have completely failed to explain myself. The 286 is
*irrelevant* to a discussion on shared memory simulation by implicit OS
supported or explicit application requested segment remapping (which I
prefer).

peter> I think it's clear from this experience that frequent reloading
peter> of segment registers is a bad idea.

No, the conclusion is not supported by the 286 example; the 286 is
uniquely poor for reloading segment registers because it does not treat
shadow segment registers as a cache and because its pointers have an
unfortunate format.

Properly designed MMUs with properly designed TLBs, even reverse map
ones, do segment remapping with small or insignificant cost, not worse
than the 286 MMU. Moreover the real overhead lies not in reloading some
lines in the MMU or the TLB; it is in taking the remap fault and in
searching the appropriate kernel structures to find which (nominally
shared) segment to map in that region.

peter> After your discussion of the inappropriate use of another
peter> technology, networks, I would have expected you'd know better.

I am sorry I got myself so badly misunderstood.

peter> As for single address space machines, my Amiga 1000's exceptional
peter> performance...  given the slow clock speed and dated CPU (7.14
peter> MHz 68000)... tends to suggest that avoiding MMU tricks might be
peter> a good idea here as well.

MMUs are a difficult subject. A lot of vendors have bungled their MMU
designs, the OS code that supports them, and the VM policies that drive
them.  Sun is just *one* of the baddies. That a lot of vendors take many
years to get their act together (if ever) on virtual memory does not
mean that it is a bad technology; it means that maybe it is too subtle
for mere Unix kernel hackers.

peter> The Sparcstation 2 is the first UNIX workstation I've seen with
peter> as good response time to user actions.  It's only a 27 MIPS
peter> machine... or approximately 40 times faster.

The MIPS-eating sun bogons strike again! :-)

The people that did Tripos (Martin Richards!) and Amiga (and those that
now maintain them at CBM) seem to be quite another story. I am another
Amiga fan :-).  Now, if only they could get their act together
commercially... (please redirect the ensuing flame war to the
appropriate newsgroup :->).
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk