Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!mit-eddie!genrad!decvax!ucbvax!sdcsvax!darrell
From: darrell@sdcsvax.UUCP
Newsgroups: comp.os.research,mod.os
Subject: Life with TLB and no PT
Message-ID: <3027@sdcsvax.UCSD.EDU>
Date: Wed, 22-Apr-87 06:23:19 EST
Article-I.D.: sdcsvax.3027
Posted: Wed Apr 22 06:23:19 1987
Date-Received: Fri, 24-Apr-87 01:56:23 EST
Sender: darrell@sdcsvax.UCSD.EDU
Organization: U of Rochester, CS Dept., Rochester, NY
Lines: 91
Approved: mod-os@sdcsvax.uucp
Xref: utgpu comp.os.research:5 mod.os:134

Prodded by something Avie Tevanian of the MACH research group said,
I have been considering life with a translation lookaside buffer (TLB)
but without hardware page tables (PT).  This message is intended to
spark some discussion of a) what such a system would be like, and
b) how existing architectures and operating systems can be instrumented
to get some empirical data to guide TLB design and sizing.  Since it
involves tradeoffs between hardware and software, I cross-posted to
comp.os.research and comp.arch.
  
Given a good TLB design which doesn't require frequent flushing of
entries, it is possible to do without hardware PT entirely.  I don't
know of anybody working on a machine with TLB and no PT, but I wouldn't
be at all surprised to see one in the next few years.  The trick is to
increase the TLB hit rate to the point where you can afford to provide
translation information entirely through software.  If you do that, you
can keep any amount of "PT" information and implement all kinds of nice
things like shared segments of variable size, sparse address spaces, etc.
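
As a rough illustration of the idea, here is a minimal sketch of a
software-refilled TLB.  Everything in it is an assumption made for the
example: the names SoftTLB, refill and PageFault, the 512-byte page
size, the LRU replacement, and the use of a plain dictionary as the
software "PT".  The point is only that the miss handler may consult any
structure it likes, so a sparse address space costs nothing extra.

```python
from collections import OrderedDict

PAGE_SHIFT = 9  # assumed 512-byte pages, purely illustrative


class PageFault(Exception):
    """Raised when software has no translation for a virtual page."""


class SoftTLB:
    """Hypothetical fully associative TLB, refilled by software on a miss."""

    def __init__(self, capacity):
        self.capacity = capacity          # assumed to be at least 1
        self.entries = OrderedDict()      # vpn -> frame, oldest first (LRU)
        self.misses = 0

    @staticmethod
    def refill(vpn, mapping):
        # The "PT" is whatever software keeps; here a sparse dict.
        try:
            return mapping[vpn]
        except KeyError:
            raise PageFault(vpn)

    def translate(self, vaddr, mapping):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn in self.entries:
            self.entries.move_to_end(vpn)         # hit: refresh LRU position
        else:
            self.misses += 1                      # miss: software supplies it
            self.entries[vpn] = self.refill(vpn, mapping)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return (self.entries[vpn] << PAGE_SHIFT) | offset
```

A mapping such as {0: 7, 1000000: 3} describes two resident pages a
million pages apart without any intermediate table levels.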

Another benefit is reduced hardware/microcode complexity and therefore
reduced cost.  I imagine that a TLB of a given design can be made lots
bigger for less money than it would take to (re)implement hardware/microcode
address translation.  Large address spaces that require paging to support the
PT, like the VAX (two levels) or the 432 (is it really SEVEN levels?),
get translated in inexpensive, flexible software rather than expensive,
hard-to-change hardware/microcode.

Is a TLB/no-PT architecture at all feasible?  If you look at the VAX
TLB studies published in TOCS, it seems clear that the VAX TLB design
can reasonably be sized to reduce misses to just under one percent.
This is good, but not good enough.  The optimum would be to reduce
misses to just those addresses which are not in physical memory;  at
that point the software has to get into the act to process a page fault
anyway.  On the other hand, even if there's a TLB entry for every
physical page frame, if there's sharing between contexts, you still
don't get complete TLB coverage of the physically resident pages.
(This suggests segmented address spaces (MULTICS-style, not
80?86-style), and sharing segments rather than pages, but I will assume
linear address spaces.) So one question to answer is: how high a TLB
miss rate can be tolerated?  A better way to state it is:  how much
time can be devoted to address translation?  Then the tolerable TLB
miss rate is a function of the cost and complexity of doing an address
translation, which is a function of the operating system (virtual
machine) rather than the machine architecture (real machine).
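
The relationship can be sketched with a back-of-the-envelope model.  It
assumes that translation overhead per memory reference is simply the
miss rate times the cost of a software miss, and that we are willing to
spend a fixed fraction of the baseline time per reference on it.  The
function name and all parameters are hypothetical, and any numbers
plugged in are illustrative rather than measured.

```python
def tolerable_miss_rate(budget_fraction, cycles_per_reference, miss_cost_cycles):
    """Largest miss rate that keeps translation within the time budget.

    Assumed model: overhead per reference = miss_rate * miss_cost_cycles,
    and that overhead must not exceed
    budget_fraction * cycles_per_reference.
    """
    if miss_cost_cycles <= 0:
        raise ValueError("miss cost must be positive")
    return min(1.0, budget_fraction * cycles_per_reference / miss_cost_cycles)
```

Under this model, halving the cost of the software miss handler doubles
the miss rate that can be tolerated, which is why the answer depends on
the operating system as much as on the hardware.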

One thing that is clearly needed for a PT-less TLB system, or even a
PT-full TLB system with good performance, is a set of context tags
(process identifiers) so that TLB entries do not need to be flushed on
every context switch.  Some VAX TLB implementations distinguish system
and user contexts.  The memory management hardware in the Sun-2
architecture has 8 contexts, but no TLB.  There is a lot of room for
experimentation and improvement here.  How many tags should there
be?  Which contexts (processes) should have tags at a given moment?
The first question is architecture, the other is operating systems.
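
The effect of tags on context switching can be seen in a toy
trace-driven simulation.  This is only a sketch under stated
assumptions: a fully associative TLB with LRU replacement, a trace of
(context, virtual page) pairs, and, in the tagged case, enough tags
that every context keeps one.  The function name count_misses is
hypothetical.

```python
from collections import OrderedDict


def count_misses(trace, capacity, tagged):
    """Count TLB misses over a trace of (context, vpn) references.

    tagged=False models a TLB that must be flushed on every context
    switch; tagged=True models entries that carry a context tag.
    """
    tlb = OrderedDict()   # key -> True, oldest first (LRU)
    misses = 0
    current = None
    for context, vpn in trace:
        if context != current:
            if not tagged:
                tlb.clear()               # untagged: flush on switch
            current = context
        key = (context, vpn) if tagged else vpn
        if key in tlb:
            tlb.move_to_end(key)
        else:
            misses += 1
            tlb[key] = True
            if len(tlb) > capacity:
                tlb.popitem(last=False)   # evict least recently used
    return misses
```

With two contexts alternating over two pages each, the untagged TLB
misses on every reference after each switch, while the tagged TLB
misses only on the first touch of each page.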

Consider software for a moment.  One of the things everybody preaches
is to move utility services out of the machine kernel and into
"user-level" server processes.  One of the things nobody does is, yes,
just that.  The cost in context switching is too high for high
performance of system services.  From this observation, it is clear
that you want enough tags to handle, not just the processes ready to
run, but also all those servers which are in some queue waiting for
requests.  On the other hand, you can't afford unbounded TLB hardware.
Where is the point of diminishing returns?  Obviously the answer
depends on many factors, and even a detailed analysis of the factors
would be useful.

You also have the classic problem of allocating a resource (TLB
entries) among different users (contexts).  Extending the virtual
address to include the context tag amounts to statically allocating
each context a fixed share of the TLB.  I can't imagine that this
leads to effective use of the TLB for even 8 contexts.  On the other
hand, what are the costs of deciding which entries, from which
contexts, to discard?  Working set theory indicates that at some point
it is better to throw away entries from the current context, rather
than another context.
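
The allocation question can be explored with a small comparison of a
statically partitioned TLB against one shared pool.  The sketch assumes
LRU replacement in both cases and models the static scheme as equal
private partitions, one per context; a real tag-extended address would
split entries according to the set indexing.  All function names are
hypothetical.

```python
from collections import OrderedDict


def lru_misses(keys, capacity):
    """Misses for a fully associative LRU cache over a key sequence."""
    cache, misses = OrderedDict(), 0
    for key in keys:
        if key in cache:
            cache.move_to_end(key)
        else:
            misses += 1
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)
    return misses


def partitioned_misses(trace, capacity, contexts):
    """Each context owns a fixed, equal share of the entries."""
    share = capacity // len(contexts)
    return sum(
        lru_misses([vpn for ctx, vpn in trace if ctx == c], share)
        for c in contexts
    )


def shared_misses(trace, capacity):
    """All contexts compete for one pool, keyed by (context, vpn)."""
    return lru_misses(trace, capacity)
```

When one context cycles through more pages than its fixed share while
another needs only a few, the partitioned TLB thrashes even though the
combined working set would fit in the shared pool.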

Rather than speculate about just where the tradeoffs are, I would like
to spark some discussion of how existing systems could be instrumented
to discover empirically how many tags would pay off, much as systems
have been instrumented to measure how much a given size or
associativity of TLB would pay off.  It's not obvious that measurements
of, say, BSD UNIX, would really be applicable to a system with the same
services, but with IPC, file system, etc., stripped out of the kernel
into user processes.  Still, some empirical data is better than none.
I am interested in what events and situations to look for on a generic
modern OS, rather than details about what code to poke in FooOS.
That's boring without some principled idea of what to look for...
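
One candidate event to record is the context switch itself.  Given a
log of which context was dispatched at each switch, a sketch like the
following estimates how often a tag would have to be taken away from
one context and given to another, for each possible number of tags.  It
assumes tags are reassigned in LRU order, which is only one possible
policy, and the names tag_reassignments and sweep are hypothetical.

```python
from collections import OrderedDict


def tag_reassignments(switch_log, num_tags):
    """Count how often a tag must be stolen from another context.

    switch_log is the sequence of context ids in dispatch order.
    """
    tags = OrderedDict()   # context -> True, least recently run first
    reassigned = 0
    for ctx in switch_log:
        if ctx in tags:
            tags.move_to_end(ctx)        # context still holds a tag
            continue
        if len(tags) >= num_tags:
            tags.popitem(last=False)     # steal the least recently used tag
            reassigned += 1
        tags[ctx] = True
    return reassigned


def sweep(switch_log, max_tags):
    """Reassignment counts for 1..max_tags tags."""
    return {n: tag_reassignments(switch_log, n) for n in range(1, max_tags + 1)}
```

Plotting the sweep against the number of tags would show where extra
tags stop reducing reassignments, which is one way to look for the
point of diminishing returns.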

Suggestions?  Ideas?  Experiences???

Stu Friedberg  {seismo, allegra}!rochester!stuart  stuart@cs.rochester.edu