Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!caip!clyde!burl!ulysses!mhuxr!mhuxt!houxm!ihnp4!inuxc!pur-ee!uiucdcs!ccvaxa!aglew
From: aglew@ccvaxa.UUCP
Newsgroups: net.arch
Subject: Re: VERY LARGE main memories: crypt
Message-ID: <5100131@ccvaxa>
Date: Sat, 13-Sep-86 18:36:00 EDT
Article-I.D.: ccvaxa.5100131
Posted: Sat Sep 13 18:36:00 1986
Date-Received: Sun, 14-Sep-86 20:11:44 EDT
References: <15505@ucbvax.BERKELEY.EDU>
Lines: 139
Nf-ID: #R:ucbvax.BERKELEY.EDU:15505:ccvaxa:5100131:000:6515
Nf-From: ccvaxa.UUCP!aglew    Sep 13 17:36:00 1986


...> Large memory making paging unnecessary.
...> Discussion of page replacement strategies.

...> OS designers being "too lazy" to develop good policies for scientific
...> code, vs. "it's easy" to predict data usage patterns.
The point is that OS mechanism are nearly always responsive, not predictive.
They observe a usage pattern, and assume that it is going to go on for a
while. LRU is a particularly bad example, for scientific codes, but it
does almost as badly for some AI type problems. Consider a large tree walk
- you want to use LRU for pages above you in the tree, but before throwing
any of those out you want to throw pages below you out that you've already
visited out first. Explicit code of the form

		visit(node)
		    prepare-for-prefetch(right,LRU)
		    visit(left)
		    throw-away(left)
		    visit(right)
		    throw-away(right)

is easy and obvious (though it gets more complicated if you try for left
prediction). And it accurately _predicts_, instead of responding.

If you've done control systems, you'll know that the ideal feedback mechanism
has negative delay, but in the real world you need integrators. Knowledge of
the algorithm is like knowing the correlation of your noise - it's not great,
but it's better than nothing.

A useful discussion might be, how can we extend vadvise() to be more useful?
A simple vadvise(CYCLIC) isn't satisfactory, since the operating system would
have to figure out the bounds of the cyclicaly accessed region. But maybe
vadvise(CYCLIC,a,b,d), where a-d is the bounds, and b indicates the dimension
of the wavefront. More primitive, a call I'm-going-to-need-this-page-soon()
and I'm-never-going-to-need-this-page-again() might be useful. For
abstraction, say memory region (indicated by bounds) instead of page.

Such things probably do not need to be system calls - they could be library
functions, although with a lot of system knowledge built in. As for the
actual page strategies used by hardware, they are likely to become simpler,
not more complex, since the devices you are paging from are getting faster.
Consider paging from a non-directly-addressible bank of `slow' CMOS memory
on an uninterruptible power supply...

...> What about code usage patterns?
I've seen automatic overlay generators on small systems - I even used one,
once, but not for long. It had the option of generating overlays completely
automatically. This was particularly easy to do for the top level of
classic Pascal programs:
	    PROGRAM foo;
		BEGIN
		    init();
		    stage1();
		    stage2();
		    stage3();
		    cleanup();
		 END.
where the overlay strategy for the top level is obvious if the stages don't
share to many routines - even if they do, call-graph decomposition isn't
too hard. And, as above, this was predictive, not responsive.
		
...> John Krueger: can we measure the win?
I'm sure my coworkers will correct me, but I've heard of as much as 50%
speed improvement using vmadvise() for non-scientific applications. More
precise control will probably do even better. Can anybody at NASA respond?

			    ---

I'd like to break for discussion for a bit. I'm fairly sure that memories
large enough to drastically reduce the importance of paging are imminent.
I am not so certain of how the page tables, the address translation
mechanism, will respond. How big can we make our TLBs? How quickly can we
handle TLB misses?

Certainly, tree structured page tables are unlikely to be any good. How about
discussion of inverted page tables, hashing, O(1) probabilistic translation
on TLB misses? Are the hash tables going to have to be so large that they
themselves will be paged? - something I do not like because it almost
definitely requires microcode.

As for TLBs, maybe we can squeeze more into them by comparing for groups of
pages instead of individual pages. I've proposed a scheme for power of two
sized page ranges that makes the range check into a strict AND/OR, carryless
operation - which isn't bad, except that it adds 2-3 gate delays to the most
critical path of the machine. I know the tradeoff curves, but I don't know
when we will cross them. Are there any other ideas out there?

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms


In article <7331@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes:
>It's not just that the operating system or hardware designers are too lazy
>to come up with a good scheme.  The problem is that any scheme they DO come
>up with must work for general cases.  That is, it can't take advantage of
>special knowledge of a specific algorithm....The individual applications
>programmer CAN take advantage of such knowledge.  To be sure, this is an
>expensive and difficult programming project...[but] there are some types of
>algorithm for which it is extremely easy to predict the data usage
>patterns.  Now, it might be convenient to implement virtual memory schemes
>which are useful in this context, but I doubt that the extra overhead in the
>memory interface would be justified - especially since the explicit methods
>for dealing with them are fairly easy to implement.

1) Doubtless you can show me cases where it's "extremely easy" to predict
data usage patterns.  Can you show me one where it's easy to predict the
code usage patterns?  I want VM to free me from overlays, or code space
management, not file structuring, or data space management.

>if you've just spent $10-$20 million on a fast machine, you aren't going
>to balk at a few million more in programmer man-hours to get the speed
>that you shelled out so much cash for.
2) Can we measure the win?  Can you provide figures on performance
improvements for either data or code space management by application
programmers over mechanisms provided by operating system or hardware
designers?  Have you any actual examples, how much was the improvement for a
specific application you're familiar with?  Can we state a general rule,
expected returns on applications programmers managing their own code and/or
data spaces?  Can you state the breakeven point, how many millions of
dollars I should be willing to spend before I improve on the operating
system's paging?

					-- jon
-- 
--> Jon Krueger
uucp: {seismo, allegra, decvax, cmcl2, topaz, harvard}!rochester!ur-tut!tuba
Phone: (716) 275-2811 work, 473-4124 home	BITNET: TUBA@UORDBV
USMAIL:  Taylor Hall, University of Rochester, Rochester NY  14627 
/* End of text from ccvaxa:net.arch */