Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP Path: utzoo!mnetor!seismo!caip!clyde!burl!ulysses!mhuxr!mhuxt!houxm!ihnp4!inuxc!pur-ee!uiucdcs!ccvaxa!aglew From: aglew@ccvaxa.UUCP Newsgroups: net.arch Subject: Re: VERY LARGE main memories: crypt Message-ID: <5100131@ccvaxa> Date: Sat, 13-Sep-86 18:36:00 EDT Article-I.D.: ccvaxa.5100131 Posted: Sat Sep 13 18:36:00 1986 Date-Received: Sun, 14-Sep-86 20:11:44 EDT References: <15505@ucbvax.BERKELEY.EDU> Lines: 139 Nf-ID: #R:ucbvax.BERKELEY.EDU:15505:ccvaxa:5100131:000:6515 Nf-From: ccvaxa.UUCP!aglew Sep 13 17:36:00 1986 ...> Large memory making paging unnecessary. ...> Discussion of page replacement strategies. ...> OS designers being "too lazy" to develop good policies for scientific ...> code, vs. "it's easy" to predict data usage patterns. The point is that OS mechanism are nearly always responsive, not predictive. They observe a usage pattern, and assume that it is going to go on for a while. LRU is a particularly bad example, for scientific codes, but it does almost as badly for some AI type problems. Consider a large tree walk - you want to use LRU for pages above you in the tree, but before throwing any of those out you want to throw pages below you out that you've already visited out first. Explicit code of the form visit(node) prepare-for-prefetch(right,LRU) visit(left) throw-away(left) visit(right) throw-away(right) is easy and obvious (though it gets more complicated if you try for left prediction). And it accurately _predicts_, instead of responding. If you've done control systems, you'll know that the ideal feedback mechanism has negative delay, but in the real world you need integrators. Knowledge of the algorithm is like knowing the correlation of your noise - it's not great, but it's better than nothing. A useful discussion might be, how can we extend vadvise() to be more useful? A simple vadvise(CYCLIC) isn't satisfactory, since the operating system would have to figure out the bounds of the cyclicaly accessed region. But maybe vadvise(CYCLIC,a,b,d), where a-d is the bounds, and b indicates the dimension of the wavefront. More primitive, a call I'm-going-to-need-this-page-soon() and I'm-never-going-to-need-this-page-again() might be useful. For abstraction, say memory region (indicated by bounds) instead of page. Such things probably do not need to be system calls - they could be library functions, although with a lot of system knowledge built in. As for the actual page strategies used by hardware, they are likely to become simpler, not more complex, since the devices you are paging from are getting faster. Consider paging from a non-directly-addressible bank of `slow' CMOS memory on an uninterruptible power supply... ...> What about code usage patterns? I've seen automatic overlay generators on small systems - I even used one, once, but not for long. It had the option of generating overlays completely automatically. This was particularly easy to do for the top level of classic Pascal programs: PROGRAM foo; BEGIN init(); stage1(); stage2(); stage3(); cleanup(); END. where the overlay strategy for the top level is obvious if the stages don't share to many routines - even if they do, call-graph decomposition isn't too hard. And, as above, this was predictive, not responsive. ...> John Krueger: can we measure the win? I'm sure my coworkers will correct me, but I've heard of as much as 50% speed improvement using vmadvise() for non-scientific applications. More precise control will probably do even better. Can anybody at NASA respond? --- I'd like to break for discussion for a bit. I'm fairly sure that memories large enough to drastically reduce the importance of paging are imminent. I am not so certain of how the page tables, the address translation mechanism, will respond. How big can we make our TLBs? How quickly can we handle TLB misses? Certainly, tree structured page tables are unlikely to be any good. How about discussion of inverted page tables, hashing, O(1) probabilistic translation on TLB misses? Are the hash tables going to have to be so large that they themselves will be paged? - something I do not like because it almost definitely requires microcode. As for TLBs, maybe we can squeeze more into them by comparing for groups of pages instead of individual pages. I've proposed a scheme for power of two sized page ranges that makes the range check into a strict AND/OR, carryless operation - which isn't bad, except that it adds 2-3 gate delays to the most critical path of the machine. I know the tradeoff curves, but I don't know when we will cross them. Are there any other ideas out there? Andy "Krazy" Glew. Gould CSD-Urbana. USEnet: ihnp4!uiucdcs!ccvaxa!aglew 1101 E. University, Urbana, IL 61801 ARPAnet: aglew@gswd-vms In article <7331@lanl.ARPA> jlg@a.UUCP (Jim Giles) writes: >It's not just that the operating system or hardware designers are too lazy >to come up with a good scheme. The problem is that any scheme they DO come >up with must work for general cases. That is, it can't take advantage of >special knowledge of a specific algorithm....The individual applications >programmer CAN take advantage of such knowledge. To be sure, this is an >expensive and difficult programming project...[but] there are some types of >algorithm for which it is extremely easy to predict the data usage >patterns. Now, it might be convenient to implement virtual memory schemes >which are useful in this context, but I doubt that the extra overhead in the >memory interface would be justified - especially since the explicit methods >for dealing with them are fairly easy to implement. 1) Doubtless you can show me cases where it's "extremely easy" to predict data usage patterns. Can you show me one where it's easy to predict the code usage patterns? I want VM to free me from overlays, or code space management, not file structuring, or data space management. >if you've just spent $10-$20 million on a fast machine, you aren't going >to balk at a few million more in programmer man-hours to get the speed >that you shelled out so much cash for. 2) Can we measure the win? Can you provide figures on performance improvements for either data or code space management by application programmers over mechanisms provided by operating system or hardware designers? Have you any actual examples, how much was the improvement for a specific application you're familiar with? Can we state a general rule, expected returns on applications programmers managing their own code and/or data spaces? Can you state the breakeven point, how many millions of dollars I should be willing to spend before I improve on the operating system's paging? -- jon -- --> Jon Krueger uucp: {seismo, allegra, decvax, cmcl2, topaz, harvard}!rochester!ur-tut!tuba Phone: (716) 275-2811 work, 473-4124 home BITNET: TUBA@UORDBV USMAIL: Taylor Hall, University of Rochester, Rochester NY 14627 /* End of text from ccvaxa:net.arch */