Path: utzoo!utgpu!attcan!uunet!lll-winken!ames!ncar!tank!uxc!uxc.cso.uiuc.edu!uxg.cso.uiuc.edu!uicbert.eecs.uic.edu!wilson From: wilson@uicbert.eecs.uic.edu Newsgroups: comp.lang.lisp Subject: Re: big lisp, locality, and multiproces Message-ID: <63200005@uicbert.eecs.uic.edu> Date: 6 Jan 89 00:07:00 GMT References: <63200004@uicbert.eecs.uic.edu> Lines: 190 Nf-ID: #R:uicbert.eecs.uic.edu:63200004:uicbert.eecs.uic.edu:63200005:000:10836 Nf-From: uicbert.eecs.uic.edu!wilson Jan 5 18:07:00 1989 Thanks to those who replied to the posting on locality and lisp and multiprocessors. I got more questions than answers, though, so I'll post some clarifications of what I'm talking about. This is long, but in skippable chunks about different topics. ----------------------------------------------------------------------- About the adaptive incremental gc to improve locality: Bob Courts' adaptive incremental gc is the one used on the TI Explorer, and it was described in the September 1988 CACM ("Improving Locality of Reference in a Garbage-Collecting Memory Management System," _CACM_ 31,9, pp 1128-1138.) This garbage collector is similar to Moon's Ephemeral Garbage Collector (for background, you might want to look at his paper in the 1984 Lisp and Functional Programming conference proceedings.) Both of these gc's are incremental in the basic manner of Baker's algorithm. (After a generation has been flipped, the program can resume while objects are still being transitively traversed and copied from fromspace to tospace. This could cause trouble if a program reached the old version of an object that had already been transported by a pointer to it that had not yet been updated. To avoid that, the program is never allowed to see a pointer into fromspace -- if the program grabs such a pointer from memory, it is transparently "fixed" before the program is allowed to use it. That is, the object in fromspace is examined; if it has not been moved to tospace yet, it is moved; either way, the program is handed a pointer to the new version of the object (in tospace) and allowed to resume.) The nice thing about such an system is that the running program causes fromspace objects that it references to be transported to tospace *in the order that it first touches them*. This reorganizes the objects with very good locality. The standard Baker algorithm doesn't derive this benefit because of another aspect of its design. All of these systems must have a "background" scavenger that traverses *all* of the live data that are not reached by the running program, to ensure that they too are copied to tospace. (You have to do this because the program may not touch such an object during an incremental gc, but do so afterwards.) The problem with the Baker algorithm is that it interleaves the data copied by the background scavenger with data that were copied by the running program. This interleaves inactive data with active data, degrading locality. The Explorer GC copies data reached by the running program into a different area than data reached first by the background scavenger. There are several other refinements, but that's the basic idea. Frequently accessed data are likely to be encountered first by the running program and are copied in order of access to one area of memory. Other data are likely to be encountered first by the scavenger, and copied to another. Unfortunately, fine-grained incremental collection causes considerable overhead if you don't have some special-purpose hardware (as Lisp machines do) to check for fromspace pointers in parallel. On the other hand, the RAM costs appear to be greatly reduced by Courts' technique, so a little bit of processor complexity may save hundreds or thousands of dollars in RAM costs. ----------------------------------------------------------------------- About the irreducible cost of garbage collection: Lisp code often uses heap allocation for things that would be stack- allocated in other languages. In Pascal, for instance, stack-allocated data are reclaimed immediately on procedure exit rather than at some indefinite point in the future (i.e., scavenge time). With a generational garbage collector, you can approximate this arbitrarily closely by garbage collecting your youngest generation more often. This becomes prohibitive,however, if taken too far. You end up copying other data too soon, without giving them time to die instead. The data that in Pascal would be stack allocated force more frequent scavenges, which means you copy the "true" heap data more often than necessary. ("True" heap data being data that would be heap allocated in Pascal as well.) To some degree, static analysis by a compiler can be used to allocate these objects on the stack, but the problem generally replicates itself at larger time scales. (The analysis is likely to help cache-scale locality but not the pagewise locality in older, larger generations.) One problem with most copying gc's is that they're very good at reclaiming the space used by short-lived data generated by applicative- style programming, but they're not particularly smart about data whose lifetimes are closely tied to procedure calls and returns. (I talk about this, and some things to do about it, in a working paper in the last regular issue of SIGPLAN Notices. The basic idea was put forth in Hewitt and Lieberman's original 1983 CACM paper on generational garbage collection; my paper is largely about efficient implementation tricks.) A related but more general problem is that most gc's lose efficiency by having awkward policies for advancing data from one generation to the next. All of the collectors I've mentioned so far advance all copied data to the next generation at every scavenge. This means that brand- new data are advanced along with data that were allocated just before the previous scavenge. Judging by the empirical data I've seen, *most* of these data die comparatively soon thereafter, so that most of your memory is holding garbage much of the time. This degrades locality considerably, I suspect. Ungar's Generation Scavenging system has more precise control over advancement because it associates a count with every object. The count reflects the number of times an object has survived a scavenge in a generation, and the object is advanced to the next generation when its count exceeds a threshold. This avoids most premature advancement, but other aspects of Ungar's design are troublesome. In the original system, at least, there were only two generations. This is part of Ungar's strategy of avoiding the need for an incremental collector, which is expensive on stock hardware. Only the new generation is scavenged during a user session, and it is hoped that the old generation will not fill up during a user session; it is only gc'd when the user is not using the system (e.g., at night). The new generation is kept small enough that collecting it does not take very long, and the user will not generally notice it much. This strategy appears to have two problems: 1) some programs generate a lot of data that survive long enough to be advanced; this fills up old memory and forces a time-consuming gc of a large generation during a user session, and 2) postponing any scavenging of this large, old generation degrades locality because it fills "intermediate lifetime data" that become garbage soon on the timescale of the old generation. Ungar and Jackson discuss this in their 1988 OOPSLA paper. The system I describe in the above-mentioned SIGPLAN Notices paper also addresses this issue, from a different angle. My approach is to go ahead and have more generations, but to schedule scavenges cleverly so that a) the user is unlikely to notice a pause, and b) the pauses are generally shorter because the system is more efficient. (The Tektronix implementation of Smalltalk (described in Caudill & Wirfs-Brock's 1986 OOPSLA paper) uses several generations with a guarantee of age, but doesn't attempt to schedule scavenges cleverly.) A drawback to Ungar's scheme (and Tektronix') is the storage and manipulation of per-object counts. In a forthcoming paper I discuss a different technique of guaranteeing age precision, which is to use a "bucket brigade" of spaces to segregate objects *within* a generation by approximate age. I argue that in a system with several generations, a very simple system with two buckets is not only efficient and easy to implement, it's what you probably want anyway. (This paper will be in SIGPLAN Notices in a while, but I can send copies to anybody who's interested.) I'm currently implementing my design for Scheme-48, and it will serve as a testbed for the ideas here and several others. ---------------------------------------------------------------------- About problems with parallel Lisps: I don't actually know how parallel Lisp systems are different from other parallel systems, but I expect the issues are many and subtle. For example, suppose you have adapted a Courts-style adaptive collector in a fairly straightforward way to run on a shared-memory multiprocessor. The objects that will *always* migrate into the most-active region of the oldest generation are those that are shared and frequently accessed by all processes. (No matter what process is running during an incremental scavenge, it will access and transport those data to the most active region of the next generation.) But maybe these are exactly the data you *don't* want all lumped together, at least at the resolution of cache blocks, because all of the processors will be trying to access them frequently. If two such data-objects-of-contention live in the same cache block, the contention goes up unnecessarily. If you have a lot of preemptive multitasking, you can also just confuse both your cache block and page-replacement algorithms. If you do a lot of thread switching with Courts' collector, you'll get a lot of "phase changes" that aren't likely to be repeated -- sequences of objects accessed by one processor interleaved with sequences accessed by another. The next time the process executes, it's likely to be preempted at a different point in its execution by a different process, so unrelated data have gotten interleaved. If one of the processes is migrated to another machine, the contention could be pretty bad (I guess). Heavyweight processes avoid this because the data don't even exist in the same address space. (Not that I think it's necessarily worth the costs.) But I don't really know much about this, or how people deal with such issues (which is why I asked for references and views on these things). One important question is whether a garbage collector lumps unrelated things together, so that even if processors are working on unrelated things, they contend for data access or swap in useless stuff. ------------------------------------------------------------------ So that's what I'm thinking about. What do others think? -- Paul Paul R. Wilson Electronic Mind Control Laboratory* off. ph. (312) 413-0042 U. of Illin. at C. EECS Dept. (M/C 154) wilson%uicbert@uxc.cso.uiuc.edu Box 4348 Chicago,IL 60680 (* a.k.a. Human Computer Interaction Lab)