Path: utzoo!attcan!uunet!portal!cup.portal.com!bcase
From: bcase@cup.portal.com
Newsgroups: comp.arch
Subject: Re: Register Windows (was Re: Japanese...)
Message-ID: <9864@cup.portal.com>
Date: 8 Oct 88 18:50:55 GMT
References: <58@zeno.MN.ORG> <91@zeno.MN.ORG>

In article <9725@cup.portal.com>, bcase@cup.portal.com writes:
>> ... True, if you could look ahead at
>> the instruction stream, predict correctly all the conditional branches
>> etc., then you could know exactly where to insert the lazy loads and
>> stores.  But I think this is unrealistic.

>You don't seem to know what lazy operations are.  A lazy operation is NOT a
>simple "delayed" load or store, and it isn't triggered by an opcode being
>executed; it is an operation which is automatically triggered either when
>certain conditions exist which favor its efficient completion or when its
>result is required -- one does NOT insert lazy operations in anything.

No, I do understand what lazy operations are; the word "insert" as I have
used it refers to "inserting" the loads/stores into the pipeline.  I think
I understand what lazy means here because we considered using this
technique (call it lazy or dribble-back or whatever; BTW, it was called
dribble-back a long time ago, I don't know who named it) in the
implementation of the 29K.  That was in the very early days of the design.
We must not have had access to the info you have now because we considered
it too expensive or complicated or something to implement.  At least I
think I understand what lazy means; correct me if I am wrong.

>Let's say you have 8 non-lazy windows (and one heck of a lot of valuable die
>space consumed by them).  What do you do when the 9th nested call is made?
>The 10th?  You do a sit-and-wait-for-it burst store, that's what... would
>you really describe that as being "*without any* memory references at all!"

No!  Of course not; clearly that has memory references!
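
To make the 8-window question concrete, here is a minimal sketch that counts
how often a non-lazy register file would have to go to memory for a given
sequence of calls and returns.  It is only an illustration: the function
name count_spills, the "call"/"ret" trace format, and the policy of spilling
or refilling exactly one frame at a time are all assumptions, not a
description of the 29K, of SPARC, or of any measurement tool mentioned in
this thread.

```python
def count_spills(trace, num_windows=8):
    """Count overflows and underflows for a call/return trace.

    Hypothetical model: every call needs one window (frame); the register
    file holds at most num_windows frames; an overflow spills the oldest
    frame to memory and an underflow reloads one frame from memory.
    """
    depth = 0       # frames currently active (calls not yet returned)
    resident = 0    # how many of those frames are held in registers
    overflows = 0
    underflows = 0
    for event in trace:
        if event == "call":
            depth += 1
            if resident == num_windows:
                overflows += 1      # oldest frame is stored to memory
            else:
                resident += 1
        elif event == "ret":
            if depth == 0:
                raise ValueError("return without a matching call")
            depth -= 1
            resident -= 1
            if depth > 0 and resident == 0:
                underflows += 1     # caller's frame is reloaded from memory
                resident = 1
        else:
            raise ValueError("unknown event: %r" % (event,))
    return overflows, underflows


# Nine nested calls into eight windows, then nine returns.
deep = ["call"] * 9 + ["ret"] * 9
# A program whose call depth stays between 5 and 7.
shallow = ["call"] * 5 + ["call", "call", "ret", "ret"] * 100 + ["ret"] * 5

print(count_spills(deep))
print(count_spills(shallow))
```

In this toy model the ninth nested call is the first one that touches
memory, while a trace whose depth stays inside the register file never
does; how often real programs behave like the second trace is exactly the
empirical question.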
The research that I did at AMD showed that overflowing and underflowing
would happen infrequently.  (I instrumented the PCC to count the number and
size of overflows and underflows the VAX would have if it had a register
file like the ones we were considering for the 29K; I modified applications
like vi and circuit design tools to allocate their large, non-scalar local
items from the heap instead of from the stack and then compiled and ran
them.  At exit, they spit out stats.)  The actual results from simulations
and real chips support my research conclusions.  Thus, yes, overflows do
happen, but they happen infrequently in real life (there are pathological
cases for both schemes, though).  What I meant is that on average, several
"windows" (the 29K doesn't have "windows") can capture the working set of
the scalar run-time stack, and can effect most of the calls/returns with no
memory references.  Sorry for the confusion.

Your comment about the windows taking one heck of a lot of valuable die
space is correct; I happen to believe that that valuable die space is
occupied by a valuable resource:  registers.  Let me say for the record
that I prefer the 29K approach, where all registers, or a very large number
of them, are simultaneously addressable, over the SPARC approach where only
one window's worth is addressable at any one time.  I feel that registers
are a very performance-improving resource, one for which there is no
substitute.  They are the fastest part of the memory hierarchy and have
three times the bandwidth of a single-ported data cache, etc. etc. etc.

>Now, you can argue that there might not be enough time between calls or
>between returns to lazily store or reload a set, but that's unlikely
>because:

>Calls:  The lazy stores only have to store registers which are live
>        and dirty, i.e., whose value will be referenced after return
>        from the call and also is not the same as a value stored
>        somewhere in memory (e.g., a variable) by the programmer's
>        code.
>        In most RISC processors, the only instructions able
>        to make a register dirty are register = register op
>        register... which all have a free memory reference cycle!
>        In the worst case, you'd lag behind by just one register
>        store, which sure beats doing a non-lazy burst every so
>        often.  QED.

I think QED is a bit strong.  If you can prove that the lazy scheme "sure
beats doing a non-lazy burst every so often," then I am wrong, of course.
But your saying it doesn't make it so.  As I said in a previous posting, I
think seeing the algorithm, which can't be complicated or it wouldn't be
suitable for hardware implementation, might end the discussion; and if it
is good, it will certainly make me look foolish.  Maybe that's what I need,
although it wouldn't be the first time....

Let me take one last crack at making my point:  loads/stores typically take
two or more cycles.  Hardware will have to be very clever about finding the
right place to put the lazy loads/stores.  I think the added lazy
loads/stores will bring the data memory channel utilization up to near 100%
(which can be a good thing if doing so does not slow down the general rate
of load/store completion; pipelining might solve the problem...).  Thus, I
believe that the lazy loads/stores will, even assuming the optimal
placement in time, interfere with the loads/stores actually required by the
algorithm being executed.  This interference will lower the benefit of lazy
loads/stores enough to put the non-lazy, burst scheme on at least an equal
footing with the lazy one.

Since the burst scheme is simpler, I would prefer it.  But the lazy scheme
uses fewer registers, so you prefer it instead.  Since the lazy scheme does
not take advantage of the benefits afforded by the sequential access
pattern of the burst scheme, I would not choose it (since I know that I can
make the burst scheme fast).  Since the lazy scheme makes better
utilization of the memory channel (closer to 100%), you would choose it.
Or whatever.
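
The interference argument can be put into a toy cycle-counting sketch.
Everything in it is an assumption made for illustration: the names
lazy_stalls and burst_stalls, the two-cycle lazy store, the one cycle per
register for a sequential burst, and the rule that the hardware starts a
lazy store on any free cycle without knowing what the program does next.
It proves nothing about either scheme; it only shows how the outcome
depends on the program's own load/store pattern.

```python
def lazy_stalls(program_uses_channel, pending_stores, store_cycles=2):
    """Stall cycles caused by lazy stores in a hypothetical one-port model.

    program_uses_channel: one bool per program cycle, True when the
    program itself needs the data memory channel in that cycle.
    pending_stores: registers that must be in memory by the end of the
    sequence (say, by the next call that needs them).
    """
    stalls = 0
    busy = 0  # remaining cycles of a lazy store already in flight
    for wants_channel in program_uses_channel:
        if wants_channel:
            # The program's own access waits for the in-flight store.
            stalls += busy
            busy = 0
        elif busy > 0:
            busy -= 1               # in-flight store uses this free cycle
        elif pending_stores > 0:
            pending_stores -= 1     # start a lazy store in a free cycle
            busy = store_cycles - 1
    # Whatever is unfinished at the deadline is done while the CPU waits.
    return stalls + busy + pending_stores * store_cycles


def burst_stalls(num_stores, cycles_per_store=1):
    """Stall cycles for an explicit burst using sequential access."""
    return num_stores * cycles_per_store


alternating = [False, True] * 8   # program uses the channel every other cycle
idle = [False] * 16               # program never uses the channel
saturated = [True] * 16           # program uses the channel every cycle

for name, pattern in [("alternating", alternating),
                      ("idle", idle),
                      ("saturated", saturated)]:
    print(name, lazy_stalls(pattern, 4), burst_stalls(4))
```

With these made-up parameters the lazy scheme costs nothing when the
channel is idle, costs more than the burst when the channel is saturated,
and lands in between for mixed patterns.  Change the assumed store latency
or burst cost and the comparison moves, which is why the real algorithm and
real traces are needed to settle it.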

I am not saying that these comments are God's absolute truth.  I am just
trying to make clear my original point.  The stuff about predicting the
future was just my (apparently stupid) way of saying that figuring out
where to insert the lazy loads/stores is very difficult (at least that's
what I think).  And if you don't put them in the best places, you will
interfere with loads/stores required by the algorithm at hand.  If
interference happens, which I believe it will, then I don't see the
advantage of laziness over explicitness.  At least the explicit scheme can
take advantage of high-bandwidth sequential access memories.

There, now I think I have made myself clear.  Wrong maybe, but clear.

>The static/dynamic tradeoffs are what my research group,
>CARP, is all about.

This is valuable research; I didn't mean to attack your group or your
research directions either directly or indirectly.  I was once in grad
school; you are probably all thinking "I swear, some of those
industry-types are truly complete jerks."  I know *I* thought that about
industry-types more than a few times!  Sigh, I guess we all eventually
become that which we despise!  (Parents?, e.g.)