Path: utzoo!attcan!uunet!portal!cup.portal.com!bcase
From: bcase@cup.portal.com
Newsgroups: comp.arch
Subject: Re: Register Windows (was Re: Japanese...)
Message-ID: <9864@cup.portal.com>
Date: 8 Oct 88 18:50:55 GMT
References: <58@zeno.MN.ORG> <91@zeno.MN.ORG> <ANDREW.88Sep28160417@jung.ha
Organization: The Portal System (TM)
Lines: 123
XPortal-User-Id: 1.1001.5156

First off, let me apologize to the net in general for starting this
discussion at all and particularly for apparently starting it off with
a touch of flame.  My intent was not to flame or get personal, but I
did want to say what I really thought was right.  Believe me, I'll drop
it after this because I am clearly not making myself understood and am
probably wrong anyway.  My intent is not to piss off anyone; I am not
that kind of person (usually :-).  I might meet anyone here at a conference
or whatever some day; I don't want the reaction to be "oh, yeah, that 
flaming *sshole from portal."

Hank Dietz (Hope I spelled the name right, our mailer doesn't include the
name automagically) writes:
>In article <9725@cup.portal.com>, bcase@cup.portal.com writes:
>> ...  True, if you could look ahead at
>> the instruction stream, predict correctly all the conditional branches
>> etc., then you could know exactly where to insert the lazy loads and
>> stores.  But I think this is unrealistic.
>You don't seem to know what lazy operations are.  A lazy operation is NOT a
>simple "delayed" load or store, and it isn't triggered by an opcode being
>executed; it is an operation which is automatically triggered either when
>certain conditions exist which favor its efficient completion or when its
>result is required --  one does NOT insert lazy operations in anything.

No, I do understand what lazy operations are; the word "insert" as I have
used it refers to "inserting" the loads/stores into the pipeline.  I think
I understand what lazy means here because we considered using this technique
(call it lazy or dribble-back or whatever; BTW, it was called dribble-back
a long time ago, I don't know who named it) in the implementation of the 29K.
That was in the very early days of the design.  We must not have had access
to the info you have now because we considered it too expensive or complicated
or something to implement.  At least I think I understand what lazy means;
correct me if I am wrong.

>Let's say you have 8 non-lazy windows (and one heck of a lot of valuable die
>space consumed by them).  What do you do when the 9th nested call is made?
>The 10th?  You do a sit-and-wait-for-it burst store, that's what...  would
>you really describe that as being "*without any* memory references at all!"

No!  Of course not; clearly that has memory references!  I did some
research on this at AMD:  I instrumented the PCC to count the number and
size of overflows and underflows the VAX would have if it had a register
file like the ones we were considering for the 29K.  I modified
applications like vi and circuit design tools to allocate their large,
non-scalar local items from the heap instead of from the stack, and then
compiled and ran them; at exit, they spit out stats.  The stats showed
that overflowing and underflowing would happen infrequently, and the
actual results from simulations and real chips support my research
conclusions.  Thus, yes, overflows do happen, but they happen
infrequently in real life (there are pathological cases for both schemes,
though).  What I meant is that on average, several "windows" (the 29K
doesn't have "windows") can capture the working set of the scalar run-time
stack, and can effect most of the calls/returns with no memory references.
Sorry for the confusion.
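
To make the kind of measurement described above concrete, here is a
minimal sketch of counting overflows and underflows for a register stack
cache.  It is not the instrumented PCC, just an illustration: the function
name, the event format, and the capacity and frame sizes in the usage note
are all made up for the example.  The one modelling assumption is that
the current frame must always be fully resident in registers.

```python
def count_spills(trace, capacity):
    """Count overflow/underflow events for a register stack cache.

    trace:    iterable of ("call", frame_size) or ("ret",) events
              (a hypothetical format chosen for this sketch).
    capacity: number of registers available to hold stack frames.

    Returns (overflows, underflows, words_spilled, words_filled).
    """
    frames = []    # sizes of active frames, innermost last
    depth = 0      # total words on the scalar run-time stack
    in_mem = 0     # words at the bottom of the stack that live in memory
    overflows = underflows = spilled = filled = 0

    for event in trace:
        if event[0] == "call":
            size = event[1]
            if size > capacity:
                raise ValueError("frame does not fit in the register file")
            frames.append(size)
            depth += size
            # Words resident in registers may not exceed the capacity;
            # spill the oldest words to memory if they do.
            excess = (depth - in_mem) - capacity
            if excess > 0:
                in_mem += excess
                overflows += 1
                spilled += excess
        elif event[0] == "ret":
            depth -= frames.pop()
            if frames:
                # The caller's frame must be fully resident again;
                # fill whatever part of it was spilled earlier.
                missing = frames[-1] - (depth - in_mem)
                if missing > 0:
                    in_mem -= missing
                    underflows += 1
                    filled += missing
        else:
            raise ValueError("unknown event: %r" % (event,))

    return overflows, underflows, spilled, filled
```

For example, with capacity 8 and three nested calls of 4 words each,
the third call overflows once and the second return underflows once.
A loop that calls a 4-word leaf routine from a 4-word caller never
touches memory at all, which is the "on average" case I was describing.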

Your comment about the windows taking one heck of a lot of valuable die
space is correct; I happen to believe that that valuable die space is
occupied by a valuable resource:  registers.  Let me say for the record
that I prefer the 29K approach, where all registers, or a very large
number of them, are simultaneously addressable, over the SPARC approach
where only one window's worth is addressable at any one time.  I feel
that registers are a very performance-improving resource, one for which
there is no substitute.  They are the fastest part of the memory
hierarchy and have three times the bandwidth of a single-ported data
cache, etc. etc. etc.

>Now, you can argue that there might not be enough time between calls or
>between returns to lazily store or reload a set, but that's unlikely
>because:
>Calls:		The lazy stores only have to store registers which are live
>		and dirty, i.e., whose value will be referenced after return
>		from the call and also is not the same as a value stored
>		somewhere in memory (e.g., a variable) by the programmer's
>		code.  In most RISC processors, the only instructions able
>		to make a register dirty are register = register op
>		register...  which all have a free memory reference cycle!
>		In the worst case, you'd lag behind by just one register
>		store, which sure beats doing a non-lazy burst every so
>		often. QED.

I think QED is a bit strong.  If you can prove that the lazy scheme "sure
beats doing a non-lazy burst every so often," then I am wrong, of course.
But your saying it doesn't make it so.  As I said in a previous posting,
I think seeing the algorithm, which can't be complicated or it wouldn't
be suitable for hardware implementation, might end the discussion; and if
it is good, it will certainly make me look foolish.  Maybe that's what I
need, although it wouldn't be the first time....

Let me take one last crack at making my point:  loads/stores typically take
two or more cycles.  Hardware will have to be very clever about finding
the right place to put the lazy loads/stores.  I think the added lazy
loads/stores will bring the data memory channel utilization up to near
100% (which can be a good thing if doing so does not slow down the general
rate of load/store completion; pipelining might solve the problem...).
Thus, I believe that the lazy loads/stores will, even assuming the optimal
placement in time, interfere with the loads/stores actually required by the
algorithm being executed.  This interference will lower the benefit of lazy
loads/stores, putting the non-lazy, burst scheme at least on equal footing.
Since the burst scheme is simpler, I would prefer it.  But the
lazy scheme uses fewer registers, so you prefer it instead.  Since the lazy
scheme does not take advantage of the benefits afforded by the sequential access
pattern of the burst scheme, I would not choose it (since I know that I
can make the burst scheme fast).  Since the lazy scheme makes better
utilization of the memory channel (closer to 100%), you would choose it.
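
The interference argument above can be sketched as a toy model.  It
proves nothing either way; it only shows what I mean by a lazy store
getting in the way of a load/store the program actually needs.  Everything
here is an assumption made for illustration: one instruction per cycle,
a program access occupying the channel for one slot, a two-cycle lazy
store, and a burst that pays two cycles for the first word and one for
each sequential word after it.

```python
def lazy_stall_cycles(program_uses_channel, pending_stores, store_cycles=2):
    """Toy model of lazy stores sharing the data memory channel.

    program_uses_channel: one boolean per instruction, True if that
                          instruction needs the memory channel.
    pending_stores:       registers waiting to be written back lazily.
    store_cycles:         cycles a lazy store holds the channel (assumed).

    Returns (stall_cycles, stores_still_pending).
    """
    channel_busy = 0   # remaining cycles the current lazy store needs
    stalls = 0
    for uses_channel in program_uses_channel:
        if uses_channel:
            # The program's own access must wait for a lazy store that
            # is still in flight: this is the interference.
            stalls += channel_busy
            channel_busy = 0
        elif channel_busy > 0:
            channel_busy -= 1          # lazy store continues in a free slot
        elif pending_stores > 0:
            pending_stores -= 1        # start a lazy store in a free slot
            channel_busy = store_cycles - 1
    return stalls, pending_stores


def burst_stall_cycles(words, first_access=2, next_access=1):
    """Stall cycles for an explicit burst spill of `words` registers,
    assuming a sequential-access memory (timings are assumptions)."""
    if words == 0:
        return 0
    return first_access + (words - 1) * next_access
```

With the instruction pattern [free, mem, free, free, mem] and two
registers pending, the first lazy store is still in flight when the
program's first access arrives, costing one stall cycle.  If every
instruction uses the channel, no lazy store ever starts and both
registers are still pending at the end.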

Or whatever.  I am not saying that these comments are God's absolute truth.
I am just trying to make clear my original point.  The stuff about predicting
the future was just my (apparently stupid) way of saying that figuring out
where to insert the lazy loads/stores is very difficult (at least that's what
I think).  And if you don't put them in the best places, you will interfere
with loads/stores required by the algorithm at hand.  If interference happens,
which I believe it will, then I don't see the advantage of laziness over
explicitness.  At least the explicit scheme can take advantage of high-
bandwidth sequential access memories.  There, now I think I have made myself
clear.  Wrong maybe, but clear.

>The static/dynamic tradeoffs are what my research group,
>CARP, is all about.

This is valuable research; I didn't mean to attack your group or your
research directions either directly or indirectly.  I was once in grad
school; you are probably all thinking "I swear, some of those industry-
types are truly complete jerks."  I know *I* thought that about industry-
types more than a few times!  Sigh, I guess we all eventually become that
which we despise!  (Parents?, e.g.)