Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!uwm.edu!rpi!zaphod.mps.ohio-state.edu!swrinde!cs.utexas.edu!oakhill!marvin
From: marvin@oakhill.UUCP (Marvin Denman)
Newsgroups: comp.arch
Subject: Re: High-Priority Instructions
Message-ID: <3612@yogi.oakhill.UUCP>
Date: 30 Jul 90 21:01:47 GMT
References: <58428@bbn.BBN.COM> <37310@shemp.CS.UCLA.EDU> <1990Jul27.161856.25701@mozart.amd.com>
Reply-To: marvin@yogi.UUCP (Marvin Denman)
Organization: Motorola Inc., Austin, Texas
Lines: 88

In article <1990Jul27.161856.25701@mozart.amd.com> davec@nucleus.amd.com (Dave Christie) writes:
>In <37310@shemp.CS.UCLA.EDU> marc@oahu.cs.ucla.edu (Marc Tremblay) writes:
>>In article <58428@bbn.BBN.COM> schooler@oak.bbn.com (Richard Schooler) writes:
>>> [description of problems with scheduling for writeback slot on 88k deleted]
>>It looks like Motorola did not want to deal with functional unit
>>latencies when instructions are issued.
>>Otherwise they could use a result shift register where the "writeback slot"
>>is reserved in advance according to the latency of the functional unit used.
>>Conflicts are thus resolved in advance. Collisions cause stalling of
>>the issuing unit.
>
>Yep, a pretty straightforward thing to do control-wise.  You also want
>to be able to forward either or both input operands from it, so it's a
>bit more than a simple shift register, but nevertheless nice for handling 
>medium latencies.  You might want to revert to a priority scheme for
>divide though, or serialize.  Another benefit is that it keeps your 
>register file updates in order.  Might have been just a tad too much 
>realestate for them though, assuming they considered it.

I think that poster's proposal of a shift register was intended for writeback
reservations only, but I may have misunderstood.  At the time we designed the
88100 we did not seriously consider a shift register of results, which is what
Dave Christie seems to be talking about, because of the circuit complexity.  We
definitely considered a shift register for writeback reservations, but
for several reasons we decided that the arbitration scheme was more
flexible.  The divide latencies would have required a very long shift register,
and load cache misses would have required stalling all of the pipelines,
negating one of the benefits of interlocks.  Implementing a shift register
writeback reservation scheme would have been fairly simple, so that
is not why we did not do it.
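
For readers who have not seen the idea, here is a minimal sketch of a
writeback reservation shift register, written in Python for illustration.
It is not the 88100 logic (the 88100 did not use this scheme); the class
name, the depth, and the latencies are all hypothetical.

```python
class WritebackReservation:
    """Toy writeback reservation shift register (hypothetical model)."""

    def __init__(self, depth):
        # slots[k] is True if the writeback slot k cycles from now is taken.
        self.slots = [False] * depth

    def try_issue(self, latency):
        """Reserve the slot `latency` cycles ahead; False means issue stalls."""
        if latency >= len(self.slots):
            # The register must be deeper than the longest latency it tracks.
            raise ValueError("latency exceeds shift register depth")
        if self.slots[latency]:
            return False  # collision: that writeback slot is already reserved
        self.slots[latency] = True
        return True

    def tick(self):
        """Advance one clock: every reservation moves one slot closer."""
        self.slots = self.slots[1:] + [False]


r = WritebackReservation(depth=8)
r.try_issue(3)         # reserves the slot three cycles ahead
r.tick()
print(r.try_issue(2))  # False: that slot is the one reserved a cycle ago
```

The depth check shows why a long divide latency implies a long register.
A load that misses the cache cannot honor its reservation, so in this
model the only remedy is to stop calling tick(), which freezes every
pipeline at once.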

>
>>> [Richard's instruction-based priority bit scheme deleted]
>>
>>Let's see, instructions are held up because other instructions
>>have higher priority, they will proceed only if there is an empty slot.
>>If there is no empty slot, that means that other instructions are producing
>>useful work and that the write-back slot is running at full throughput
>
>But if a certain write slot doesn't satisfy a dependency that exists at
>decode then issue stalls, which will cause an empty write slot downstream
>and lower throughput.  You would typically want to separate a FP instruction
>from a instruction that depended on it; if you throw as many integer
>instructions in between as necessary to compensate for the FP latency,
>those integer instructions just end up stretching out the FP latency
>by taking priority for writeback, and the extra FP latency isn't hidden
>at all - ouch!  The proper priority scheme, IMHO, would give preference
>to earlier instructions in the sequence, which is of course the ones with
>the longer latency.  I'd like the hear their rational for this.  Maybe
>it was arbitrarily chosen and they just expected the compilers to keep
>it all moving (bet they learned a lesson there!).  There's an example
>of rescheduling in an article on the 88K in June's IEEE Micro, but it
>conveniently avoids write slot contention with an appropriate mix of 
>various latency operations. This of course does happen in the real world,
>but there's no mention of what happens when things don't work out so nicely.

Your point about just stalling the floating point writeback even further by
inserting integer instructions is valid to a degree.  Inserting only integer
instructions between the start of the fp operation and its use will delay
the fp result, but it does hide all of the latency except the 1 extra clock
that the sequencer stalls to let the data write back before it is used.  Even
more important, in most realistic code the instructions that you can
insert will include some loads or maybe even branches that will free up
writeback slots.  Of course, if you do not have enough independent
instructions to completely hide the latency, it does not matter which scheme
you use, because you will be waiting for the data anyway.
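
The one-clock claim can be checked with a toy model, sketched here in
Python.  Every timing in it is an assumption made for illustration, not
88100 data: one instruction issues per cycle, the fp op issues at cycle 0,
each integer op writes back one cycle after it issues and always wins the
writeback slot, and a consumer can issue in the cycle its operand is
written back.

```python
def dependent_issue_cycle(fp_latency, n_int):
    """Cycle at which the instruction using the fp result can issue."""
    # Integer ops issue at cycles 1..n_int and write back one cycle later.
    int_writebacks = {t + 1 for t in range(1, n_int + 1)}
    # The fp result takes the first slot at or after its latency that no
    # integer op has claimed.
    wb = fp_latency
    while wb in int_writebacks:
        wb += 1
    earliest = n_int + 1  # first issue cycle after the integer ops
    return max(earliest, wb)


def stall_cycles(fp_latency, n_int):
    """Cycles the consumer waits beyond its earliest possible issue."""
    return dependent_issue_cycle(fp_latency, n_int) - (n_int + 1)


for n in (0, 2, 4, 10):
    print(n, stall_cycles(5, n))
```

Under these assumptions and a hypothetical fp latency of 5, the consumer
stalls 4 cycles with no filler instructions and 1 cycle with four or more,
no matter how many more integer ops are added.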

As for the rationale behind the prioritization, it is as follows.  Single cycle
instructions are given the highest priority to avoid having to build pipelines
and unnecessarily stall single stage function units.  The memory unit has the
lowest priority because whether the cache will hit or miss is not known soon
enough to arbitrate for a writeback slot.  If it had a higher
priority, writeback slots would be wasted completely during every
cache miss.  In between these two are the floating point units, whose
priorities were much more arbitrary.  The multiplier was arbitrarily given
higher priority than the adder.  When the multiplier had internal contention
between integer and floating point results, we went with the integer result.
This decision was somewhat arbitrary, but I seem to remember some arguments
that if both imul and fmul were occurring, the imul was probably being used in
an array address calculation and should have a higher priority.  Given these
constraints, I don't think any other ordering would have given significantly
higher performance over a mix of code.
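
The ordering described above amounts to a fixed-priority arbiter.  A
minimal sketch in Python follows; the unit names are hypothetical labels
chosen for this example, and the real arbitration is of course done in
hardware, not software.

```python
# Highest priority first, following the ordering in the text above.
# The names are illustrative labels, not 88100 signal names.
PRIORITY = [
    "int1",  # single cycle integer instructions
    "imul",  # integer result from the multiplier
    "fmul",  # floating point result from the multiplier
    "fadd",  # floating point adder
    "mem",   # memory unit (loads), lowest priority
]


def arbitrate(requests):
    """Return the unit that wins this cycle's writeback slot, or None."""
    for unit in PRIORITY:
        if unit in requests:
            return unit
    return None


print(arbitrate({"mem", "fadd"}))   # fadd
print(arbitrate({"imul", "fmul"}))  # imul
```

Units that lose arbitration simply request again on the next cycle, which
is why no reservation has to be made at issue time.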


-- 
Marvin Denman
Motorola 88000 Design
cs.utexas.edu!oakhill!marvin