Path: utzoo!utgpu!news-server.csri.toronto.edu!mailrus!uwm.edu!rpi!zaphod.mps.ohio-state.edu!swrinde!cs.utexas.edu!oakhill!marvin
From: marvin@oakhill.UUCP (Marvin Denman)
Newsgroups: comp.arch
Subject: Re: High-Priority Instructions
Message-ID: <3612@yogi.oakhill.UUCP>
Date: 30 Jul 90 21:01:47 GMT
References: <58428@bbn.BBN.COM> <37310@shemp.CS.UCLA.EDU> <1990Jul27.161856.25701@mozart.amd.com>
Reply-To: marvin@yogi.UUCP (Marvin Denman)
Organization: Motorola Inc., Austin, Texas
Lines: 88

In article <1990Jul27.161856.25701@mozart.amd.com> davec@nucleus.amd.com (Dave Christie) writes:
>In <37310@shemp.CS.UCLA.EDU> marc@oahu.cs.ucla.edu (Marc Tremblay) writes:
>>In article <58428@bbn.BBN.COM> schooler@oak.bbn.com (Richard Schooler) writes:
>>> [description of problems with scheduling for writeback slot on 88k deleted]
>>It looks like Motorola did not want to deal with functional unit
>>latencies when instructions are issued.
>>Otherwise they could use a result shift register where the "writeback slot"
>>is reserved in advance according to the latency of the functional unit used.
>>Conflicts are thus resolved in advance. Collisions cause stalling of
>>the issuing unit.
>
>Yep, a pretty straightforward thing to do control-wise. You also want
>to be able to forward either or both input operands from it, so it's a
>bit more than a simple shift register, but nevertheless nice for handling
>medium latencies. You might want to revert to a priority scheme for
>divide though, or serialize. Another benefit is that it keeps your
>register file updates in order. Might have been just a tad too much
>realestate for them though, assuming they considered it.

I think that poster's proposal of a shift register was intended for
writeback reservations only, but I may have misunderstood. At the time
we designed the 88100 we did not seriously consider a shift register of
results, which is what Dave Christie seems to be talking about, because
of circuit complexities.
We definitely considered a shift register for writeback result
reservations, but for several reasons we decided that the arbitration
scheme was more flexible. The divide latencies would have required a
very long shift register, and load cache misses would have required
stalling all of the pipelines, negating one of the benefits of
interlocks. A shift register writeback reservation scheme would have
been fairly simple to implement, so that is not why we did not do it.

>
>>> [Richard's instruction-based priority bit scheme deleted]
>>
>>Let's see, instructions are held up because other instructions
>>have higher priority, they will proceed only if there is an empty slot.
>>If there is no empty slot, that means that other instructions are producing
>>useful work and that the write-back slot is running at full throughput
>
>But if a certain write slot doesn't satisfy a dependency that exists at
>decode then issue stalls, which will cause an empty write slot downstream
>and lower throughput. You would typically want to separate a FP instruction
>from a instruction that depended on it; if you throw as many integer
>instructions in between as necessary to compensate for the FP latency,
>those integer instructions just end up stretching out the FP latency
>by taking priority for writeback, and the extra FP latency isn't hidden
>at all - ouch! The proper priority scheme, IMHO, would give preference
>to earlier instructions in the sequence, which is of course the ones with
>the longer latency. I'd like the hear their rational for this. Maybe
>it was arbitrarily chosen and they just expected the compilers to keep
>it all moving (bet they learned a lesson there!). There's an example
>of rescheduling in an article on the 88K in June's IEEE Micro, but it
>conveniently avoids write slot contention with an appropriate mix of
>various latency operations.
>This of course does happen in the real world,
>but there's no mention of what happens when things don't work out so nicely.

Your point about just stalling the floating point writeback even further
by inserting integer instructions is valid to a degree. Inserting only
integer instructions between the point where the fp operation starts and
the point where its result is used will delay the fp result, but it does
hide all of the latency except the 1 extra clock that the sequencer
stalls to let the data write back before it is used. Even more
important, in most realistic code the instructions that you can insert
will include some loads or maybe even branches that will free up
writeback slots. Of course, if you do not have enough independent
instructions to completely hide the latency, it does not matter which
scheme you use because you will be waiting for the data anyway.

As for the rationale behind the prioritization, it is as follows. Single
cycle instructions are given the highest priority to avoid having to
build pipelines and unnecessarily stall single stage function units.
The memory unit has the lowest priority because it is not known soon
enough to arbitrate for a writeback slot whether the cache access will
hit or miss. If it had a higher priority, then writeback slots would be
wasted completely during every cache miss. In between these two are the
floating point units, whose priorities were much more arbitrary. The
multiplier was arbitrarily given higher priority than the adder. When
the multiplier had contention from within for integer and floating point
results, we went with the integer result. This decision was somewhat
arbitrary, but I seem to remember some arguments that if both imul and
fmul were occurring, the imul was probably being used in an array
address calculation and should have a higher priority. Given these
constraints, I don't think any other ordering would have significantly
higher performance over a mix of code.

--
Marvin Denman
Motorola 88000 Design
cs.utexas.edu!oakhill!marvin
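
The fixed-priority writeback arbitration described above can be
illustrated with a toy model. This is only a sketch under stated
assumptions, not a description of the actual 88100 logic: the unit names,
the `arbitrate`/`simulate` helpers, and every cycle number below are
hypothetical and chosen purely for illustration. The model grants one
writeback slot per cycle to the highest-priority unit with a result
ready, and the losers retry on the next cycle. It ignores issue stalls,
forwarding, and cache misses.

```python
# Toy model of fixed-priority writeback arbitration.
# ASSUMPTION: unit names and all cycle numbers are made up for illustration;
# only the ordering follows the post (single-cycle integer first, then the
# multiplier's integer result, its fp result, the adder, and memory last).
PRIORITY = ["integer", "imul", "fmul", "fadd", "memory"]


def arbitrate(ready_units):
    """Return the highest-priority unit among those ready, or None."""
    for unit in PRIORITY:
        if unit in ready_units:
            return unit
    return None


def simulate(ready_at):
    """Simulate one writeback slot per cycle.

    ready_at maps a result label to (unit, first cycle it may write back).
    Returns a dict mapping each label to the cycle it actually wrote back.
    """
    pending = dict(ready_at)
    written = {}
    cycle = 0
    while pending:
        # Results whose data is available this cycle compete for the slot.
        ready = {label: unit for label, (unit, t) in pending.items() if t <= cycle}
        winner_unit = arbitrate(set(ready.values()))
        if winner_unit is not None:
            # If one unit holds several results, the oldest goes first.
            label = min(
                (l for l, u in ready.items() if u == winner_unit),
                key=lambda l: pending[l][1],
            )
            written[label] = cycle
            del pending[label]
        cycle += 1
    return written


# Only integer instructions behind the fp add: the fp result keeps losing
# arbitration and writes back after the integer results have drained.
stretched = simulate({
    "f": ("fadd", 3),
    "i1": ("integer", 3),
    "i2": ("integer", 4),
    "i3": ("integer", 5),
})

# Replace one integer instruction with a load: the load's result arrives
# later and at lowest priority, so a slot opens up for the fp result.
relieved = simulate({
    "f": ("fadd", 3),
    "i1": ("integer", 3),
    "ld": ("memory", 5),
    "i3": ("integer", 5),
})

print(stretched["f"], relieved["f"], relieved["ld"])
```

In this model the fp result is pushed back behind the whole run of
integer results in the first case, while in the second case the load
leaves an empty slot that the fp result takes, which is the effect of
loads and branches freeing up writeback slots mentioned above.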