Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!usc!apple!amdahl!JUTS!rbw00 From: rbw00@ccc.amdahl.com ( 213 Richard Wilmot) Newsgroups: comp.arch Subject: Re: EFLOP architectures: when and for how much? Keywords: pipelining Message-ID: <520k02.v03iW01@JUTS.ccc.amdahl.com> Date: 8 Nov 90 05:11:35 GMT References: <1990Oct29.205812.1641@watdragon.waterloo.edu> Reply-To: rbw00@JUTS.ccc.amdahl.com ( 213 Richard Wilmot) Organization: Amdahl Corporation, Sunnyvale CA Lines: 139 In article <1990Oct29.205812.1641@watdragon.waterloo.edu> gcwilliams@watdragon.waterloo.edu (Graeme Williams) said: > In article <1dbs02b=035v01@JUTS.ccc.amdahl.com> rbw00@JUTS.ccc.amdahl.com ( 213 Richard Wilmot) writes: > >gcwilliams@watdragon.waterloo.edu (Graeme Williams) said in article > ><1990Oct26.191032.9099@watdragon.waterloo.edu> : > > >> An instruction cannot be executed in a time shorter than the time > >> it takes for a light beam to traverse the processing device. > > >While it is true that an instruction cannot be EXECUTED in less time > >than light can traverse the computing device, nFLOPS are measures of > >THROUGHPUT. > > I don't follow this at all - for a *given* processor is not THROUGHPUT > determined by how fast the processor executes ?? Throughput is determined by how fast the processor executes but not by the time for an instruction to traverse all the pipeline stages. At any time there are usually several instructions in the process of executing in different pipeline stages. It's like an auto assembly line. It might take two days for what will be a car to move from start to finish (execution time for 1 car) but hundreds of cars are finished each hour (that's throughput). Instead of autos we assemble instruction results. > >We can improve throughput by some amount via pipelining of > >the execution so that each section of a computing device performs only > >part of the work for the instruction. After an initial delay to fill the > >pipeline we could produce a result every 1/mth second where this time > >between results < time for light to traverse the whole CPU. > > Sure you can do some premliminary work on an instruction using > pipelining techniques - but the bottom line is that the EXECUTION of > an instruction changes the state of the processor (e.g. a register) > and that particular part of the processor involved has a finite > physical size and hence a maximum rate at which it can be changed. > No amount of pipelining can change this fact. > > Of course one can do parallel execution (if your algorithm permits) > but then you're no longer using a *SINGLE* processor/execution unit > are you? > I'm not always too sure what is meant by a *single* processor. In some pipelined machines the *state* of the machine does not have a simple intuitive definition. *A*register* may have several different values in several different stages of the pipeline. In fact, one of the more interesting aspects of pipelining is that more than one iteration of a loop might be 'in execution' in different pipeline stages at the same time. So the value of register GR3 being loaded in iteration 2 of a loop might be different than the one that was loaded in iteration 1 of the loop if the loop is small enough for more than one iteration to fit into the pipeline. Things become a bit tricky when you try to provide 'precise' interrupts/traps. If GR3 is used as a divisor and is 0 in one iteration of a loop then we will have to present back to the software an image that is consistent with a single machine state despite the fact that while the machine was running it never had a single state. Pipelines are today more common than not (e.g. 80486, most SPARC, MIPS, S/390, etc.). Things are even more complex when designers begin using multiple instruction issue strategies as did the IBM RS6000 designers, where, as often as possible, two instructions are issued in each clock cycle (I understand that, in that implementation, one must be an integer instruction and the other floating point in order to exploit the double issue). I doubt that two is the upper limit on this technique. So, while the two instructions that began at address 44000 are executing, what is the program counter (PC). PC was usually thought of as part of the machine state AND IT WAS SINGLE VALUED. Back to the speed of light as a gating factor to computing throughput. If we can complete an instruction on each cycle and there are 6 stages in our pipeline then no stage can complete its work faster than light can traverse that stage. If this IS a restriction then the throughput will be as if the whole computer were squeezed into the size of one stage, 1/6 of its actual path. We will get a new result EVERY clock and 1 clock will also be the signal traversal time through one stage. This STAGE business is only for the designers' convenience. It is somewhat artificial and may not be a limiting factor. Consider a systolic array that we might use for sorting. The array is composed of a number of comparators, each of which will put the lower of its 2 input values out on its left output and the other out on its right output: I1 I2 I3 I4 \ / \ / \ / \ / C C | \ / | | \ / | | \ / | | \ / | | \ / | | \ / | | \/ | | / \ | | / \ | | / \ | | / \ | | / \ | | / \ | C C / \ / \ / \ / \ O1 O2 O3 O4 We may be able to clock the values being sorted between levels of comparators so that at each comparator at each clock period the two inputs become outputs. This can easily be extended from the 2 levels shown [I hope] above to N levels. We will then have N results each cycle with another N results 'in the mill' right behind. One might call each level of comparators a 'stage' and we are clearly operating the comparators in parallel. But maybe the 'stages' above do not have to be a limiting factor. Instead of a single master clock for all comparators, we could include some latching so that a comparator which has finished its prior pair can accept the next pair as soon as they are ready as signaled by latches set by the upstream comparator . Such self-clocking is more reputed to be more difficult than deterministic clocking. It would, however, be quite applicable to comparators which can often finish early (ier than the worst case which must be allowed for in a march step clock). The sorting array was only meant for an easy to visualize example. In some far future designs we might be able to succeed at more complex, less regular designs which could retain some of the wave-of-solution advantages of the systolic arrays but solve less regular, more general problems as found in the usual computer instructions. In such a machine the TIME TO EXECUTE THE FIRST INSTRUCTION MIGHT BE RELATIVELY LONG, but thereafter we might get the results of subsequent instructions in a cascade with very little separation or perhaps in parallel, and maybe even out of order where it doesn't matte r. So I challenge the posting of the speed of light as a THROUGHPUT limit even though it does limit the execution of a single instruction from start to finish. More passengers instead of faster vehicles. -- Dick Wilmot | I declaim that Amdahl might disclaim any of my claims. (408) 746-6108