Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!usc!apple!amdahl!JUTS!rbw00
From: rbw00@ccc.amdahl.com (  213  Richard Wilmot)
Newsgroups: comp.arch
Subject: Re: EFLOP architectures: when and for how much?
Keywords: pipelining
Message-ID: <520k02.v03iW01@JUTS.ccc.amdahl.com>
Date: 8 Nov 90 05:11:35 GMT
References: <1990Oct29.205812.1641@watdragon.waterloo.edu>
Reply-To: rbw00@JUTS.ccc.amdahl.com (  213  Richard Wilmot)
Organization: Amdahl Corporation, Sunnyvale CA
Lines: 139


In article <1990Oct29.205812.1641@watdragon.waterloo.edu>  gcwilliams@watdragon.waterloo.edu (Graeme Williams)
said:
> In article <1dbs02b=035v01@JUTS.ccc.amdahl.com> rbw00@JUTS.ccc.amdahl.com (  213  Richard Wilmot) writes:
> >gcwilliams@watdragon.waterloo.edu (Graeme Williams) said in article
> ><1990Oct26.191032.9099@watdragon.waterloo.edu> :
>
> >> An instruction cannot be executed in a time shorter than the time
> >> it takes for a light beam to traverse the processing device.
>
> >While it is true that an instruction cannot be EXECUTED in less time
> >than light can traverse the computing device, nFLOPS are measures of
> >THROUGHPUT.
>
> I don't follow this at all - for a *given* processor is not THROUGHPUT
> determined by how fast the processor executes ??

Throughput is determined by how fast the processor executes but not by
the time for an instruction to traverse all the pipeline stages. At any
time there are usually several instructions in the process of executing
in different pipeline stages. It's like an auto assembly line. It might
take two days for what will be a car to move from start to finish
(execution time for 1 car) but hundreds of cars are finished each hour
(that's throughput). Instead of autos we assemble instruction results.

> >We can improve throughput by some amount via pipelining of
> >the execution so that each section of a computing device performs only
> >part of the work for the instruction. After an initial delay to fill the
> >pipeline we could produce a result every 1/mth second where this time
> >between results < time for light to traverse the whole CPU.
>
>  Sure you can do some premliminary work on an instruction using
>  pipelining techniques - but the bottom line is that the EXECUTION of
>  an instruction changes the state of the processor (e.g. a register)
>  and that particular part of the processor involved has a finite
>  physical size and hence a maximum rate at which it can be changed.
>  No amount of pipelining can change this fact.
>
>  Of course one can do parallel execution (if your algorithm permits)
>  but then you're no longer using a *SINGLE* processor/execution unit
>  are you?
>
I'm not always too sure what is meant by a *single* processor.
In some pipelined machines the *state* of the machine does not have
a simple intuitive definition. *A*register* may have several different
values in several different stages of the pipeline. In fact, one of the
more interesting aspects of pipelining is that more than one iteration
of a loop might be 'in execution' in different pipeline stages at the
same time. So the value of register GR3 being loaded in iteration
2 of a loop might be different than the one that was loaded in iteration 1
of the loop if the loop is small enough for more than one iteration to
fit into the pipeline. Things become a bit tricky when you try to provide
'precise' interrupts/traps. If GR3 is used as a divisor and is 0 in one
iteration of a loop then we will have to present back to the software an
image that is consistent with a single machine state despite the fact that
while the machine was running it never had a single state.

Pipelines are today more common than not (e.g. 80486, most SPARC, MIPS,
S/390, etc.).

Things are even more complex when designers begin using multiple instruction
issue strategies as did the IBM RS6000 designers, where, as often as possible,
two instructions are issued in each clock cycle (I understand that, in that
implementation, one must be an integer instruction and the other floating point
in order to exploit the double issue). I doubt that two is the upper limit on
this technique. So, while the two instructions that began at address 44000
are executing, what is the program counter (PC). PC was usually thought of
as part of the machine state AND IT WAS SINGLE VALUED.

Back to the speed of light as a gating factor to computing throughput. If
we can complete an instruction on each cycle and there are 6 stages in our
pipeline then no stage can complete its work faster than light can traverse
that stage. If this IS a restriction then the throughput will be as if the
whole computer were squeezed into the size of one stage, 1/6 of its actual
path. We will get a new result EVERY clock and 1 clock will also be the signal
traversal time through one stage.

This STAGE business is only for the designers' convenience. It is somewhat
artificial and may not be a limiting factor. Consider a systolic array that
we might use for sorting. The array is composed of a number of comparators,
each of which will put the lower of its 2 input values out on its left
output and the other out on its right output:

       I1            I2            I3          I4

         \         /                  \       /
          \      /                      \    /
             C                            C
             |  \                        / |
             |    \                    /   |
             |      \                /     |
             |        \            /       |
             |          \        /         |
             |            \    /           |
             |              \/             |
             |             /  \            |
             |           /      \          |
             |         /          \        |
             |       /              \      |
             |     /                  \    |
             |   /                      \  |

              C                           C

            /   \                        /   \
          /       \                    /       \
         O1        O2                 O3       O4

We may be able to clock the values being sorted between levels of comparators
so that at each comparator at each clock period the two inputs become outputs.
This can easily be extended from the 2 levels shown [I hope] above to N levels.
We will then have N results each cycle with another N results 'in the mill'
right behind. One might call each level of comparators a 'stage' and we are
clearly operating the comparators in parallel.

But maybe the 'stages' above do not have to be a limiting factor. Instead
of a single master clock for all comparators, we could include some latching
so that a comparator which has finished its prior pair can accept the next pair
as soon as they are ready as signaled by latches set by the upstream comparator .
Such self-clocking is more reputed to be more difficult than deterministic
clocking. It would, however, be quite applicable to comparators which can
often finish early (ier than the worst case which must be allowed for in
a march step clock).

The sorting array was only meant for an easy to visualize example. In some
far future designs we might be able to succeed at more complex, less regular
designs which could retain some of the wave-of-solution advantages of the
systolic arrays but solve less regular, more general problems as found in
the usual computer instructions. In such a machine the TIME TO EXECUTE THE
FIRST INSTRUCTION MIGHT BE RELATIVELY LONG, but thereafter we might get the
results of subsequent instructions in a cascade with very little separation
or perhaps in parallel, and maybe even out of order where it doesn't matte r.

So I challenge the posting of the speed of light as a THROUGHPUT limit even
though it does limit the execution of a single instruction from start to
finish. More passengers instead of faster vehicles.
-- 
  Dick Wilmot  | I declaim that Amdahl might disclaim any of my claims.
                 (408) 746-6108