Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!decwrl!spar!hunt
From: hunt@spar.SPAR.SLB.COM (Neil Hunt)
Newsgroups: comp.arch
Subject: Self timed processors (was Re: Cycle stretching)
Message-ID: <810@spar.SPAR.SLB.COM>
Date: 19 Feb 88 01:58:20 GMT
References: <844@daisy.UUCP> <20409@amdcad.AMD.COM> <1232@alliant.Alliant.COM>
Reply-To: hunt@spar.UUCP (Neil Hunt)
Organization: SPAR - Schlumberger Palo Alto Research
Lines: 88

In article <1232@alliant.Alliant.COM> lackey@alliant.UUCP (Stan Lackey) writes:
>Actually, I once heard a proposal to make a microprocessor totally 
>ansynchronous, with logic added to determine when each stage of logic was
>complete, and use that to start the next stage.  It would take advantage of
>the fact that an ALU might be done sooner when adding small numbers, and lots
>of times the numbers added are small (compared to the total size of the 
>data path).  "Self-timed" is what it was called.

>An interesting idea, but likely wouldn't work too well in a pipeline, and
>would be difficult to interface to.  -Stan

I think that it would actually work rather well in a pipeline, with a little
care.

First, to recap on asynchronous signalling: an event is indicated
by a signal transition on a wire (in either direction). In the simplest
form of signalling, two wires are used for each bit of data. A transition
on one wire indicates the transmission of a one bit, while the transition
on the other wire indicates the transmission of a zero bit.
Thus a single transition signals not only the arrival of an event,
but also the type of event.
The receiving unit signals back along a single wire that the data
has been accepted, and more may be sent. To conserve wires, a data bundle
is sometimes used. Here the bits of data are put on a bundle
of wires in the conventional manner, using level signalling, and a
single event wire transition signals the arrival of new stable data
to the next stage.  Again, an acknowledgement transition on a return
wire is used.
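
As a rough sketch of the two-wire scheme (in Python, purely as a software
model; the class and function names are invented for illustration and
nothing here describes real hardware), the sender toggles one of two
wires and the receiver works out which bit was sent by noticing which
wire changed, then toggles the acknowledge wire:

```python
class TwoWireLink:
    """Software model of one two-wire transition-signalled bit channel."""

    def __init__(self):
        self.one = 0    # a transition here means "a one bit was sent"
        self.zero = 0   # a transition here means "a zero bit was sent"
        self.ack = 0    # a transition here means "data accepted"
        # Receiver's memory of the last wire levels it saw.
        self._seen_one = 0
        self._seen_zero = 0

    def send(self, bit):
        """Sender side: toggle exactly one wire; direction is irrelevant."""
        if bit:
            self.one ^= 1
        else:
            self.zero ^= 1

    def receive(self):
        """Receiver side: find the wire that changed, then acknowledge."""
        if self.one != self._seen_one:
            self._seen_one = self.one
            bit = 1
        elif self.zero != self._seen_zero:
            self._seen_zero = self.zero
            bit = 0
        else:
            return None          # no event has arrived
        self.ack ^= 1            # acknowledgement is also a transition
        return bit


def transfer(bits):
    """Send each bit only after the previous one has been acknowledged."""
    link = TwoWireLink()
    received = []
    for bit in bits:
        ack_before = link.ack
        link.send(bit)
        received.append(link.receive())
        # The sender may proceed only once it has seen the ack toggle.
        assert link.ack != ack_before
    return received
```

Calling transfer([1, 0, 0, 1, 1]) hands back the same list. The bundled
scheme would replace the two data wires with ordinary level-signalled
data lines plus a single event wire.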

Each section of the pipeline has event connections to the units
preceding and following it, which signal the availability and consumption
of each data item. Consider a linear pipeline of processing elements.
Data enters at one end, and propagates through the stages. Its speed
of propagation is limited by the speed of the processing stages, and by
the need to wait until the next stage is available. This means that
the pipeline will run correctly at the speed of its slowest component;
in the worst case, this is the rate which would have set the clock
frequency of a synchronous system. But if the slowest component speeds
up, perhaps because it is processing data which involves less
propagation up the carry chain, the whole pipeline speeds up to take
advantage of the smaller delays.
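
To put some illustrative numbers on this, here is a small timing model
in Python. It is only a sketch under my own assumptions: each stage
holds one item at a time, handing an item on takes no time, and the
delay figures are made up. The function names are hypothetical.

```python
def self_timed_finish(delays):
    """delays[i][s] is the time item i needs in stage s.

    Returns the time at which the last item leaves the last stage of a
    handshaking pipeline where each stage holds one item at a time.
    """
    n_stages = len(delays[0])
    prev_depart = [0.0] * n_stages   # when the previous item left each stage
    for item in delays:
        depart = [0.0] * n_stages
        enter = prev_depart[0]       # stage 0 is free once its last item left
        for s in range(n_stages):
            done = enter + item[s]
            if s + 1 < n_stages:
                # Cannot hand over until the next stage has been vacated.
                done = max(done, prev_depart[s + 1])
            depart[s] = done
            enter = done             # item moves on to the next stage
        prev_depart = depart
    return prev_depart[-1]


def clocked_finish(delays, period):
    """Same pipeline, but every stage takes one full clock period."""
    n_items = len(delays)
    n_stages = len(delays[0])
    return (n_items + n_stages - 1) * period
```

With four items whose middle stage always needs 3 units (delays of
[1, 3, 1] each), the self-timed model completes one item every 3 units
once full, the same rate as a clock period of 3. If the middle stage
only needs 1 unit for this data, the self-timed pipeline finishes all
four items at time 6, while the clocked one still takes 18.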

The problem with pipelines running in a self timed fashion concerns
external conditions. The obvious example is in a branch instruction;
in a synchronous system, there is a known number of branch delay slots,
which can be filled or empty, squashed, predicted, etc. The machine is
designed to throw away the wasted cycles in an incorrectly predicted
branch. But in a self timed system, it is not possible to say
how many instructions could be in the pipeline when the branch takes
the unpredicted direction. (A slower instruction could have entered the
pipeline, and be lagging behind a fast branch instruction, or
several fast instructions could all be bumper-to-bumper behind
the branch instruction.)

The answer is to make the relationships between the stages explicit,
and represent them as additional signalling connections. For example,
we could have some logic maintaining a state of the pipeline: either
full and running, or flushing discarded instructions.
When a taken branch is encountered, this state is set to flushing.
A signal which arrives with the new stream of instructions from
the memory system resets this to the running state.
The state of this unit controls whether the results of
computations are written or discarded. In this way, regardless of the
number of instructions actually in the pipeline when the branch was
taken, the processor can start to execute the new stream as soon as
it starts to arrive in the processor; there is no need to wait
for the longest possible time which it might take for the pipeline
to flush itself, as the entire processor is self timed.
Appropriate use of FIFOs and signal acknowledgements takes care of
the situation where the processor might have more than one
taken branch in the pipeline at once, which might, without care,
lead to the signal for the earlier branch being interpreted prematurely
as the OK to start using instructions after the second branch.
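
A minimal sketch of the running/flushing logic, again as a Python model
with invented names (FlushControl, retire, and the two flags are my own
labels, not part of any real design). It covers only the simple case of
one taken branch in flight; the FIFO arrangement needed for several
branches is not modelled:

```python
from enum import Enum


class Mode(Enum):
    RUNNING = "running"
    FLUSHING = "flushing"


class FlushControl:
    """Decides whether each result reaching the end of the pipe is written."""

    def __init__(self):
        self.mode = Mode.RUNNING
        self.written = []        # names of instructions whose results were kept

    def retire(self, name, taken_branch=False, new_stream=False):
        """Process one instruction; return True if its result is written."""
        if new_stream:
            # The marker travelling with the first instruction fetched
            # from the branch target puts us back in the running state.
            self.mode = Mode.RUNNING
        if self.mode is Mode.FLUSHING:
            return False         # discard: fetched past a taken branch
        self.written.append(name)
        if taken_branch:
            self.mode = Mode.FLUSHING
        return True
```

For example, retiring add, a taken beq, then sub and mul, then the
marked target instruction and its successor leaves written holding
add, beq, target and the successor. The same code gives the right
answer whether zero, two, or ten instructions were behind the branch,
which is the point: nothing depends on counting delay slots.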

Concerning interfacing: many system busses are currently asynchronous,
offering the same advantage. The cheaper operations run at their own
speed, and the slowness of the more expensive operations is paid only
when they are actually being performed.
With a synchronous processor, some of this advantage is lost, as the
asynchronous delays on the bus must be quantised to whole clock cycles
at the processor interface. Would it not be better to have the
entire system running in an asynchronous manner?
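
The quantisation loss is easy to work out. The sketch below (Python;
the function names and the delay figures in the example are invented
for illustration) rounds each bus delay up to a whole number of clock
periods and totals the time thrown away:

```python
def quantise(delay_ns, clock_ns):
    """Round a bus delay up to a whole number of clock periods (integers)."""
    cycles = -(-delay_ns // clock_ns)    # ceiling division without floats
    return cycles * clock_ns


def quantisation_loss(delays_ns, clock_ns):
    """Total time lost by waiting for clock edges across a list of delays."""
    return sum(quantise(d, clock_ns) - d for d in delays_ns)
```

With a 50 ns clock and bus operations taking 30, 70, 120 and 40 ns, the
synchronous interface sees 50, 100, 150 and 50 ns, so 90 ns out of 350
is spent doing nothing but waiting for the next edge.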

I think that this is in fact rather an exciting possibility.

Neil/.

					hunt@spar.slb.com
					...{amdahl|decwrl|hplabs}!spar!hunt