Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!iuvax!rutgers!apple!voder!berlioz!nelson
From: nelson@berlioz (Ted Nelson)
Newsgroups: comp.arch
Subject: Re: SISC
Summary: Pipelining it would be easy?
Message-ID: <184@berlioz.nsc.com>
Date: 6 May 89 16:51:50 GMT
Reply-To: nelson@berlioz.UUCP (Ted Nelson)
Distribution: usa
Organization: National Semiconductor, Santa Clara
Lines: 60


I am fascinated by the entire concept of a single instruction computer, and
  I feel it is possible that this idea will make it to market as an extremely
  low-cost general-purpose processor.  Of course, an entire generation of
  software tools will have to be rethought;  for one, self-modifying code
  will become a much more powerful (necessary?) method.

But the memory dependence is extremely high.

The van der Poel instruction requires 3 operand fetches, 2 data reads, and
  one data write.  Assuming that these cannot take place concurrently, that
  we have a system based on 100ns memory, and ignoring all other factors,
  each instruction takes 6 x 100ns = 600ns.  This instruction rate is
  roughly that of a 12 MHz 68000, but each instruction is considerably
  less powerful.
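For concreteness, here is a minimal Python sketch of such a machine that
  counts its memory accesses.  It assumes (an assumption, since the exact
  encoding is not spelled out above) three consecutive words A, B, C per
  instruction, with mem[B] = mem[B] - mem[A], a jump to C when the result
  is negative, and otherwise a fall-through to the next instruction.  The
  function names and the halt-address convention are hypothetical.

```python
MEM_NS = 100  # memory cycle time in ns, as assumed in the text


def step(mem, pc, stats):
    """Execute one instruction at pc and return the next pc.

    Assumed (hypothetical) semantics: mem[b] = mem[b] - mem[a]; jump to c
    if the result is negative, otherwise fall through to pc + 3.
    """
    a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]  # 3 operand fetches
    x, y = mem[a], mem[b]                         # 2 data reads
    result = y - x
    mem[b] = result                               # 1 data write
    stats["accesses"] += 3 + 2 + 1
    stats["instructions"] += 1
    return c if result < 0 else pc + 3


def run(mem, pc, halt, max_steps=10_000):
    """Run until pc reaches the (hypothetical) halt address."""
    stats = {"accesses": 0, "instructions": 0}
    while pc != halt and stats["instructions"] < max_steps:
        pc = step(mem, pc, stats)
    return stats


# Two instructions, then halt at address 6:
#   0: 9 10 3    mem[10] = 2 - 5 = -3 (negative) -> jump to 3
#   3: 11 11 6   mem[11] = 0 (not negative)      -> fall through to 6
mem = [9, 10, 3, 11, 11, 6, 0, 0, 0, 5, 2, 42]
stats = run(mem, pc=0, halt=6)
print(stats)  # {'accesses': 12, 'instructions': 2}
print(stats["accesses"] * MEM_NS / stats["instructions"], "ns/instruction")  # 600.0 ns/instruction
```

Since the program is just numbers in mem, nothing stops an instruction from
  overwriting the operand words of another, which is the self-modifying
  style mentioned at the top.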

First idea:  Since the operand fetches are in adjacent words, we can fetch
  them at the same time using triple-interleaved memory (this will require
  a bit more logic than typical interleaving) and three separate buses
  on the processor -- which is no problem since they are independent.  We
  could also take care of the data reads in the same way by putting a
  (severe?) restriction on the software (a la RISC "let the compiler deal
  with it"): the two data operands may not have addresses that are congruent
  modulo 3, so they always fall in different banks.  With this idea, each
  instruction's memory time drops to 300 ns (one cycle each for the operand
  fetch, the data reads, and the write) -- twice the throughput.
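A rough model of the interleaved scheme, as a Python sketch: bank = address
  mod 3, each bank serves one access per 100ns cycle, and the three banks
  work in parallel over separate buses.  The helper names are hypothetical,
  and, like the 600ns estimate, the model ignores everything except memory
  cycles.

```python
from collections import Counter

MEM_NS = 100  # memory cycle time in ns, as assumed in the text
BANKS = 3     # triple interleaving: bank = address mod 3


def bank(addr):
    return addr % BANKS


def cycles(addrs):
    """Memory cycles to service addrs when each bank handles one access
    per cycle and different banks work in parallel."""
    if not addrs:
        return 0
    return max(Counter(bank(x) for x in addrs).values())


def instruction_ns(pc, a, b):
    """Memory time for one instruction: fetch the three operand words at
    pc..pc+2, read data words a and b, then write b."""
    fetch = cycles([pc, pc + 1, pc + 2])  # always 1: consecutive words span all 3 banks
    read = cycles([a, b])                 # 1 only if a % 3 != b % 3
    write = cycles([b])                   # always 1
    return (fetch + read + write) * MEM_NS


print(instruction_ns(pc=0, a=9, b=10))  # 300: reads hit banks 0 and 1
print(instruction_ns(pc=0, a=9, b=12))  # 400: both reads hit bank 0
```

The second case shows what the software restriction buys: without it, the
  two data reads serialize in one bank and the instruction costs an extra
  cycle.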

Second obvious idea:  Pipeline the sucker.  I only have a basic understanding
  of pipelines, but it seems to me that a straight three or four stage
  pipe cannot work because of the memory conflict -- the fetch (F), read (R),
  and write (W) stages cannot operate concurrently.  So let me propose
  two more stages:  Computation (C) {essentially the subtract} and Branch (B)
  {choosing the next address from the condition code, the only one being
  Negative}.  The stages run in the order FRCWB, with a new instruction
  starting every two stages, as follows:

          F R C W B
              F R C W B
                  F R C W B
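The conflict can be found mechanically.  This Python sketch assumes each
  stage takes one cycle and a new instruction enters every two cycles, as
  drawn; F, R and W need memory, while C and B do not.  With a single memory
  port, any cycle with two memory stages active is a conflict.  The function
  names are hypothetical.

```python
STAGES = "FRCWB"            # Fetch, Read, Compute, Write, Branch
MEMORY_STAGES = set("FRW")  # stages that need the memory
ISSUE_INTERVAL = 2          # a new instruction enters every 2 cycles, as drawn


def schedule(n_instr):
    """Map cycle -> list of (instruction, stage) active in that cycle."""
    table = {}
    for i in range(n_instr):
        start = i * ISSUE_INTERVAL
        for k, stage in enumerate(STAGES):
            table.setdefault(start + k, []).append((i, stage))
    return table


def memory_conflicts(n_instr):
    """Cycles in which more than one memory-using stage is active."""
    conflicts = {}
    for cycle, active in sorted(schedule(n_instr).items()):
        users = [(i, s) for i, s in active if s in MEMORY_STAGES]
        if len(users) > 1:
            conflicts[cycle] = users
    return conflicts


for cycle, users in memory_conflicts(3).items():
    print(cycle, users)
# 3 [(0, 'W'), (1, 'R')]
# 5 [(1, 'W'), (2, 'R')]
```

Every issue interval produces the same Write/Read collision between
  neighbouring instructions, which is the conflict described next.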

As you can see, we still have a memory conflict between the Write of the
  "current" instruction and the Read of the next instruction.  My first
  reaction was to add another software restriction: the Write address and
  the next instruction's two Read addresses must all fall in different banks
  (different residues modulo 3).  But I suspect this is too severe to be
  usable -- too much for the compiler to handle.  Or is it?
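Here is what checking that restriction might look like, as a Python sketch
  of a hypothetical pass an assembler or compiler could run.  It assumes
  instructions are (A, B, C) triples at addresses 0, 3, 6, ..., that each
  instruction writes B, and that the next instruction is either the
  fall-through at pc + 3 or the branch target C, so both successors are
  checked.

```python
def distinct_banks(*addrs):
    """True if the addresses fall in pairwise different banks (mod 3)."""
    return len({x % 3 for x in addrs}) == len(addrs)


def violations(program):
    """program: dict mapping instruction address -> (a, b, c).

    Returns sorted (pc, successor) pairs where the write of the instruction
    at pc (address b) and the two reads of the successor (its a and b) do
    not occupy three different banks.
    """
    bad = []
    for pc, (_, b, c) in program.items():
        for nxt in {pc + 3, c}:  # fall-through and branch target
            if nxt in program:
                next_a, next_b, _ = program[nxt]
                if not distinct_banks(b, next_a, next_b):
                    bad.append((pc, nxt))
    return sorted(bad)


program = {
    0: (21, 22, 3),  # writes 22 (bank 1)
    3: (23, 24, 6),  # reads banks 2, 0: fine after a bank-1 write; writes 24 (bank 0)
    6: (27, 28, 9),  # reads banks 0, 1: clashes with the bank-0 write before it
}
print(violations(program))  # [(3, 6)]
```

Because the successor depends on the branch outcome, each write has to be
  checked against two possible next instructions, which is part of why the
  rule weighs on the compiler's data layout.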

Can anyone come up with a better pipelining scheme?  Or any way of improving
  the performance?  Keep in mind that the market for this is as a very
  low cost processor, so the problem cannot be solved by using dual-port
  RAM.  Unless, of course, dual-port RAM drops considerably in price.

Or we could use National Semiconductor's new memory product:  1 Megabit
  Write-Only Memory (WOM).  This is extremely inexpensive, has an access
  time of only 10 ns, and will be available in a dual-port version in only
  a few months.  If you wish to order any of this great part, please
  contact me directly -- it is such a secret project that we haven't let
  Marketing in on it yet.

-- Ted.

"When comes The Revolution, things will be different!
    Not better.  Just different."