Path: utzoo!mnetor!uunet!husc6!think!ames!lll-lcc!pyramid!prls!mips!mark
From: mark@mips.COM (Mark G. Johnson)
Newsgroups: comp.arch
Subject: RPM-40 microprocessor @ 40 MHz; data from ISSCC
Message-ID: <1642@mips.mips.COM>
Date: 22 Feb 88 00:47:13 GMT
Lines: 169
Keywords: General Electric, DARPA-MIPS-core-ISA


Several articles have recently appeared, alluding to a CMOS uP
built by General Electric, e.g. <9629@steinmetz.steinmetz.UUCP>,
<9631@steinmetz.steinmetz.UUCP>, and <375@imagine.PAWL.RPI.EDU>.

These USENET articles mention that the chip, called the "GE RPM-40",
runs a reduced instruction set, operates from 40 MHz clocks, and will
be described at ISSCC (International Solid-State Circuits Conference)
on February 17th.

The paper has now been delivered and published.  The authors were
David Lewis, Theodore Wyman, Mark French, and Frederic Boericke
(no acknowledgments were presented).

Here are a few items of interest on the RPM-40, obtained from the
oral presentation and the printed digest of technical papers.  No
analysis or critique is attempted; only a dump of raw data.  The
most noticeable unknowns are marked with a double asterisk **;
perhaps others can fill in these gaps (if the data isn't secret).
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

			GE RPM-40 CMOS MICROPROCESSOR

1.  The chip was built under a DOD contract.  It is one of several
    implementations under this contract.  There are at least three:
    General Electric (CMOS bulk), McDonnell-Douglas (GaAs MESFET), and
    Texas Instruments (GaAs Bipolar).  Interestingly, they have each
    chosen a different pipeline: GE == 4 stages, McDonnell == 5 stages,
    TI == 6 stages.


2.  The instruction set is "DARPA MIPS, core ISA (instruction set
    architecture)".  In the GE chip, instructions are 16 bits long.
    They are fetched from Instruction Memory two-at-a-time (making
    32-bit transfers) at a 20 MHz rate, totalling 40M instructions
    per sec.

Here is the summary chart of the instruction set:
*******************************************************************************
*             15  14  13  12  11  10  9   8   7   6   5   4   3   2   1   0   *
*           +---------------------------------------------------------------+ *
* ALU       | 0   0 | i |    opcode     |     src1/dest     |   src2/imm    | *
*           +---------------------------------------------------------------+ *
* COND      | 0   1 | i |     test      |        src1       |   src2/imm    | *
*           +---------------------------------------------------------------+ *
* LD        | 1   0   0 | s |     dest      |      base     |    offset     | *
*           +---------------------------------------------------------------+ *
* ST        | 1   0   1 | s |    source     |      base     |    offset     | *
*           +---------------------------------------------------------------+ *
* XPLD      | 1   1   0   0 |   xp-field    |      base     |    offset     | *
*           +---------------------------------------------------------------+ *
* BRA       | 1   1   0   1 |              branch displacement              | *
*           +---------------------------------------------------------------+ *
* PFX       | 1   1   1   0 |               prefix-immediate                | *
*           +---------------------------------------------------------------+ *
* XPINS     | 1   1   1   1 |           co-processor instruction            | *
*           +---------------------------------------------------------------+ *
*******************************************************************************


The ALU format has two register specifiers, and the first doubles as
the destination; presumably you can code "R3 := R4 + R3" but you
cannot code "R3 := R4 + R1".

The Store format has a source register, a base register, and a 4-bit
offset field.  Loads have a dest reg, a base reg, and a 4-bit offset.

Branch instructions _seem_ to have only a 12-bit displacement field;
there doesn't appear to be a "Branch Register", "Branch And Link",
or "Conditional Branch" instruction.  Perhaps the "COND" instruction
is the conditional-skip instruction recently mentioned on the net**.
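
As a rough illustration of the chart above, here is a minimal sketch
(in Python, purely for convenience) of how the leading bits of a 16-bit
instruction word would select the format.  The function name `classify`
and the assumption that the leftmost chart column is bit 15 are mine,
not GE's; nothing here comes from the paper beyond the chart itself.

```python
def classify(word):
    """Return the format name for a 16-bit instruction word (illustrative)."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction words are 16 bits")
    top2 = word >> 14                  # bits 15-14
    if top2 == 0b00:
        return "ALU"
    if top2 == 0b01:
        return "COND"
    if top2 == 0b10:
        # bit 13 separates loads (0) from stores (1)
        return "ST" if (word >> 13) & 1 else "LD"
    # top2 == 0b11: bits 13-12 pick one of the four remaining formats
    return ("XPLD", "BRA", "PFX", "XPINS")[(word >> 12) & 0b11]


def low12(word):
    """Low 12 bits: the BRA displacement, PFX prefix, or XPINS payload.
    Whether the branch displacement is signed is not known, so the raw
    field is returned unchanged."""
    return word & 0x0FFF
```

For example, classify(0xD123) yields "BRA" and low12(0xD123) yields
0x123.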

ALU ops can have a 4-bit immediate field.  If this is too small, the
"PREFIX" instruction contains a 12-bit prefix that can be concatenated
to the immediate, to create a 16-bit immediate value.  Perhaps the
PREFIX instruction can be used with loads, stores, and conditionals
too. **
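
A minimal sketch of the concatenation described above, assuming the
12-bit prefix supplies the high-order bits and the 4-bit immediate the
low-order bits (the name "prefix" suggests this, but the ordering was
not stated).  The function name `extend_immediate` is hypothetical.

```python
def extend_immediate(prefix12, imm4):
    """Join a 12-bit prefix (assumed high bits) and a 4-bit immediate
    (assumed low bits) into one 16-bit value."""
    if not 0 <= prefix12 <= 0xFFF:
        raise ValueError("prefix must fit in 12 bits")
    if not 0 <= imm4 <= 0xF:
        raise ValueError("immediate must fit in 4 bits")
    return (prefix12 << 4) | imm4
```

Under that assumption, a prefix of 0xABC followed by an immediate of
0xD gives 0xABCD.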

There are 21 32-bit registers; I _believe_ these are arranged as
16 general-purpose registers, plus 5 hardware stacks/queues (used in
exception processing) that are mapped into the register space. **

8-bit and 16-bit external data are converted into the internal 32-bit
format by zero-fill (unsigned) or sign-extend (signed).  This is to
fulfill the DOD requirement for byte and halfword support.  With only
a single "s" bit in the opcode it is difficult to see how these
instructions are encoded: (load byte, load halfword, load word) "cross"
(signed, unsigned). **
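
The two widening rules can be sketched as follows.  This shows only the
arithmetic of zero-fill versus sign-extend; the function name `widen`
and its parameters are hypothetical, and it says nothing about how the
chip encodes the choice.

```python
def widen(value, bits, signed):
    """Widen a `bits`-wide datum (8 or 16) to a 32-bit pattern.

    signed=False zero-fills the upper bits; signed=True copies the
    datum's top bit into them.  The result is returned as an unsigned
    32-bit integer."""
    mask = (1 << bits) - 1
    value &= mask
    if signed and value >> (bits - 1):
        value |= 0xFFFFFFFF & ~mask    # fill the upper bits with ones
    return value
```

For instance, the byte 0x80 widens to 0x00000080 unsigned and to
0xFFFFFF80 signed.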


3.  A four-stage instruction pipeline is used (except for loads, see
    below): Instruction Fetch, Instruction Decode, ALU, Writeback.
    Address calculations (branch addresses or operand addresses) are
    performed in the ALU.


4.  Performance with 40 MHz clocks is 40 million native RPM-40 opcodes
    per second.  For the DOD, they benchmarked on a standard US Air Force
    mix of instructions called the `DAIS Mix'.  "The most pessimistic
    value on that mix is 14 MIPS", the speaker said.


5.  The GE implementation uses a Harvard bus structure, with completely
    separate Instruction Memory and Operand Memory.  GE currently is
    using a total of 128Kbytes of memory: 16KWords of static RAM, each,
    for the Imem and Omem.  Imem needs 50ns chips and Omem needs 25ns
    chips.  At present there is no way to increase the amount of
    physical memory (e.g. with dynamic RAM).  The speaker said that the
    CPU chip is intended for "embedded applications".
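
The memory figures hang together arithmetically, as this small sketch
shows.  It assumes a "word" is 32 bits (4 bytes), that Imem is accessed
at the 20 MHz fetch rate from item 2, and that Omem may be accessed
once per 40 MHz cycle; the last point is my inference from the 25ns
requirement, not something the speaker stated.

```python
WORD_BYTES = 4                    # assumption: 32-bit words
WORDS_PER_MEMORY = 16 * 1024      # 16KWords each for Imem and Omem

# Two memories of 16K words each, expressed in Kbytes.
total_kbytes = 2 * WORDS_PER_MEMORY * WORD_BYTES // 1024


def cycle_time_ns(mhz):
    """Period in nanoseconds of a clock running at `mhz` megahertz."""
    return 1000.0 / mhz


# Two 16-bit instructions per 32-bit fetch, at 20 million fetches/sec.
instructions_per_second = 2 * 20_000_000
```

Here total_kbytes comes to 128, cycle_time_ns(20) to 50.0 (the Imem
chips), and cycle_time_ns(40) to 25.0 (the Omem chips).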


6.  There is a "branch target instruction cache" which consists of 32
    entries.  Each entry holds 5 instructions (10 bytes).  When a branch
    occurs, the chip looks (fully associatively) to see whether it holds
    the instruction at the branch target address in its cache.  If a hit
    (target instruction present) occurs, then the branch target
    instruction, and the next 4 instructions, are read from the on-chip
    cache.  Meanwhile the off-chip Imem is readying itself to begin
    delivering the 6th thru Nth instructions after the branch.  Claimed
    hit rates of the branch target instruction cache are > 90%.  On a
    miss there is a 3-cycle latency to get the Imem SRAM chips delivering
    instructions (and updating the b.t.i. cache).
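
A toy model of the behaviour described in item 6 might look like the
sketch below.  The sizes (32 entries, 5 instructions, 3-cycle miss
latency) are from the talk; everything else is assumed for
illustration: the class and method names, addressing in units of one
instruction, the `fetch` callback standing in for Imem, and especially
the FIFO replacement policy, which GE did not describe.

```python
from collections import OrderedDict


class BranchTargetCache:
    ENTRIES = 32                  # fully associative entries
    INSTRUCTIONS_PER_ENTRY = 5    # target plus the next 4 instructions
    MISS_LATENCY = 3              # cycles before Imem starts delivering

    def __init__(self):
        self._entries = OrderedDict()   # target address -> instructions

    def lookup(self, target, fetch):
        """Return (instructions, stall_cycles) for a branch to `target`.

        `fetch(addr)` is a hypothetical callback that reads one
        instruction from Imem."""
        if target in self._entries:
            return self._entries[target], 0          # hit: no stall
        line = tuple(fetch(target + i)
                     for i in range(self.INSTRUCTIONS_PER_ENTRY))
        if len(self._entries) >= self.ENTRIES:
            self._entries.popitem(last=False)        # FIFO: an assumption
        self._entries[target] = line
        return line, self.MISS_LATENCY
```

The first branch to a given target pays the 3-cycle miss and fills an
entry; a repeat branch to the same target is served from the cache
with no stall, until the entry is displaced.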


7.  The instruction memory contains a "lookahead counter".  This lessens
    traffic on the address bus; instruction addresses only squirt out of
    the CPU after a branch .... leisurely reloading the counter while the
    branch target instruction cache supplies the 5 instructions after a
    branch.


8.  Loads take 7 cycles while ALU operations take 4 cycles.  If a program
    doesn't use the target register of a load until > 3 instructions after
    the load ("3 load delay slots" in some folks' parlance), then there
    is no interlock and instructions are issued one per cycle.  If you
    use the target register of a load <= 3 cycles later, there is a
    pipeline stall while waiting for the Operand Memory to supply the
    data.
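
The load-use rule can be put as a one-line formula.  The sketch below
assumes the stall lasts exactly as long as needed to fill the unused
delay slots (7 - 4 = 3 cycles at most); the talk gave the 3-slot
threshold but not the exact stall length, and the function name is
hypothetical.

```python
LOAD_DELAY_SLOTS = 3    # 7-cycle loads minus 4-cycle ALU operations


def load_use_stall(distance):
    """Stall cycles when the loaded register is first used `distance`
    instructions after the load (1 = the very next instruction)."""
    if distance < 1:
        raise ValueError("distance counts instructions after the load")
    return max(0, LOAD_DELAY_SLOTS + 1 - distance)
```

Under this assumption, using the register in the next instruction
costs 3 stall cycles, using it 3 instructions later costs 1, and any
use 4 or more instructions later costs none.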

Stores "can" operate at "up to" 1 per cycle.  GE didn't discuss the
constraints that prevent 1 store per cycle always, nor did they compare
and contrast loads vs. stores. **


9.  Coprocessor instructions (16 bits: 4 bit "Xternal Processor Instruction"
    opcode plus 12 bit coprocessor instruction type) are passed through
    the CPU, and sent over the address bus to the coprocessor(s).  They
    can be stored in the branch target instruction cache.  So it
    _appears_ that two cycles are required to do a coprocessor op, one
    to communicate it from the CPU to the coprocessor and one to do
    it **.  GE didn't say whether there were architecturally-visible
    register files on the coprocessors **, but there _appears_ to be an
    "Xternal Processor Load" instruction **.  The Floating Point
    coprocessor is in fab now and is expected out this month.


10. The CPU chip contains 92,000 transistors and is housed in a 132 pin
    package.  The design style is fully static which is helpful for
    achieving radiation-hard parts.  40 pins are inputs, 46 pins are
    outputs, 32 pins are bidirectional (I/O), and there are 7 Power pins
    & 7 Ground pins.  No mention was made of whether this package
    configuration had been "certified" to run at 40 MHz, nor what agency
    would perform such certifications. **  The fab process is 1.2 micron
    bulk CMOS.


11. A simple virtual memory scheme called "most significant bit replacement"
    is used.  A process-id is appended to the MSB's of an address before
    sending it out of the CPU.  A special case occurs when those bits
    are all-0's or all-1's.... ** **

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-Mark Johnson	*** DISCLAIMER: The opinions above are personal. ***	
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark   TEL: 408-991-0208
US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086