Path: utzoo!mnetor!uunet!steinmetz!sunset!oconnor
From: oconnor@sunset.steinmetz (Dennis M. O'Connor)
Newsgroups: comp.arch
Subject: Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC
Message-ID: <9651@steinmetz.steinmetz.UUCP>
Date: 23 Feb 88 04:12:47 GMT
Sender: news@steinmetz.steinmetz.UUCP
Reply-To: sunset!oconnor@steinmetz.UUCP
Organization: GE Corporate R&D Center
Lines: 220
Keywords: General Electric, DARPA-MIPS-core-ISA

An article by mark@mips.COM (Mark G. Johnson) says:
] The paper has now been delivered and published.  The authors were
] David Lewis, Theodore Wyman, Mark French, and Frederic Boericke
] (no acknowledgments were presented).

ISSCC is a circuit-design conference: these are the people
most responsible for the circuit design, I think.

] 			GE RPM-40 CMOS MICROPROCESSOR
] 
] 1.  The chip was built under a DOD contract.  It is one of several
]     implementations under this contract.  There are at least three:
]     General Electric (CMOS bulk), McDonnell-Douglas (GaAs MESFET), and
] Texas Instruments (GaAs Bipolar).  Interestingly, they have each chosen a
] different pipeline: GE == 4 stages, McDonnell == 5 stages, TI == 6 stages.

TI is teamed with CDC on their GaAs effort, and Sperry (Unisys) had a
contract for a different CMOS version.

] 2.  The instruction set is "DARPA MIPS, core ISA (instruction set
]     architecture)".

The contract for all the processors specified efficient execution
AFTER TRANSLATION of the Core ISA. Core ISA is NOT the machine language.

]  In the GE chip, instructions are 16 bits long.
]  They are fetched from Instruction Memory two-at-a-time (making
]  32 bit xfrs) at a 20 MHz rate, totalling 40M instructions per sec.
 
] Here is the summary chart of the instruction set:
] ***************************************************************************
] *             15  14  13  12  11  10  9   8   7   6   5   4   3   2   1   *
] *           +-----------------------------------------------------------+ *
] * ALU       | 0   0 | i |    opcode     |     src1/dest     | src2/imm  | *
] *           +-----------------------------------------------------------+ *
] * COND      | 0   1 | i |     test      |        src1       | src2/imm  | *
] *           +-----------------------------------------------------------+ *
] * LD        | 1   0   0 | s |     dest      |      base     |  offset   | *
] *           +-----------------------------------------------------------+ *
] * ST        | 1   0   1 | s |    source     |      base     |  offset   | *
] *           +-----------------------------------------------------------+ *
] * XPLD      | 1   1   0   0 |   xp-field    |      base     |  offset   | *
] *           +-----------------------------------------------------------+ *
] * BRA       | 1   1   0   1 |           branch displacement             | *
] *           +-----------------------------------------------------------+ *
] * PFX       | 1   1   1   0 |             prefix-immediate              | *
] *           +-----------------------------------------------------------+ *
] * XPINS     | 1   1   1   1 |         co-processor instruction          | *
] *           +-----------------------------------------------------------+ *
] ***************************************************************************
] 
] The ALU format has two register specifiers; presumably you can code
] "R3 := R4 + R3"  but you cannot code "R3 := R4 + R1".

Correct, you need a two-instruction pair for three-address ops.
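
A rough sketch of how the format chart quoted above could be decoded,
for readers who want to play with it. It assumes the leftmost column of
the chart is the most significant bit of the 16-bit word; the function
name is made up for illustration, and only the leading format bits are
decoded, since the chart does not pin down every field width.

```python
def instruction_format(word):
    """Return the format name of a 16-bit instruction word, per the quoted chart.

    Assumption: the chart's leftmost column is the most significant bit.
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction words are 16 bits")
    top2 = word >> 14
    if top2 == 0b00:
        return "ALU"
    if top2 == 0b01:
        return "COND"
    top3 = word >> 13
    if top3 == 0b100:
        return "LD"
    if top3 == 0b101:
        return "ST"
    # Everything left starts with 11 and is selected by the top four bits.
    return {0b1100: "XPLD", 0b1101: "BRA", 0b1110: "PFX", 0b1111: "XPINS"}[word >> 12]
```

The variable-length format field (2, 3 or 4 bits) is what leaves the
ALU and COND formats the most room for operands.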

] The Store format has a source register, a base register, and a 4-bit
] offset field.  Loads have a dest reg, a base reg, and a 4-bit offset.

] Branch instructions _seem_ to have only a 12-bit displacement field;

... extendable to any length by PREFIX instructions ...

] there doesn't appear to be a "Branch Register", "Branch And Link",

... Branch Register is simply a MOV, B&L is a two-instruction pair ...

] or "Conditional Branch" instruction.  Perhaps the "COND" instruction
] is the conditional-skip instruction recently mentioned on the net**.

Yes, it can be applied to ANY instruction, not just branches.

] ALU ops can have a 4-bit immediate field.  If this is too small, the
] "PREFIX" instruction contains a 12-bit prefix that can be concatenated
] to the immediate, to create a 16-bit immediate value.  Perhaps the
] PREFIX instruction can be used with loads, stores, and conditionals
] too. **

PREFIXs can be prepended to ANY instruction that contains an immediate
field, including other PREFIX instructions, allowing immediates of any
size to be formed in the instruction stream without use of a g.p.
register or disruption of the pipeline flow.
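
A minimal sketch of the prefix mechanism as described above, for the
4-bit immediate case. The assumptions are mine, not from the chip
documentation: values are treated as unsigned, the first PREFIX in the
stream supplies the most significant bits, and the function names are
hypothetical.

```python
def assemble_immediate(prefixes, imm4):
    """Concatenate 12-bit PREFIX fields onto a 4-bit immediate.

    Assumptions: unsigned values, first prefix is most significant.
    """
    if not 0 <= imm4 < 16:
        raise ValueError("immediate field is 4 bits")
    value = 0
    for p in prefixes:
        if not 0 <= p < 4096:
            raise ValueError("prefix field is 12 bits")
        value = (value << 12) | p
    return (value << 4) | imm4


def split_immediate(value):
    """Inverse of assemble_immediate: return (prefixes, imm4) for an unsigned value."""
    if value < 0:
        raise ValueError("this sketch handles unsigned values only")
    imm4 = value & 0xF
    rest = value >> 4
    prefixes = []
    while rest:
        prefixes.insert(0, rest & 0xFFF)  # keep the most significant prefix first
        rest >>= 12
    return prefixes, imm4
```

Under these assumptions one PREFIX gives a 16-bit immediate and three
PREFIXes cover a full 32-bit value (3 x 12 + 4 = 40 bits available).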

] There are 21 32-bit registers; I _believe_ these are arranged as
] 16 general-purpose registers, plus 5 hardware stacks/queues (used in
] exception processing) that are mapped into the register space. **

There are 21 32-bit G.P. registers, plus various status registers, the
PCR, a TRAP register, and 5 hardware queues used for exception
processing. There are 32 register positions in the register map.

] 8-bit and 16-bit external data are converted into the internal 32-bit
] format by zero-fill (unsigned) or sign-extend (signed).  This is to
] fulfill the DOD requirement for byte and halfword support.  With only
] a single "s" bit in the opcode it is difficult to see how these
] instructions are encoded (load byte, load halfword, load word) "cross"
] (signed, unsigned). **

The one bit differentiates WORD and NON-WORD. What NON-WORD signifies
is determined by two bits (three for LD) in the user-accessible SR2 register.
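
As an illustration of the zero-fill / sign-extend conversion mentioned
in the quoted text, here is a small sketch. The SR2 bit layout is not
given here, so the width and signedness are passed as plain parameters;
the function name and signature are hypothetical.

```python
def extend_to_word(raw, width_bits, signed):
    """Widen an 8- or 16-bit memory value to the 32-bit internal format.

    `width_bits` and `signed` stand in for whatever the SR2 mode bits select;
    how SR2 encodes them is not specified in this post.
    """
    mask = (1 << width_bits) - 1
    value = raw & mask
    if signed and value & (1 << (width_bits - 1)):
        value -= 1 << width_bits      # sign bit set: interpret as negative
    return value & 0xFFFFFFFF         # two's-complement 32-bit pattern
```

For example, the byte 0x80 becomes 0x00000080 unsigned and 0xFFFFFF80
signed.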
 
] 3.  A four-stage instruction pipeline is used (except for loads, see
]     below): Instruction Fetch, Instruction Decode, ALU, Writeback.
]     Address calculations (branch addresses or operand addresses) are
]     performed in the ALU.

The "Instruction Fetch" (IF) stage doesn't really exist. The
instruction memory system is a look-ahead design.

] 5.  The GE implementation uses a Harvard bus structure, with completely
]     separate Instruction Memory and Operand Memory.  GE currently is
]     using a total of 128Kbytes of memory: 16KWords of static RAM, each,
] for the IMem and OMem.  Imem needs 50ns chips and Omem needs 25ns chips.
] At present there is no way to increase the amount of physical memory
] (e.g. with dynamic RAM).  The speaker said that the CPU chip is intended
] for "embedded applications".

Dynamic RAM was not deemed applicable to the environments in which the
RPM40 is designed to function. The current limits on memory size are
NOT architectural, but a function of the size and speed of available
RAM and the drive capacity of the RPM40 address bus drivers. The RPM40
is architecturally able to address 4GBytes of instruction and 4GBytes
of operand memory.

] 6.  There is a "branch target instruction cache" which consists of 32
]     entries.  Each entry holds 5 instructions (10 bytes).  When a branch
]     occurs, the chip looks (fully associatively) to see whether it holds
] the instruction at the branch target address in its cache.  If a hit
] (target instruction present) occurs, then the branch target instruction,
] and the next 4 instructions, are read from the on-chip cache. Meanwhile
] the off-chip Imem is readying itself to begin delivering the 6th thru Nth
] instructions after the branch.  Claimed hit rates of the branch target
] instruction cache are > 90%.  On a miss there is a 3-cycle latency to get
] the Imem SRAM chips delivering instructions (and updating the b.t.i. cache).

Good luck on your patent application, AMD29000 people. This design
dates back to March 1986, was "published" by GE in October 1986, and is
first mentioned back in '75 or '76 in some SIGArch conference
proceedings. GE didn't think the architecture was patentable.
Various implementations, of course, may be.
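
For readers unfamiliar with the idea, the branch target instruction
cache described in the quoted item 6 can be modelled roughly as below.
The 32 entries, 5 instructions per entry and 3-cycle miss latency come
from the quoted text; the LRU replacement policy, the zero-stall hit and
all names are assumptions made for the sketch, not a description of the
actual hardware.

```python
from collections import OrderedDict

ENTRIES = 32          # from the quoted description
INSTRS_PER_ENTRY = 5  # from the quoted description
MISS_LATENCY = 3      # cycles, from the quoted description


class BranchTargetCache:
    """Toy fully associative branch target cache (LRU replacement is assumed)."""

    def __init__(self):
        self.entries = OrderedDict()  # target address -> tuple of instructions

    def branch(self, target, imem):
        """Return (instructions, stall_cycles) for a taken branch to `target`.

        `imem` is a list indexed by instruction address (a simplification).
        """
        if target in self.entries:
            self.entries.move_to_end(target)   # mark as most recently used
            return self.entries[target], 0     # assumed: a hit costs no stall
        block = tuple(imem[target:target + INSTRS_PER_ENTRY])
        if len(self.entries) >= ENTRIES:
            self.entries.popitem(last=False)   # evict the least recently used
        self.entries[target] = block
        return block, MISS_LATENCY
```

With the claimed hit rate above 90%, this model averages under
0.1 x 3 = 0.3 stall cycles per taken branch.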
 
] 7.  The instruction memory contains a "lookahead counter".  This lessens
]     traffic on the address bus; instruction addresses only squirt out of
]     the CPU after a branch .... leisurely reloading the counter while the
] branch target instruction cache supplies the 5 instructions after a branch.

"Leisurely" was a major part of RPM40 design : no splitting cycles on
external busses. 25ns just isn't long enough to multiplex, in CMOS.

] 8.  Loads take 7 cycles while ALU operations take 4 cycles.  If a program
]     doesn't use the target register of a load until > 3 instructions after
]     the load ("3 load delay slots" in some folks' parlance), then there
] is no interlock and instructions are issued one per cycle.  If you use
] the target register of a load <= 3 cycles later, there is a pipeline stall
] while waiting for the Operand Memory to supply the data.
] 
] Stores "can" operate at "up to" 1 per cycle.  GE didn't discuss the
] constraints that prevent 1 store per cycle always, nor did they compare
] and contrast loads vs. stores. **

You can do Stores every cycle, or Loads every cycle, if nothing else
interferes. And there's a lot that does. For instance, LD and ST don't
use the D bus during the same pipestage. This of course leads to a
pipeline hazard when a ST follows a LD by particular distances...
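
The load timing quoted in item 8 (7 cycles for a load against 4 for an
ALU op, hence three delay slots) can be put as a one-line rule. This
sketch assumes the interlock stalls exactly as long as needed and
ignores every other hazard, including the LD/ST D-bus conflict
mentioned above; the function name is made up.

```python
LOAD_DELAY_SLOTS = 3  # from the quoted item 8 (7-cycle loads vs 4-cycle ALU ops)


def load_use_stall(distance):
    """Stall cycles when a load's target register is read `distance`
    instructions after the load (1 = the very next instruction).

    Assumption: the interlock stalls only as long as strictly necessary.
    """
    if distance < 1:
        raise ValueError("distance counts instructions after the load")
    return max(0, LOAD_DELAY_SLOTS + 1 - distance)
```

So a use in the very next instruction would cost three cycles under
this model, and a use four or more instructions later costs nothing.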

] 9.  Coprocessor instructions (16 bits: 4 bit "Xternal Processor Instruction"
]     opcode plus 12 bit coprocessor instruction type) are passed through
]     the CPU, and sent over the address bus to the coprocessor(s).  They
] can be stored in the branch target address cache.  So it _appears_ that
] two cycles are required to do a coprocessor op, one to communicate it
] from the CPU to the coprocessor and one to do it **.

It does take more than two cycles of LATENCY to do an XP op. However,
the RATE at which they can be done is one per cycle, as the
communication and execution of XP ops is pipelined. Everything in
RPM40 is pipelined : CPU, I-Cache, Memories, Coprocessors.
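
The latency-versus-rate distinction is simple arithmetic. In the sketch
below the latency is a parameter, since the exact figure for XP ops is
not given here, and the function names are hypothetical. It assumes no
stalls between back-to-back operations.

```python
def pipelined_cycles(n_ops, latency):
    """Cycles to finish n_ops back-to-back operations issued one per cycle."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1)


def unpipelined_cycles(n_ops, latency):
    """Cycles if each operation had to finish before the next could start."""
    return n_ops * latency
```

With an assumed latency of 3, one hundred ops take 102 cycles pipelined
against 300 unpipelined, so the per-op cost approaches one cycle.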

] GE didn't say
] whether there were architecturally-visible register files on the
] coprocessors **, but there _appears_ to be an "Xternal Processor Load"
] instruction **.  The Floating Point coprocessor is in fab now and is
] expected out this month.

XP architecture is transparent to the CPU. You want visible registers
in the XPs ? No problem. The FPU does have them. But the FPU has NOT
been "published" yet, so shouldn't be discussed.

] 10. The CPU chip contains 92,000 transistors and is housed in a 132 pin
]     package.  The design style is fully static which is helpful for
]     achieving radiation-hard parts.  40 pins are inputs, 46 pins are
] outputs, 32 pins are bidirectional (I/O), and there are 7 Power pins &
] 7 Ground pins.  No mention was made of whether this package configuration
] had been "certified" to run at 40 MHz, nor what agency would perform such
] certifications. **  The fab process is 1.2 micron bulk CMOS.

The package has run in excess of 75MHz. It's a leadless ceramic chip
carrier. It was chosen early in '86 because it had already been
certified (by someone GE trusts, I guess : maybe VHSIC ?) to run at
these speeds. The CPU IS running at 40MHz on a Mupac wire-wrap board,
executing the entire instruction set without problems (once the two
clock phases were brought to the correct values).
 
] 11. A simple virtual memory scheme called "most significant bit replacement"
]     is used.  A process-id is appended to the MSB's of an address before
]     sending it out of the CPU.  A special case occurs when those bits
] are all-0's or all-1's.... ** **

0 to 23 of the MSb's are replaced, but if the replaced bits aren't
all the same as the most significant NON-replaced bit, an exception occurs.
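
A sketch of that rule, assuming 32-bit addresses and that the
replacement bits come from a process-id value as the quoted text
suggests. The names, the exception class and the way the replacement
count is supplied are all hypothetical.

```python
class AddressException(Exception):
    """Raised when the bits to be replaced are not a sign extension."""


def replace_msbs(addr, n, pid):
    """Replace the top n bits (0..23) of a 32-bit address with the low n bits of pid.

    The n replaced bits must all equal the most significant non-replaced bit,
    otherwise an exception is raised (as described in the post).
    """
    if not 0 <= n <= 23:
        raise ValueError("0 to 23 bits may be replaced")
    addr &= 0xFFFFFFFF
    if n == 0:
        return addr
    keep = 32 - n
    top = addr >> keep                    # the n bits about to be replaced
    guard = (addr >> (keep - 1)) & 1      # most significant non-replaced bit
    expected = (1 << n) - 1 if guard else 0
    if top != expected:
        raise AddressException(hex(addr))
    low = addr & ((1 << keep) - 1)
    return ((pid & ((1 << n) - 1)) << keep) | low
```

In other words, the program's address has to be a sign-extended
(32 - n)-bit value before the top bits are swapped out.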

] -Mark Johnson	*** DISCLAIMER: The opinions above are personal. ***	
] UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark   TEL: 408-991-0208
] US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

This and the other DARPA MIPS processors are descendants,
philosophically anyway, of the original Stanford MIPS processor.
--
	Dennis O'Connor 	oconnor@sunset.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
    "Nuclear War is NOT the worst thing people can do to this planet."