Path: utzoo!mnetor!uunet!husc6!think!ames!lll-lcc!pyramid!prls!mips!mark
From: mark@mips.COM (Mark G. Johnson)
Newsgroups: comp.arch
Subject: RPM-40 microprocessor @ 40 MHz; data from ISSCC
Message-ID: <1642@mips.mips.COM>
Date: 22 Feb 88 00:47:13 GMT
Lines: 169
Keywords: General Electric, DARPA-MIPS-core-ISA

Several articles have recently appeared, alluding to a CMOS uP built
by General Electric, e.g. <9629@steinmetz.steinmetz.UUCP>,
<9631@steinmetz.steinmetz.UUCP>, and <375@imagine.PAWL.RPI.EDU>.

These USENET articles mention that the chip, called the "GE RPM-40",
runs a reduced instruction set, operates from 40 MHz clocks, and will
be described at ISSCC (International Solid State Circuits Conference)
on February 17th.

The paper has now been delivered and published. The authors were
David Lewis, Theodore Wyman, Mark French, and Frederic Boericke
(no acknowledgments were presented).

Here are a few items of interest on the RPM-40, obtained from the oral
presentation and the printed digest of technical papers. No analysis
or critique is attempted; only a dump of raw data. The most noticeable
unknowns are marked with a double asterisk **; perhaps others can fill
in these gaps (if the data isn't secret).

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

GE RPM-40 CMOS MICROPROCESSOR

1. The chip was built under a DOD contract. It is one of several
implementations under this contract. There are at least three:
General Electric (CMOS bulk), McDonnell-Douglas (GaAs MESFET), and
Texas Instruments (GaAs Bipolar). Interestingly, they have each
chosen a different pipeline: GE == 4 stages, McDonnell == 5 stages,
TI == 6 stages.

2. The instruction set is "DARPA MIPS, core ISA (instruction set
architecture)". In the GE chip, instructions are 16 bits long. They
are fetched from Instruction Memory two-at-a-time (making 32 bit xfrs)
at a 20 MHz rate, totalling 40M instructions per sec.
Here is the summary chart of the instruction set:

************************************************************************
*          15 14 13 12    11 10  9  8     7  6  5  4     3  2  1       *
*        +-----------------------------------------------------------+ *
* ALU    |  0 0 | i |      opcode      |  src1/dest   |   src2/imm   | *
*        +-----------------------------------------------------------+ *
* COND   |  0 1 | i |       test       |     src1     |   src2/imm   | *
*        +-----------------------------------------------------------+ *
* LD     |  1 0 0   | s |     dest     |     base     |    offset    | *
*        +-----------------------------------------------------------+ *
* ST     |  1 0 1   | s |    source    |     base     |    offset    | *
*        +-----------------------------------------------------------+ *
* XPLD   |   1 1 0 0    |   xp-field   |     base     |    offset    | *
*        +-----------------------------------------------------------+ *
* BRA    |   1 1 0 1    |            branch displacement             | *
*        +-----------------------------------------------------------+ *
* PFX    |   1 1 1 0    |              prefix-immediate              | *
*        +-----------------------------------------------------------+ *
* XPINS  |   1 1 1 1    |          co-processor instruction          | *
*        +-----------------------------------------------------------+ *
************************************************************************

The ALU format has two register specifiers; presumably you can code
"R3 := R4 + R3" but you cannot code "R3 := R4 + R1".

The Store format has a source register, a base register, and a 4-bit
offset field. Loads have a dest reg, a base reg, and a 4-bit offset.

Branch instructions _seem_ to have only a 12-bit displacement field;
there doesn't appear to be a "Branch Register", "Branch And Link", or
"Conditional Branch" instruction. Perhaps the "COND" instruction is
the conditional-skip instruction recently mentioned on the net**.

ALU ops can have a 4-bit immediate field. If this is too small, the
"PREFIX" instruction contains a 12-bit prefix that can be concatenated
to the immediate, to create a 16-bit immediate value.
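
The formats in the chart can be pulled apart mechanically. What follows
is only a minimal sketch in Python, not GE's decoder. It assumes several
things the ISSCC material does not state: that bit 0 is the least
significant bit, that the register and offset fields are 4 bits wide,
that the ALU "opcode" and COND "test" fields take the remaining 5 bits,
and that the PREFIX bits become the high-order part of the widened
immediate. The names decode and widen_immediate are hypothetical.

```python
# Sketch of a field decoder for the 16-bit formats in the chart above.
# Assumptions (not from the ISSCC data): bit 0 is the least significant
# bit, register/offset fields are 4 bits, the ALU "opcode" and COND
# "test" fields are 5 bits, and the PREFIX bits form the high-order
# part of the widened immediate.

def decode(word):
    """Split one 16-bit instruction word into named fields."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction word must fit in 16 bits")
    mid4 = (word >> 4) & 0xF   # bits 7..4
    low4 = word & 0xF          # bits 3..0
    top2 = word >> 14
    if top2 in (0b00, 0b01):
        name = "ALU" if top2 == 0b00 else "COND"
        return {"format": name, "i": (word >> 13) & 1,
                "op": (word >> 8) & 0x1F,      # assumed 5-bit field
                "reg": mid4, "src2_imm": low4}
    if word >> 13 in (0b100, 0b101):
        name = "LD" if word >> 13 == 0b100 else "ST"
        return {"format": name, "s": (word >> 12) & 1,
                "reg": (word >> 8) & 0xF, "base": mid4, "offset": low4}
    top4 = word >> 12
    if top4 == 0b1100:
        return {"format": "XPLD", "xp": (word >> 8) & 0xF,
                "base": mid4, "offset": low4}
    # The remaining three formats carry a single 12-bit payload.
    names = {0b1101: "BRA", 0b1110: "PFX", 0b1111: "XPINS"}
    return {"format": names[top4], "payload": word & 0xFFF}


def widen_immediate(prefix12, imm4):
    """Concatenate a 12-bit prefix and a 4-bit immediate (prefix high)."""
    return ((prefix12 & 0xFFF) << 4) | (imm4 & 0xF)
```

Under those assumptions, decode(0x9A5C) comes back as an LD with s=1,
reg=10, base=5, offset=12, and widen_immediate(0xABC, 0xD) gives 0xABCD.
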
Perhaps the PREFIX instruction can be used with loads, stores, and
conditionals too. **

There are 21 32-bit registers; I _believe_ these are arranged as
16 general-purpose registers, plus 5 hardware stacks/queues (used in
exception processing) that are mapped into the register space. **

8-bit and 16-bit external data are converted into the internal 32-bit
format by zero-fill (unsigned) or sign-extend (signed). This is to
fulfill the DOD requirement for byte and halfword support. With only
a single "s" bit in the opcode it is difficult to see how these
instructions are encoded (load byte, load halfword, load word) "cross"
(signed, unsigned). **

3. A four-stage instruction pipeline is used (except for loads, see
below): Instruction Fetch, Instruction Decode, ALU, Writeback.
Address calculations (branch addresses or operand addresses) are
performed in the ALU.

4. Performance with 40 MHz clocks is 40 million native RPM-40 opcodes
per second. For the DOD, they benchmarked on a standard US Air Force
mix of instructions called the `DAIS Mix'. "The most pessimistic value
on that mix is 14 MIPS", the speaker said.

5. The GE implementation uses a Harvard bus structure, with completely
separate Instruction Memory and Operand Memory. GE currently is using
a total of 128Kbytes of memory: 16KWords of static RAM, each, for the
IMem and OMem. Imem needs 50ns chips and Omem needs 25ns chips. At
present there is no way to increase the amount of physical memory
(e.g. with dynamic RAM). The speaker said that the CPU chip is
intended for "embedded applications".

6. There is a "branch target instruction cache" which consists of
32 entries. Each entry holds 5 instructions (10 bytes). When a branch
occurs, the chip looks (fully associatively) to see whether it holds
the instruction at the branch target address in its cache. If a hit
(target instruction present) occurs, then the branch target
instruction, and the next 4 instructions, are read from the on-chip
cache.
Meanwhile the off-chip Imem is readying itself to begin delivering the
6th thru Nth instructions after the branch. Claimed hit rates of the
branch target instruction cache are > 90%. On a miss there is a
3-cycle latency to get the Imem SRAM chips delivering instructions
(and updating the b.t.i. cache).

7. The instruction memory contains a "lookahead counter". This lessens
traffic on the address bus; instruction addresses only squirt out of
the CPU after a branch .... leisurely reloading the counter while the
branch target instruction cache supplies the 5 instructions after a
branch.

8. Loads take 7 cycles while ALU operations take 4 cycles. If a
program doesn't use the target register of a load until > 3
instructions after the load ("3 load delay slots" in some folks'
parlance), then there is no interlock and instructions are issued one
per cycle. If you use the target register of a load <= 3 cycles later,
there is a pipeline stall while waiting for the Operand Memory to
supply the data.

Stores "can" operate at "up to" 1 per cycle. GE didn't discuss the
constraints that prevent 1 store per cycle always, nor did they
compare and contrast loads vs. stores. **

9. Coprocessor instructions (16 bits: 4 bit "Xternal Processor
Instruction" opcode plus 12 bit coprocessor instruction type) are
passed through the CPU, and sent over the address bus to the
coprocessor(s). They can be stored in the branch target address cache.
So it _appears_ that two cycles are required to do a coprocessor op,
one to communicate it from the CPU to the coprocessor and one to do
it **. GE didn't say whether there were architecturally-visible
register files on the coprocessors **, but there _appears_ to be an
"Xternal Processor Load" instruction **. The Floating Point
coprocessor is in fab now and is expected out this month.

10. The CPU chip contains 92,000 transistors and is housed in a 132 pin
package. The design style is fully static which is helpful for
achieving radiation-hard parts.
40 pins are inputs, 46 pins are outputs, 32 pins are bidirectional
(I/O), and there are 7 Power pins & 7 Ground pins. No mention was made
of whether this package configuration had been "certified" to run at
40 MHz, nor what agency would perform such certifications. ** The fab
process is 1.2 micron bulk CMOS.

11. A simple virtual memory scheme called "most significant bit
replacement" is used. A process-id is appended to the MSB's of an
address before sending it out of the CPU. A special case occurs when
those bits are all-0's or all-1's.... ** **

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-Mark Johnson
*** DISCLAIMER: The opinions above are personal. ***
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark   TEL: 408-991-0208
US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086