Path: utzoo!mnetor!uunet!husc6!mit-eddie!uw-beaver!cornell!batcomputer!itsgw!imagine!pawl20.pawl.rpi.edu!jesup
From: jesup@pawl20.pawl.rpi.edu (Randell E. Jesup)
Newsgroups: comp.arch
Subject: Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC
Message-ID: <409@imagine.PAWL.RPI.EDU>
Date: 23 Feb 88 08:22:51 GMT
References: <1642@mips.mips.COM>
Sender: news@imagine.PAWL.RPI.EDU
Reply-To: beowulf!lunge!jesup@steinmetz.UUCP
Organization: RPI Public Access Workstation Lab - Troy, NY
Lines: 165
Keywords: General Electric, DARPA-MIPS-core-ISA

In article <1642@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
=These USENET articles mention that the chip, called the "GE RPM-40",
=runs a reduced instruction set, operates from 40 MHz clocks, and will
=be described at ISSCC (International Solid State Circuits Conference)
=on February 17th.

=The most noticeable unknowns are marked with a double asterisk **;
=perhaps others can fill in these gaps (if the data isn't secret).

To my knowledge, everything I say in this article is public information.
(I was on the RPM-40 software team for 1 year, until July 87.)

=1. The chip was built under a DOD contract.  It is one of several
=   implementations under this contract.  There are at least three:
=   General Electric (CMOS bulk), McDonnell-Douglas (GaAs MESFET), and
=   Texas Instruments (GaAs Bipolar).  Interestingly, they have each chosen
=   a different pipeline:  GE == 4 stages, McDonnell == 5 stages,
=   TI == 6 stages.

Also there's Sperry/UniSys (also CMOS).  It's not surprising that the GaAs
people use longer pipelines; they can't do much in that time, and are
restricted on transistors.

=2. The instruction set is "DARPA MIPS, core ISA (instruction set
=   architecture)".  In the GE chip, instructions are 16 bits long.
=   They are fetched from Instruction Memory two-at-a-time (making
=   32 bit xfrs) at a 20 MHz rate, totalling 40M instructions per sec.

All the machines listed above are designed so that 'Core ISA' (a generic
RISC assembly language, designed by Dr. Gross of CMU) can be translated to
their native assembly languages.

=The ALU format has two register specifiers; presumably you can code
="R3 := R4 + R3" but you cannot code "R3 := R4 + R1".

Correct: r3 = r4 + r1 becomes r3 = r4; r3 = r3 + r1.

=The Store format has a source register, a base register, and a 4-bit
=offset field.  Loads have a dest reg, a base reg, and a 4-bit offset.
=Branch instructions _seem_ to have only a 12-bit displacement field;
=there doesn't appear to be a "Branch Register", "Branch And Link",
=or "Conditional Branch" instruction.  Perhaps the "COND" instruction
=is the conditional-skip instruction recently mentioned on the net**.

Any of those displacements can be prefixed by PFX instruction(s) to
extend the displacement up to 32 bits.  Yes, Cond conditionally skips the
next instruction; Conds can be 'stacked' to provide complex conditionals.

=ALU ops can have a 4-bit immediate field.  If this is too small, the
="PREFIX" instruction contains a 12-bit prefix that can be concatenated
=to the immediate, to create a 16-bit immediate value.  Perhaps the
=PREFIX instruction can be used with loads, stores, and conditionals
=too. **

Yes, but you can use up to 3 prefixes to get 32-bit constants (in
reality, 32 bits are not used very often).

=There are 21 32-bit registers; I _believe_ these are arranged as
=16 general-purpose registers, plus 5 hardware stacks/queues (used in
=exception processing) that are mapped into the register space. **

Minor error: there are 21 gp registers, plus a number of special-purpose
registers, mostly reserved to supervisor mode.  Several are stacks for
internal state mapped into register slots.

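Going back to the two-operand ALU format for a moment: the rewrite quoted
above (r3 = r4 + r1 becoming r3 = r4; r3 = r3 + r1) can be sketched in a few
lines.  This is only an illustration of the idea, not the actual Core ISA
translator; the function name, the string output format, and the scratch
register "r_tmp" are all made up for the example.  Note the case where the
destination is also the second source, which the simple copy-then-operate
rewrite would get wrong.

```python
# Sketch only: lowers three-operand "dst = a OP b" into two-operand
# "dst = dst OP src" steps.  Names, output format, and the scratch
# register are illustrative assumptions, not RPM-40 assembler syntax.

COMMUTATIVE = {"+", "&", "|", "^"}


def lower(dst, a, op, b, scratch="r_tmp"):
    """Return the two-operand steps for dst = a OP b as a list of strings."""
    if dst == a:
        # Already in two-operand form.
        return [f"{dst} = {dst} {op} {b}"]
    if dst == b:
        if op in COMMUTATIVE:
            # a OP dst equals dst OP a, so no copy is needed.
            return [f"{dst} = {dst} {op} {a}"]
        # Non-commutative: copying a into dst first would destroy b,
        # so build the result in a scratch register and copy it back.
        return [
            f"{scratch} = {a}",
            f"{scratch} = {scratch} {op} {b}",
            f"{dst} = {scratch}",
        ]
    # General case from the post: copy the first source, then operate.
    return [f"{dst} = {a}", f"{dst} = {dst} {op} {b}"]


if __name__ == "__main__":
    for step in lower("r3", "r4", "+", "r1"):
        print(step)
```

A real reorganizer would presumably pick the scratch register from whatever
is free at that point rather than use a fixed name.
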
User-available registers are the PC, Trap register, sr2 (has various flags),
and the Size register (determines the size of non-word LD/ST, allows some
register remapping, and has a bit for doing 16-bit overflow detection
instead of 32).

=8-bit and 16-bit external data are converted into the internal 32-bit
=format by zero-fill (unsigned) or sign-extend (signed).  This is to
=fulfill the DOD requirement for byte and halfword support.  With only
=a single "s" bit in the opcode it is difficult to see how these
=instructions are encoded (load byte, load halfword, load word) "cross"
=(signed, unsigned). **

There are state bits in the size register that control some of this.
The 's' bit specifies "load word" or "load not word" (type defined by size
bits; usually you're only playing with one non-word type).

=4. Performance with 40 MHz clocks is 40 million native RPM-40 opcodes
=   per second.  For the DOD, they benchmarked on a standard US Air Force
=   mix of instructions called the `DAIS Mix'.  "The most pessimistic
=   value on that mix is 14 MIPS", the speaker said.

DAIS is a 1750a (Air Force Standard CPU) mix of instructions; the DAIS
timings are heavily FPU dependent and are in 1750a MIPS, not RPM-40 MIPS!

=5. The GE implementation uses a Harvard bus structure, with completely
=   separate Instruction Memory and Operand Memory.  GE currently is
=   using a total of 128Kbytes of memory: 16KWords of static RAM, each,
=   for the IMem and OMem.  Imem needs 50ns chips and Omem needs 25ns chips.
=   At present there is no way to increase the amount of physical memory
=   (e.g. with dynamic RAM).  The speaker said that the CPU chip is intended
=   for "embedded applications".

Well....  The current board has 128K, but the CPU supports full 32-bit
addressing.  Nothing says you can't put more than 128K on it, or use some
sort of external cache.  The only limit is the amount of capacitance the
CPU can drive at 40 MHz.

=8. Loads take 7 cycles while ALU operations take 4 cycles.
=   If a program doesn't use the target register of a load until > 3
=   instructions after the load ("3 load delay slots" in some folks'
=   parlance), then there is no interlock and instructions are issued one
=   per cycle.  If you use the target register of a load <= 3 cycles later,
=   there is a pipeline stall while waiting for the Operand Memory to
=   supply the data.

That is only a software stall, i.e. NOP insertion.  Of course, the
reorganizer will try to fill the slots.  Note that the 7 & 4 cycle figures
include all pipe stages, including the illusionary IF stage.

=Stores "can" operate at "up to" 1 per cycle.  GE didn't discuss the
=constraints that prevent 1 store per cycle always, nor did they compare
=and contrast loads vs. stores. **

There are some interlocks with other instructions that use the address
bus.  You can string as many stores in a row as you want, or as many loads.

=9. Coprocessor instructions (16 bits: 4 bit "Xternal Processor Instruction"
=   opcode plus 12 bit coprocessor instruction type) are passed through
=   the CPU, and sent over the address bus to the coprocessor(s).  They
=   can be stored in the branch target address cache.  So it _appears_ that
=   two cycles are required to do a coprocessor op, one to communicate it
=   from the CPU to the coprocessor and one to do it **.  GE didn't say
=   whether there were architecturally-visible register files on the
=   coprocessors **, but there _appears_ to be an "Xternal Processor Load"
=   instruction **.  The Floating Point coprocessor is in fab now and is
=   expected out this month.

The CPU doesn't have to wait; it just issues the instruction over the
address bus.  There is an XPLoad instruction, which is coprocessor
dependent.

=11. A simple virtual memory scheme called "most significant bit replacement"
=    is used.  A process-id is appended to the MSB's of an address before
=    sending it out of the CPU.  A special case occurs when those bits
=    are all-0's or all-1's....
=** **

Tasks can be allocated memory under this scheme in power-of-two sized
chunks == 256 bytes.  Of course, instructions and data have different
mappings.

=++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
=-Mark Johnson           *** DISCLAIMER: The opinions above are personal. ***
=UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark            TEL: 408-991-0208
=US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

I hate to admit this, but it was decided that Core ISA mandated
little-endian memory layout, since several other Core ISA users had already
implemented their CPUs that way when we questioned it.  (Will
little-endianism dog our heels forever? :-)

A VERY rough figure is that 1 RPM-40 @ 40 MHz is about equal to 7-9 16 MHz
68020's with 0 wait-state memory and no MMU delay.  (Not your standard unix
box environment 68020.)
{ WARNING: this is VERY ROUGH, and though I have calculations available that
  say this, they are very back-of-napkin style!  However, it's probably not
  TOO far off.  Maybe we'll have real performance figures at some point from
  GE (I don't work there anymore). }

     // Randell Jesup                         Lunge Software Development
    //  Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//   beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
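
P.S.  For anyone who wants to see the "most significant bit replacement"
idea from item 11 in concrete terms, here is a rough sketch of the address
arithmetic.  It is only my reading of the description above: the function
name, the parameter names, and the choice of how many high bits get replaced
are assumptions made for the example, and the special handling of all-0's or
all-1's high bits is deliberately NOT modelled because it isn't spelled out
here.

```python
# Sketch only: replaces the top bits of a 32-bit address with a process id.
# Field widths and names are illustrative assumptions; the all-0's/all-1's
# special case mentioned in the post is not modelled.

ADDR_BITS = 32


def msb_replace(vaddr, pid, pid_bits):
    """Return vaddr with its top pid_bits replaced by pid."""
    if not 0 < pid_bits < ADDR_BITS:
        raise ValueError("pid_bits must be between 1 and 31")
    if not 0 <= pid < (1 << pid_bits):
        raise ValueError("pid does not fit in pid_bits")
    low_bits = ADDR_BITS - pid_bits       # bits the task controls itself
    low_mask = (1 << low_bits) - 1        # power-of-two chunk size minus 1
    return (pid << low_bits) | (vaddr & low_mask)


if __name__ == "__main__":
    # A 256-byte chunk leaves 8 low bits to the task, 24 bits to the id.
    print(hex(msb_replace(0x00000012, 0xABCDEF, 24)))
```

Since the task keeps only the low bits, each task's chunk is necessarily a
power of two in size, which matches the allocation rule above.
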