Path: utzoo!mnetor!uunet!husc6!mit-eddie!uw-beaver!cornell!batcomputer!itsgw!imagine!pawl20.pawl.rpi.edu!jesup
From: jesup@pawl20.pawl.rpi.edu (Randell E. Jesup)
Newsgroups: comp.arch
Subject: Re: RPM-40 microprocessor @ 40 MHz; data from ISSCC
Message-ID: <409@imagine.PAWL.RPI.EDU>
Date: 23 Feb 88 08:22:51 GMT
References: <1642@mips.mips.COM>
Sender: news@imagine.PAWL.RPI.EDU
Reply-To: beowulf!lunge!jesup@steinmetz.UUCP
Organization: RPI Public Access Workstation Lab - Troy, NY
Lines: 165
Keywords: General Electric, DARPA-MIPS-core-ISA

In article <1642@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
=These USENET articles mention that the chip, called the "GE RPM-40",
=runs a reduced instruction set, operates from 40 MHz clocks, and will
=be described at ISSCC (International Solid State Circuits Conference)
=on February 17th.

=The
=most noticeable unknowns are marked with a double asterisk **;
=perhaps others can fill in these gaps (if the data isn't secret).

	To my knowledge, everything I say in this article is public
information.  (I was on the RPM-40 software team for 1 year, until July 87.)

=1.  The chip was built under a DOD contract.  It is one of several
=    implementations under this contract.  There are at least three:
=    General Electric (CMOS bulk), McDonnell-Douglas (GaAs MESFET), and
=Texas Instruments (GaAs Bipolar).  Interestingly, they have each chosen a
=different pipeline: GE == 4 stages, McDonnell == 5 stages, TI == 6 stages.

	Also there's Sperry/UniSys (also CMOS).  It's not surprising that the
GaAs people use longer pipelines: they can't do much in one clock period, and
are restricted on transistor count.

=2.  The instruction set is "DARPA MIPS, core ISA (instruction set
=    architecture)".  In the GE chip, instructions are 16 bits long.
=    They are fetched from Instruction Memory two-at-a-time (making
=32 bit xfrs) at a 20 MHz rate, totalling 40M instructions per sec.

	All the machines listed above are designed so that 'Core ISA' (a
generic RISC assembly language, designed by Dr Gross of CMU) can be translated
to their native assembly languages.

=The ALU format has two register specifiers; presumably you can code
="R3 := R4 + R3"  but you cannot code "R3 := R4 + R1".

	Correct, r3 = r4 + r1 becomes r3 = r4; r3 = r3 + r1.
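
	The expansion above can be sketched as a small lowering step.  This
is only an illustration in Python, not the RPM-40 toolchain: the function
name, the string form of the instructions, and the list of commutative
operators are all assumptions.  It also shows the one case that needs care,
where the destination is the second source.

```python
# Hypothetical sketch: lower "dest = src1 op src2" to two-address form,
# where every ALU instruction is "dest = dest op src".
COMMUTATIVE = {"+", "&", "|", "^"}  # assumed set, for illustration only


def lower_to_two_address(dest, src1, op, src2):
    """Return the two-address instruction(s) for dest = src1 op src2."""
    if dest == src1:
        # Already two-address: r3 = r3 + r1
        return [f"{dest} = {dest} {op} {src2}"]
    if dest == src2:
        # Copying src1 into dest first would destroy src2.
        if op in COMMUTATIVE:
            return [f"{dest} = {dest} {op} {src1}"]
        raise ValueError("needs a scratch register; not handled in this sketch")
    # General case: copy, then operate in place.
    return [f"{dest} = {src1}", f"{dest} = {dest} {op} {src2}"]


print(lower_to_two_address("r3", "r4", "+", "r1"))
# → ['r3 = r4', 'r3 = r3 + r1']
```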

=The Store format has a source register, a base register, and a 4-bit
=offset field.  Loads have a dest reg, a base reg, and a 4-bit offset.

=Branch instructions _seem_ to have only a 12-bit displacement field;
=there doesn't appear to be a "Branch Register", "Branch And Link",
=or "Conditional Branch" instruction.  Perhaps the "COND" instruction
=is the conditional-skip instruction recently mentioned on the net**.

	Any of those displacements can be prefixed by PFX instruction(s)
to extend the displacement up to 32 bits.  Yes, Cond conditionally skips
the next instruction; Conds can be 'stacked' to provide complex conditionals.

=ALU ops can have a 4-bit immediate field.  If this is too small, the
="PREFIX" instruction contains a 12-bit prefix that can be concatenated
=to the immediate, to create a 16-bit immediate value.  Perhaps the
=PREFIX instruction can be used with loads, stores, and conditionals
=too. **

	Yes, and you can use up to 3 prefixes to get 32-bit constants (in
practice, full 32-bit constants are not needed very often).
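
	The arithmetic behind "up to 3 prefixes" is 4 + 12 = 16 bits with one
prefix, 28 bits with two, and 40 (more than 32) with three.  A rough sketch
in Python follows.  The field widths come from the article; treating the
constant as unsigned, and emitting the most significant prefix first, are
assumptions, since the real encoding and sign handling are not described here.

```python
IMM_BITS = 4    # immediate field in the base instruction (from the article)
PFX_BITS = 12   # bits carried by each PFX instruction (from the article)


def prefixes_needed(value):
    """Number of PFX instructions needed for an unsigned 32-bit constant."""
    if not 0 <= value < 2**32:
        raise ValueError("expected an unsigned 32-bit value")
    n = 0
    while value >= 1 << (IMM_BITS + n * PFX_BITS):
        n += 1
    return n


def split_constant(value):
    """Split value into (prefix chunks, 4-bit immediate).

    Chunks are listed most significant first (an assumed ordering).
    """
    n = prefixes_needed(value)
    imm = value & ((1 << IMM_BITS) - 1)
    rest = value >> IMM_BITS
    chunks = [(rest >> (PFX_BITS * i)) & ((1 << PFX_BITS) - 1) for i in range(n)]
    return list(reversed(chunks)), imm


print(prefixes_needed(0xFFFF), prefixes_needed(0xFFFFFFFF))
# → 1 3
```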

=There are 21 32-bit registers; I _believe_ these are arranged as
=16 general-purpose registers, plus 5 hardware stacks/queues (used in
=exception processing) that are mapped into the register space. **

	Minor error, there are 21 gp registers, plus a number of special
purpose registers, mostly reserved to supervisor mode.  Several are stacks
for internal state mapped into register slots.  User available registers
are the PC, Trap register, sr2 (has various flags), and the Size register
(determines the size of non-word LD/ST, allows some register remapping,
and a bit for doing 16-bit overflow detection instead of 32).

=8-bit and 16-bit external data are converted into the internal 32-bit
=format by zero-fill (unsigned) or sign-extend (signed).  This is to
=fulfill the DOD requirement for byte and halfword support.  With only
=a single "s" bit in the opcode it is difficult to see how these
=instructions are encoded (load byte, load halfword, load word) "cross"
=(signed, unsigned). **

	There are state bits in the size register that control some of
this.  The 's' bit specifies "load word" or "load not word" (type defined
by size bits, usually you're only playing with one non-word type).

=4.  Performance with 40 MHz clocks is 40 million native RPM-40 opcodes
=    per second.  For the DOD, they benchmarked on a standard US Air Force
=    mix of instructions called the `DAIS Mix'.  "The most pessimistic
=value on that mix is 14 MIPS", the speaker said.

	DAIS is a 1750a (Air Force Standard CPU) mix of instructions; the
DAIS timings are heavily FPU dependent and are in 1750a MIPS, not RPM-40!

=5.  The GE implementation uses a Harvard bus structure, with completely
=    separate Instruction Memory and Operand Memory.  GE currently is
=    using a total of 128Kbytes of memory: 16KWords of static RAM, each,
=for the IMem and OMem.  Imem needs 50ns chips and Omem needs 25ns chips.
=At present there is no way to increase the amount of physical memory
=(e.g. with dynamic RAM).  The speaker said that the CPU chip is intended
=for "embedded applications".

	Well.... The current board has 128K, but the CPU supports full
32-bit addressing.  Nothing says you can't put more than 128K on it, or use
some sort of external cache.  The only limit is the amount of capacitance
the CPU can drive at 40 MHz.

=8.  Loads take 7 cycles while ALU operations take 4 cycles.  If a program
=    doesn't use the target register of a load until > 3 instructions after
=    the load ("3 load delay slots" in some folks' parlance), then there
=is no interlock and instructions are issued one per cycle.  If you use
=the target register of a load <= 3 cycles later, there is a pipeline stall
=while waiting for the Operand Memory to supply the data.

	That is only a software stall, i.e. NOP insertion.  Of course, the
reorganizer will try to fill the slots with useful instructions.  Note that
the 7 & 4 cycle figures include all pipe stages, including the illusory IF
stage.
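
	A minimal sketch of the NOP-insertion pass, in Python.  It assumes
the 3 load delay slots quoted above and a made-up instruction form of
(opcode, dest, sources); it only pads with NOPs and does not try to move
useful instructions into the slots the way a real reorganizer would.

```python
LOAD_DELAY_SLOTS = 3  # from the quoted article: use must be > 3 instructions later


def insert_nops(program):
    """Pad a program so no instruction reads a load target too early.

    program: list of (opcode, dest, sources) tuples (hypothetical format).
    """
    out = []
    ready_at = {}  # register -> first index in `out` where a loaded value is usable
    for op, dest, srcs in program:
        # Earliest position at which every source register is safe to read.
        need = max((ready_at.get(r, 0) for r in srcs), default=0)
        while len(out) < need:
            out.append(("nop", None, ()))
        out.append((op, dest, srcs))
        if op == "load":
            # The load sits at len(out) - 1; skip the delay slots after it.
            ready_at[dest] = len(out) + LOAD_DELAY_SLOTS
    return out


prog = [("load", "r1", ("r2",)), ("add", "r3", ("r3", "r1"))]
print([op for op, _, _ in insert_nops(prog)])
# → ['load', 'nop', 'nop', 'nop', 'add']
```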

=Stores "can" operate at "up to" 1 per cycle.  GE didn't discuss the
=constraints that prevent 1 store per cycle always, nor did they compare
=and contrast loads vs. stores. **

	There are some interlocks with other instructions that use the
address bus.  You can string as many stores in a row as you want, or as many
loads.

=9.  Coprocessor instructions (16 bits: 4 bit "Xternal Processor Instruction"
=    opcode plus 12 bit coprocessor instruction type) are passed through
=    the CPU, and sent over the address bus to the coprocessor(s).  They
=can be stored in the branch target address cache.  So it _appears_ that
=two cycles are required to do a coprocessor op, one to communicate it
=from the CPU to the coprocessor and one to do it **.  GE didn't say
=whether there were architecturally-visible register files on the
=coprocessors **, but there _appears_ to be an "Xternal Processor Load"
=instruction **.  The Floating Point coprocessor is in fab now and is
=expected out this month.

	The CPU doesn't have to wait; it just issues the instruction over
the address bus.  There is an XPLoad instruction, which is coprocessor
dependent.

=11. A simple virtual memory scheme called "most significant bit replacement"
=    is used.  A process-id is appended to the MSB's of an address before
=    sending it out of the CPU.  A special case occurs when those bits
=are all-0's or all-1's.... ** **

	Tasks can be allocated memory under this scheme in power-of-two
sized chunks == 256 bytes.  Of course, instructions and data have different
mappings.
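
	As a rough illustration of most-significant-bit replacement, in
Python: the high bits of the address are replaced by per-task bits and the
low bits (the offset within the task's chunk) pass through.  The function
name, the parameter names, and the choice to silently drop out-of-range
offset bits are all assumptions; the special handling of all-0's and all-1's
mentioned in the quoted article is not modelled.

```python
def map_address(addr, task_bits, chunk_log2):
    """Replace the bits of a 32-bit address above chunk_log2 with task_bits.

    chunk_log2 is log2 of the (power-of-two) chunk size, e.g. 8 for 256 bytes.
    All names here are hypothetical.
    """
    offset_mask = (1 << chunk_log2) - 1
    # Keep the low offset bits, substitute the task's bits above them.
    return ((task_bits << chunk_log2) | (addr & offset_mask)) & 0xFFFFFFFF


print(hex(map_address(0x00000123, 0x5, 8)))
# → 0x523
```

	Separate instruction and data mappings would simply mean calling
this with different task_bits (and possibly chunk sizes) for each bus.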

=++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
=-Mark Johnson	*** DISCLAIMER: The opinions above are personal. ***	
=UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark   TEL: 408-991-0208
=US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

	I hate to admit this, but it was decided that Core ISA mandated
little-endian memory layout, since several other Core ISA users had already
implemented their CPUs that way when we questioned it.  (Will little-endianism
dog our heels forever? :-)

	VERY rough figures: 1 RPM-40 @ 40 MHz is about equal to 7-9
16 MHz 68020's with 0 wait-state memory and no MMU delay.  (Not your standard
unix box environment 68020.)

{ WARNING:  this is VERY ROUGH, and though I have calculations available that
            say this, they are very back-of-napkin style!  However, it's
	    probably not TOO far off.  Maybe we'll have real performance
	    figures at some point from GE (I don't work there anymore). }

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup