Path: utzoo!mnetor!uunet!husc6!bloom-beacon!mit-eddie!uw-beaver!tektronix!orca!tekecs!frip!andrew
From: andrew@frip.gwd.tek.com (Andrew Klossner)
Newsgroups: comp.arch
Subject: hard data on Motorola 88000
Message-ID: <9916@tekecs.TEK.COM>
Date: 18 Apr 88 19:44:54 GMT
Sender: nobody@tekecs.TEK.COM
Lines: 82

The announcement is today, so I guess it's okay to talk hard data on
the Motorola 88000 architecture.

The 88100, the CPU chip, includes a floating point processor.  The
88200 is the CMMU (cache/memory management unit).  The CPU uses a
Harvard architecture (separate memory ports for instruction and data)
so a minimum configuration is one CPU and 2 CMMUs.  It cycles at 20MHz
initially, with 25MHz expected before long.

The CPU itself, excluding the floating point unit, looks much like
everybody else's RISC CPU.  There are 32 registers, with r0 hardwired
to zero.  (No register windows.)  There is hardware stalling on a
register scoreboard.  ALU instructions take three register addresses:
two operands and a destination.  They all execute in one cycle, except
for integer multiply/divide.  (There is result forwarding, so a
destination register can be used in the next instruction without
stalling.)  Load/store instructions can take a 16 bit offset and an
index register, which can be scaled by a factor of 1, 2, 4, or 8.  To
get to an arbitrary 32-bit address, you need two instructions:

	or.u	r2,r0,hi16(address)	; high 16 bits of address to r2
	ld	r2,r2,lo16(address)	; load word into r2
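
As a rough illustration of the two-instruction sequence above, here is
a sketch in Python of how an address splits into the halves that
hi16() and lo16() name.  It assumes the 16-bit offset in the ld is
zero-extended rather than sign-extended, which the description above
does not state; split_address, effective_address and the sample address
are made up for the example.

```python
def split_address(address):
    """Split a 32-bit address into (hi16, lo16) halves."""
    hi16 = (address >> 16) & 0xFFFF   # value or.u places in the upper half of r2
    lo16 = address & 0xFFFF           # offset carried by the ld instruction
    return hi16, lo16

def effective_address(hi16, lo16):
    """Address the ld would reference after the or.u has run."""
    # ASSUMPTION: the offset is zero-extended (unsigned) before being added.
    base = (hi16 << 16) & 0xFFFFFFFF  # r2 after or.u r2,r0,hi16(address)
    return (base + lo16) & 0xFFFFFFFF

hi, lo = split_address(0x12348ABC)    # hypothetical address
print(hex(hi), hex(lo), hex(effective_address(hi, lo)))
```

If the offset were sign-extended instead, hi16 would need a carry
adjustment whenever bit 15 of the address is set.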

There is a three-deep pipeline for instruction fetch and a three-deep
pipeline for data fetch/store.  Branch instructions have one delay
slot, and each branch instruction has a bit which means execute the
instruction in the delay slot before branching.  Load instructions take
three cycles if the target memory location is already in cache.  Store
instructions get started in one cycle if the data pipeline isn't full,
otherwise they stall.
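
To make the scoreboard stalling concrete, here is a toy model in
Python.  It is only a sketch: it assumes in-order issue of one
instruction per cycle, a 1-cycle ALU result (thanks to forwarding) and
a 3-cycle load that hits in cache, and it ignores a full data pipeline
and every other hazard.  The function name and the program encoding are
invented for the example.

```python
def issue_cycles(program, latency):
    """Return the cycle at which each instruction issues.

    program is a list of (op, dest, sources) tuples; latency maps an
    op name to the number of cycles before its result can be used.
    """
    ready = {}    # register -> first cycle at which its value can be read
    cycle = 0     # earliest cycle the next instruction could issue
    issued = []
    for op, dest, sources in program:
        # Stall until every source register has been produced.
        start = max([cycle] + [ready.get(reg, 0) for reg in sources])
        issued.append(start)
        if dest != "r0":              # r0 is hardwired to zero
            ready[dest] = start + latency[op]
        cycle = start + 1
    return issued

LATENCY = {"alu": 1, "ld": 3}         # cycle counts taken from the text
program = [
    ("ld",  "r2", ["r3"]),            # load into r2
    ("alu", "r4", ["r2", "r5"]),      # needs r2, so it waits for the load
    ("alu", "r6", ["r4", "r1"]),      # needs r4, forwarded with no stall
]
print(issue_cycles(program, LATENCY))
```

In this model the first ALU instruction stalls for two cycles behind
the load, while the second one issues immediately after the first.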

The on-chip floating point unit implements floating point
add/subtract/multiply/divide/compare and integer multiply/divide.
Floating point instructions can freely mix single and double precision,
which are the usual IEEE format 32- and 64-bit words.  The add/subtract
portion is separate from the multiply portion and both are pipelined,
so, for example, there can be three multiplies going at one time.  But
the divide instruction takes over the whole FP unit and iterates
through it.  Integer multiply takes 4 cycles; integer divide takes 39.
Single precision add/sub/cmp/mul/convert takes 5 cycles; single divide
takes 30; double add/sub/cmp/convert takes 6; double mul takes 10;
double divide takes 60.  Curiously, an integer divide with a negative
operand traps and makes the kernel complete the operation; I guess
Motorola just ran out of silicon.
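
For a feel of what those cycle counts mean in wall-clock terms, here is
a small Python sketch that converts them to microseconds.  It assumes
the worst case of a fully dependent chain, where each operation waits
for the previous result and nothing overlaps in the pipelines; the
operation names and the sample chain are my own labels, not Motorola's.

```python
# Cycle counts as quoted in the text above.
LATENCY = {
    "int_mul": 4, "int_div": 39,
    "sp_add": 5, "sp_mul": 5, "sp_div": 30,
    "dp_add": 6, "dp_mul": 10, "dp_div": 60,
}

def chain_time_us(ops, clock_mhz=20):
    """Microseconds for a chain where each op waits for the previous one."""
    cycles = sum(LATENCY[op] for op in ops)
    return cycles / clock_mhz         # cycles divided by cycles-per-microsecond

# Hypothetical double precision chain: multiply, then add, then divide.
print(chain_time_us(["dp_mul", "dp_add", "dp_div"]))
print(chain_time_us(["dp_div"], clock_mhz=25))
```

Independent operations would do better than this, since the add and
multiply pipelines can each have several operations in flight.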

Each CMMU has 16k bytes of RAM, organized as a 4-way set associative
cache.  You can have as many as 4 CMMUs on each memory port.  The cache
is physically addressed, and the cache lookup, indexed by the offset
within the page, proceeds in parallel with the logical-to-physical
address translation to save time.  The MMU is a subset of Motorola's
PMMU chip, with the usual two-level page tables and all the necessary
bits (referenced, dirty, etc.) in the page descriptor words.  The CMMU
includes a page address translation cache which holds 56 entries, and
a block address translation cache which can be used to
avoid page table walks for memory that's locked down, like kernel code
and data.

A cache line is 16 bytes.  On a cache miss during fetch, the whole line
must be loaded from memory before the fetch is satisfied.  On a cache
miss during store, the whole line is loaded, then the modified word is
written to memory; a cache hit during store does not cause the word to
be written to memory.
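
The cache numbers above explain why the lookup can run in parallel with
address translation.  The Python sketch below works out the geometry;
the 4k byte page size is my assumption (the figures above don't include
it), and set_index is a hypothetical helper, not anything in the chip
documentation.

```python
CACHE_BYTES = 16 * 1024   # per CMMU, from the text
WAYS = 4                  # 4-way set associative, from the text
LINE_BYTES = 16           # cache line size, from the text
PAGE_BYTES = 4096         # ASSUMPTION: 4k byte pages; not stated in the text

sets = CACHE_BYTES // (WAYS * LINE_BYTES)
offset_bits = LINE_BYTES.bit_length() - 1        # bits selecting a byte in a line
index_bits = sets.bit_length() - 1               # bits selecting a set
page_offset_bits = PAGE_BYTES.bit_length() - 1   # bits left untranslated

def set_index(address):
    """Set number for an address; uses only bits below the page boundary."""
    return (address >> offset_bits) & (sets - 1)

print(sets, offset_bits + index_bits, page_offset_bits)
```

Under that assumption the line offset plus set index needs exactly as
many bits as the page offset, so the set can be selected from
untranslated address bits while the translation supplies the physical
tag to compare against.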

The CMMUs include logic to do bus snooping and maintain cache
coherency, so you can throw several CPU/CMMU lashups onto the same
memory bus.  Motorola is playing this up in their advertising, claiming
17 MIPS for one CPU and 50 MIPS for a multi-CPU system.

UNIX System V Release 3 is up and running (single-CPU).  A reference
port will be sold by either Motorola or Unisoft.  A binary
compatibility standard, which eventually will be blessed by AT&T and be
an ABI, is coming along.

We at Tektronix have been designing a workstation around this chip set
for several months.  I like it.

Don't ask me what price or availability are; I don't know the answers
for the general public.  As a member of the 88open consortium,
Tektronix negotiated favorable terms.

  -=- Andrew Klossner   (decvax!tektronix!tekecs!andrew)       [UUCP]
                        (andrew%tekecs.tek.com@relay.cs.net)   [ARPA]