Xref: utzoo comp.arch:8619 comp.sys.intel:730
Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!sharkey!edsews!uunet!microsoft!jangr
From: jangr@microsoft.UUCP (Jan Gray)
Newsgroups: comp.arch,comp.sys.intel
Subject: i860 overview (long)
Message-ID: <807@microsoft.UUCP>
Date: 6 Mar 89 02:25:04 GMT
Organization: Microsoft Corp., Redmond WA
Lines: 250


				i860 Overview

(what I consider interesting features of the part), taken from the
"i860(tm) 64-bit Microprocessor Programmer's Reference Manual", Order
Number 240329-001, (C) Intel Corp. 1989.

Overview

* 64 bit external data/instruction bus
* 128 bit on-chip data bus
* 64 bit on-chip instruction bus
* 8K data cache, virtual addressed, write-back, two-way "set associative",
  2x128 lines of 32 bytes
* 4K instruction cache, virtual addressed
* 64 entry TLB
* core integer RISC unit
* floating-point unit with pipelined multiply and add units (can also be
  used "unpipelined")
* some multiply-accumulate type floating point instructions
* dual instruction mode can simultaneously dispatch a 32-bit core instruction
  and a 32-bit floating-point instruction

Data Types

* BE bit in epsr (extended processor status register) selects big/little
  endian format in memory, instructions always little-endian
* 32 bit signed/unsigned integers
* IEEE 754 format single (32-bit) and double (64-bit) precision floating
  point numbers
* pixels:
  * stored as 8, 16, or 32 bits (always operates on 64 bits of pixels at a
    time)
  * colour intensity shading instructions divide pixels into fields:
    pixel size   colour 1 bits   colour 2 bits   colour 3 bits   other bits
        8        .................... N ....................       8 - N
       16              6               6               4             0
       32              8               8               8             8
    [These particular field assignments are a result of the pixel add
     instructions described below.]
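
A rough sketch of how the field widths in the table above carve up a
pixel.  The widths come from the table; the bit ordering (colour 1 in the
most significant bits) and the names FIELD_WIDTHS, unpack_pixel and
pixels_per_operand are my own assumptions for illustration, not taken
from the manual.  The 8 bit case is left out because N is not fixed.

```python
# Field widths from the table above: (colour 1, colour 2, colour 3, other).
FIELD_WIDTHS = {16: (6, 6, 4, 0), 32: (8, 8, 8, 8)}


def unpack_pixel(pixel, size):
    """Split one pixel into its fields.

    ASSUMPTION: fields are packed from the most significant bit down,
    colour 1 first.  The table does not state the ordering.
    """
    fields = []
    shift = size
    for width in FIELD_WIDTHS[size]:
        shift -= width
        fields.append((pixel >> shift) & ((1 << width) - 1))
    return tuple(fields)


def pixels_per_operand(size):
    """Pixels handled per 64 bit operand (8, 16 or 32 bit pixels)."""
    return 64 // size
```

For example, pixels_per_operand(16) gives 4, which matches the four 16 bit
fields per 64 bit operand.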

Memory Management

* NO SEGMENTS!
* 32 bit virtual addresses (translation can be disabled)
* translated identically to a 386 virtual address: two level address
  translation, with bits 31..12 of the address selecting the physical page:
  * dirbase register specifies page directory
  * 1st level: addr[31..22] specifies page directory entry, yielding 
    permissions and address of the second level page table
  * 2nd level: addr[21..12] specifies page table entry, yielding additional
    permissions and address of the physical page
  * addr[11..0]  specifies byte offset within physical page (4K pages)
* page table bits:
  * P  - page is present
  * CD - cache disable: page is not cacheable
  * WT - page is write-through.  disables internal caching.  Either CD or WT
         can be passed through to the external PTB pin, depending upon PBM
	 bit in epsr.
  * U  - user: if 0, page is inaccessible in user mode.
  * W  - writable: if 0, page is not writable in user mode, and may be writable
	 in supervisor mode depending upon WP bit in epsr.
  * A  - accessed: automatically set first time page is accessed
  * D  - dirty: traps when D=0 and page is written
  * two bits reserved, three bits user-definable
  * page directory PTE bits and second level PTE bits are combined in the most
    restrictive fashion
* 64 entry TLB
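
The two level lookup described above can be sketched roughly as follows.
The bit positions are the ones listed; modelling the page directory and
page tables as nested Python dicts, and the names split_virtual_address
and translate, are purely my own illustration.  Permission bits and the
TLB are ignored.

```python
def split_virtual_address(addr):
    """Split a 32 bit virtual address into its three translation fields."""
    directory_index = (addr >> 22) & 0x3FF   # addr[31..22]
    table_index = (addr >> 12) & 0x3FF       # addr[21..12]
    offset = addr & 0xFFF                    # addr[11..0], 4K pages
    return directory_index, table_index, offset


def translate(addr, page_directory):
    """Walk a toy two level table.

    ASSUMPTION: page_directory maps a directory index to a page table,
    which maps a table index to a physical page base address.  A missing
    key stands in for P=0 and raises KeyError.
    """
    directory_index, table_index, offset = split_virtual_address(addr)
    page_table = page_directory[directory_index]
    frame_base = page_table[table_index]
    return frame_base + offset
```

With a directory of {1: {3: 0x7000}}, virtual address 0x00403025 picks
directory entry 1, table entry 3, and byte offset 0x25 within the page.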

Caches

* Flush instruction forces a dirty data cache line (32 bytes) back to memory.
  Intel supplies suggested code to flush entire data cache.
* Storing to dirbase register with ITI bit set invalidates TLB and instruction
  caches; must flush data cache first!  [Remember, the data cache is virtually
  addressed.]
 
Core Unit

* Standard 32 bit RISC architecture:
  * 32 32-bit integer registers
  * fault instruction, psr, epsr, dirbase, data breakpoint registers
  * r0 always reads as 0
  * 8, 16, 32 bit integer load/store insns, operands must be appropriately
    aligned; byte or 16-bit values are sign extended on load.  [I hope you
    don't use "unsigned char" too much...]
  * 2 source, 1 destination add/subtract/logical (and, andnot, or, xor)
  * No integer multiply/divide instructions.  To multiply, you move the
    operands to floating point registers, use multiply (four insns plus
    five free delay slots).  To divide, you move the dividend to a floating
    point register and multiply by the reciprocal.  This can be very slow
    (59 clocks) if the divisor is a variable (hopefully infrequent).
* 32 bit shift left/right/right-arithmetic, plus 64 bit funnel shift
  ("shift right double").  They ran out of bits to specify two 32 bit sources
  plus destination plus shift count, so the shift count of the last 32 bit
  shift right (automatically stored in the 5 bit SC field of the psr) is used.
* Similar to MIPS Rx000 architecture in some ways:
  * load/store addressing mode is src1(src2), src1 is a register or 16 bit
    immediate constant.
  * form 32 bit constants using andh/andnoth/orh/xorh on upper 16 bits of
    a register
* Only one condition code bit (CC), set in various ways by signed/unsigned
  add/subtract/logical operations, unaffected by shift ops
* Delayed and non-delayed branches on CC set/not set (bc[.t], bnc[.t])
* Non-delayed branch on src1 ==/!= src2 (bte, btne)
* Strange delayed branch "bla" instruction, for one instruction looping.
  Useful for aoblss/dsz/isg type looping.  Uses its own special LCC condition
  code bit.  "Programs should avoid calling subroutines while within a bla
  loop, because a subroutine may use bla also and change LCC".  [Ug.]
* Trap, trap on integer overflow instructions
* Call/call indirect, stores return address in r1.
* Unconditional branch, branch indirect, latter also used for return and
  return from trap.
* Core unit loads and stores floating point operands of 32, 64, and 128 bits
* Pipelined floating load instruction (32/64 bits) queues an address of an
  operand not expected to be in cache, and stores the result of the third
  previous pipelined floating load into the destination floating register.
  [This is the data-loading component of the i860 "vector" support.]
* Bus lock/unlock instructions for flexible indivisible read-modify-write
  sequences.  Interrupts are disabled while the bus is locked.  "If ...
  the processor does not encounter a load or store following an unlock
  instruction by the time it has executed 32 instructions, it triggers
  an instruction fault...".
  For example: locked test and set is:
	// r22 <- semaphore, semaphore <- r23
	lock				// next cache miss load/store locks bus
	ld.b	semaphore, r22
	unlock				// next load/store unlocks bus
	st.b	r23, semaphore
* Pixel store instructions for selectively updating particular masked pixels
  in a 64-bit memory location, used for Z-buffer hidden surface elimination.
  Pixel mask is set by fzchk instructions (in floating point/graphics unit)
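
To make the funnel shift described above concrete, here is a small model
of "shift right double" picking up its count from the SC field left by
the previous 32 bit shift right.  The class Core and the helper
rotate_right are hypothetical names of mine.  I am also assuming that
src1 supplies the high half of the 64 bit value; check the manual before
relying on that.

```python
MASK32 = 0xFFFFFFFF


class Core:
    """Toy model that tracks only the 5 bit SC field of the psr."""

    def __init__(self):
        self.sc = 0

    def shr(self, count, src2):
        """32 bit logical shift right; records the count in SC."""
        self.sc = count & 0x1F
        return (src2 & MASK32) >> self.sc

    def shrd(self, src1, src2):
        """Funnel shift: low 32 bits of (src1:src2) >> SC.

        ASSUMPTION: src1 is the high word, src2 the low word.
        """
        wide = ((src1 & MASK32) << 32) | (src2 & MASK32)
        return (wide >> self.sc) & MASK32


def rotate_right(core, value, count):
    """Rotate via funnel shift: a dummy shr sets SC, then shrd does the work."""
    core.shr(count, 0)
    return core.shrd(value, value)
```

Feeding the same register in as both sources turns the funnel shift into
a rotate, at the cost of one extra shift whose only job is to load SC.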

Floating Point Unit

* 32 32 bit single precision floating point registers, can also be treated
  as 16 64 bit double precision registers.
* graphics operands also stored in the fp registers
* f0/f1 read as 0
* pipelined multiply and add units
* floating point instructions can be non-pipelined, or pipelined
* Similar to the pipelined load above, in a pipelined multiply or add
  instruction, the source operands go into the pipeline, and the result of
  the 3rd (or so) previous pipelined multiply or add is stored in the
  destination register(s).
* Pipeline lengths
  * adder:     3 stages
  * multiplier:2 or 3 stages (2 double precision, 3 single(!))
  * graphics:  1 
  * load:      3 (loads issued from core unit above)
* IEEE status bits percolate through the fp pipelines, and can be reloaded,
  along with the pipeline contents, after traps
* Divide?  Ha!  If Seymour can do it with reciprocals, so can the i860.
  The frcp and frsqr insns return an approximate reciprocal and 1/square
  root "with absolute significand error < 2^-7".  Intel supplies routines
  for Newton-Raphson approximations that take 22 clocks (*almost* single
  precision) or 38 clocks (*almost* double precision), and the Intel i860
  library provides true IEEE divide.  [RISC design principles at work:
  divides are infrequent enough that it is not worth slowing down or
  dropping some other feature to provide divide hardware.]
* Dual operation instructions (not "dual mode"): Some pipelined instructions
  cause both a pipelined add and a multiply operation to take place.  Since
  the instruction can only encode two source operands, the others are taken
  from temporary holding registers and busses connecting the two units
  in various topologies, depending upon the data path control field of the
  instruction opcode.  [Many real world computations e.g. dot product can
  make use of these instructions.]
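
Here is a sketch of why a reciprocal good to about 2^-7 is enough of a
starting point.  This is the textbook Newton-Raphson step
x' = x * (2 - d * x), not Intel's supplied routine.  The seed is faked in
software by crude_seed (a hypothetical stand-in for frcp), and I treat the
2^-7 bound as a relative error, which is my simplification.  Each step
roughly squares the error: 2^-7, 2^-14, 2^-28, 2^-56.

```python
def crude_seed(d):
    """Stand-in for frcp: 1/d spoiled by a relative error just under 2**-7.

    ASSUMPTION: this fakes the hardware seed; it is not how frcp works.
    """
    return (1.0 / d) * (1.0 - 0.99 * 2.0 ** -7)


def refine_reciprocal(d, seed, steps):
    """Apply Newton-Raphson steps; return the estimate and the error history.

    The error recorded is |1 - d*x|, which is squared by every step.
    """
    x = seed
    errors = [abs(1.0 - d * x)]
    for _ in range(steps):
        x = x * (2.0 - d * x)
        errors.append(abs(1.0 - d * x))
    return x, errors
```

Two steps get past single precision's 24 bit significand and a third gets
past double precision's 53 bits, which fits with the two routine lengths
quoted above.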

Dual Instruction Mode

* DIM allows the i860 to run both a core and a floating/graphics unit insn
  on each cycle.  The resulting 64 bit "wide instruction" must be 64
  bit aligned.
* There is a two cycle latency: two cycles after a floating instruction with
  the D bit set, both a core and a floating insn will be issued.  Similarly,
  if the D bit is clear, there will be no DIM two cycles (two instruction
  pairs) later.
* There are various sensible rules for determining the result of insn pairs
  which set/use common registers, control registers, etc.

Graphics Unit

* Pipelined and non pipelined 64 bit integer add and subtract.
* 16/32 bit non/pipelined Z buffer check instructions:
  "fzchks src1, src2, rdest (16 bit Z-Buffer Check)
   Consider src1, src2, and rdest as arrays of four 16 bit fields
   src1(0..3), src2(0..3), rdest(0..3), where zero denotes the
   least-significant field.

   PM <- PM >> 4
   FOR i = 0 to 3
   DO
     PM[i+4] <- src2(i) <= src1(i) (unsigned)
     rdest(i) <- smaller of src2(i) and src1(i)
   OD
   MERGE <- 0"
  This particular instruction compares four new (arbitrary sized) pixels,
  whose 16 bit Z-buffer values are in one of the (64 bit) sources, against
  the current Z-buffer values in the other source, setting pixel mask bits
  (controlling the pixel store insn described above) and updating the
  Z-buffer depth values.  [Neat!  Just what my (personal) graphics package
  ordered!]
* Pixel add instructions, which add fixed point values, the results
  accumulating in a special MERGE register.  You can use these to interpolate
  between (for instance) two colours as you scan convert a polygon.
* Z-buffer add instructions, for the analogous case of distance interpolation.
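
The fzchks pseudocode quoted above translates almost line for line into
the following sketch.  The function name and the choice to pass PM in and
return it are mine.  I assume PM is an 8 bit field, as the bit indices
i+4 suggest, and the MERGE register is not modelled.

```python
def fzchks(src1, src2, pm):
    """16 bit Z-buffer check on four fields packed in 64 bit operands.

    Returns (rdest, new_pm).  Field 0 is the least significant 16 bits.
    ASSUMPTION: PM is 8 bits wide; MERGE <- 0 is left out.
    """
    pm = (pm >> 4) & 0x0F                  # PM <- PM >> 4
    rdest = 0
    for i in range(4):
        a = (src1 >> (16 * i)) & 0xFFFF    # src1(i)
        b = (src2 >> (16 * i)) & 0xFFFF    # src2(i)
        if b <= a:                         # unsigned compare
            pm |= 1 << (i + 4)             # PM[i+4] <- src2(i) <= src1(i)
        rdest |= min(a, b) << (16 * i)     # rdest(i) <- smaller of the two
    return rdest, pm
```

Two calls in a row fill all eight mask bits, since each call shifts the
previous four results down and writes four new ones at the top.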

Traps

Briefly, there are instruction, floating point, instruction access, data
access, interrupt, and reset traps.  On a trap, the i860 enters supervisor
mode, saves/modifies various psr bits, saves the faulting instruction address,
and jumps to the trap handler which must be at 0xFFFFFF00.  There are various
complications for dual instruction mode, bus lock mode, and for saving/
restoring the various pipeline states.

Interlocks

The i860 is fully interlocked, so there is no need to insert nops.  You can,
of course, increase performance by reordering insns with dependencies.  For
instance, in the current implementation, referencing the result of a ld in
the next instruction can cause a one clock delay.

Other interesting timings:
* TLB miss: five clocks plus the number of clocks to finish two reads plus
  the number of clocks to set A (accessed) bit, if necessary.  [I guess Intel
  found Mips' and others' software TLB lookup unworthy...]
* ld/fld following st/fst hit: one clock.
* delayed branch not taken: one clock [to skip/annul the delay slot instruction]
* nondelayed branch taken: bc, bnc: one clock; bte, btne: two clocks
* st.c (store to a control register): two clocks.


Comments

Well, that about does it.  Quite a neat part.  I think Intel has done
themselves proud with a very clean and well-balanced design; I guess they've
been reading comp.arch... :-)  I had read rumours that this was to be a
floating point coprocessor for the x86, and had feared that it would be
burdened with lots of slave-processor crap, but that is not the case.

If I could change one thing, it would be to add Mips' on-chip external cache
control hardware.  Why hasn't anyone else picked up on this idea?  I'm
afraid that for some code (not *mine*, of course) the 4K on-chip insn cache
will be too small; a cache controller would allow you to add big external
caches with a minimum of heartache.  "I guess there's no pleasing some
people!"


Any typos/misinterpretations are my own.  I speak only for myself.

Jan Gray  uunet!microsoft!jangr  Microsoft Corp., Redmond Wash.  206-882-8080