Xref: utzoo comp.arch:8629 comp.sys.intel:733
Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!decwrl!sun!pitstop!sundc!seismo!uunet!microsoft!w-colinp
From: w-colinp@microsoft.UUCP (Colin Plumb)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: i860 overview (very very long)
Keywords: i860, 80860, iapx860, timing, opcodes, instructions, menmonics,
	pipelines
Message-ID: <808@microsoft.UUCP>
Date: 6 Mar 89 14:00:56 GMT
References: <807@microsoft.UUCP>
Reply-To: w-colinp@microsoft.uucp (Colin Plumb)
Organization: very little
Lines: 1121

Well, I just got Jan's copy of the "i860 64-bit Microprocessor Programmer's
Reference Manual" and am going to post an even longer summary.  I'll try
to avoid too much duplication.

Personal flames:
The exception handling is a disaster.  On any sort of exception, the processor
switches to supervisor mode and jumps to virtual address 0xFFFFFF00.  Then
you have to stare at the bits in the status register to figure out what
happened, handle it, and do arcane things to get the processor in a state
such that it can restart.  This involves looking at the instruction that
faulted and the one just before and parsing them a bit.  Bleah!
The floating-point unit doesn't handle denormalised, infinity, or NaN values;
it traps on them instead.  Since the business of sticking the right value
into the pipeline is so tedious, you basically need to avoid these things
altogether.

Also, interrupt return is weird.  It's overloaded onto the branch indirect
instruction.  If the status register indicates that you're inside a trap
handler, it does some interrupt-return things in addition to branching
to an address specified in a register.

Integer divide is done by converting to floating point, doing the Newton-
Raphson bit, and converting back.  The sample code they give requires 62
clocks (59 without remainder).  Can you say "divide step" boys and girls?

The instruction in the delay slot of a control transfer must not be another
control transfer instruction, including a trap.  This makes me wonder if
putting a trap there could sufficiently confuse the processor that I'd
end up in my code in supervisor mode.  No reason to believe so, just a
nasty idea that popped into my head.  (The first rule of root-crackers:
look for something which says "do not do x."  Try as many variations of x
as possible.)

System registers may be read in user mode, and writes are simply ignored.
So much for virtual machines!

Some floating-point instructions can't be pipelined; others must be.
Annoying.

This is Intel order number 240329-001, copyright Intel 1989.
Other related documents:
i860 64-bit Microprocessor (data sheet), order number 240296
i860 Microprocessor Assembler and Linker Reference Manual, 240436
i860 Microprocessor Simulator-Debugger Reference Manual, 240437

The manual I have says absolutely nothing about pinout, timing, or any such
electrical thing.

Anyway, on to the meat:

>> Introduction and Register Summary <<

There are 32 32-bit integer registers and 32 32-bit fp registers.
The fp registers are used in even/odd pairs for double-precision
operations.  The even register appears in memory at the lower address.

Other registers are the psr (processor status register, 32 bits), and epsr
(32 more bits), the db register (debugging, specifies an address to
breakpoint; reads and writes can be trapped), the dirbase register (root
pointer for page tables), fir (fault instruction register, saved PC
on a fault), fsr (floating-point status register), three special-purpose
registers (64 bits) for use in pipelined floating-point mode: KR, KI, and T,
and a 64-bit MERGE register used in pixel operations.

The psr bits, lowest to highest are:
 0: BR, Break Read
 1: BW, Break Write - these bits control breakpoints used with the db register.
	When one is set, the corresponding access to the address specified in
	the db register causes a trap.  The db register specifies a byte
	address; any access touching that byte will be trapped.
 2: CC, Condition Code - there is only one CC bit, set by the add and subtract
	instructions as a greater than/less-than flag, and by the logical
	(and, or, xor, andnot) instructions as a zero flag.
 3: LCC, Loop Condition Code - this is used by the bla instruction only
	to do add-compare-and-branch type things.
 4: IM, Interrupt Mode - external interrupt enable bit.
 5: PIM, Previous Interrupt Mode - state of the IM bit before the last trap.
 6: U, User - set if the processor is in user mode.
 7: PU, Previous User - copy of U bit as of before last trap.
 8: IT, Instruction Trap - set by the processor when a trap occurs if the
	current instruction caused a trap.  Breakpoints and the like.
 9: IN, INterrupt - set when a trap occurs if an external interrupt is
	a contributing factor.
10: IAT, Instruction Access Trap - as above, set if there was an address
	translation problem during instruction fetch.  There is no mention
	of a BERR-like pin.
11: DAT, Data Access Trap - as above, but for data accesses.  This bit is
	also set by unaligned loads/stores and BR/BW exceptions.
12: FT, Floating-point Trap - set if a floating-point error contributed to
	a trap.  Note that any combination of the trap bits may be set on
	entry to the interrupt handler at 0xFFFFFF00.  No bits set indicates
	reset/power-up.  Multiple bits set indicates multiple simultaneous
	exceptions.
13: DS, Delayed Switch - there is a 2-cycle latency between the first
	instruction in the stream with the dual-instruction mode bit set
	and the processor starting to execute two instructions per cycle,
	or between the first instruction with the bit clear and the cessation
	of double-instruction mode.  This bit is set when the first cycle of
	latency has passed, but not the second.  Note that it is set for
	both switching to dual-instruction mode and away.  The direction
	is given by the DIM bit.
14: DIM, Dual-Instruction Mode - set if the processor is in dual-instruction
	mode, executing an integer ("core") instruction and a floating-point
	one in a single cycle.  It is only set when a trap occurs; it does not
	reflect the current state of the processor, but the one before the
	trap.  The same goes for the DS bit.
15: KNF, Kill Next Floating - on trap return, if this bit is set, the next
	floating-point instruction is ignored (except for its dual-instruction
	bit).  Useful when emulating a floating-point instruction that trapped
	in dual-instruction mode, when you want to retry the "core" integer
	instruction but not the fp one.
16: X - Unused.  Undefined when read, write with 0 or saved value.
17..21: SC, Shift Count - remembers the shift used in the last SHR (shift
	right logical) instruction.  Specifies the shift count for the
	SHRD (shift right double - extract 32 bits from 64) instruction.
	Equivalent to the 29000's FC register.
22..23: PS, Pixel Size - specifies the size of a pixel for graphics operations.
	0 through 3 mean 8, 16, 32, or <undefined> bit pixels.
24..31: PM, Pixel Mask - the pixel store instruction stores those pixels in
	the current 64-bit word specified by the low-order bits of this field.
	These bits can be set by various z-buffer instructions.

The PM, PS, SC, CC, and LCC fields can be set from user mode; writes to the
other fields are ignored in user mode.
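
A trap handler has to pick these bits apart before it can do anything else.
Here is a minimal sketch of that decoding in Python, using only the bit
positions listed above.  The names PSR_FIELDS, decode_psr and trap_causes are
mine, not Intel's, and this models the layout only, not the hardware.

```python
# (name, lowest bit, width), taken from the psr bit list above.
PSR_FIELDS = [
    ("BR", 0, 1), ("BW", 1, 1), ("CC", 2, 1), ("LCC", 3, 1),
    ("IM", 4, 1), ("PIM", 5, 1), ("U", 6, 1), ("PU", 7, 1),
    ("IT", 8, 1), ("IN", 9, 1), ("IAT", 10, 1), ("DAT", 11, 1),
    ("FT", 12, 1), ("DS", 13, 1), ("DIM", 14, 1), ("KNF", 15, 1),
    ("SC", 17, 5), ("PS", 22, 2), ("PM", 24, 8),
]
TRAP_BITS = ("IT", "IN", "IAT", "DAT", "FT")


def decode_psr(value):
    """Split a 32-bit psr value into a dict of named fields."""
    return {name: (value >> low) & ((1 << width) - 1)
            for name, low, width in PSR_FIELDS}


def trap_causes(value):
    """Names of the trap bits that are set; an empty list means reset."""
    fields = decode_psr(value)
    return [name for name in TRAP_BITS if fields[name]]
```

For example, trap_causes((1 << 11) | (1 << 9)) reports both an external
interrupt and a data access trap, which is the "multiple simultaneous
exceptions" case.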

The epsr bits, lowest to highest are:
 0..7: Processor type - specifies the type of the current processor.  1
	for the i860. (Hardwired, may not be changed even by supervisor)
 8..12: Stepping number - specifies the revision. (Also hardwired)
13: IL, InterLock - set if a trap occurs in the middle of a lock/unlock
	sequence.
14: WP, Write Protect - if clear, supervisor-mode accesses ignore the write
	protect bit of a TLB entry.  If set, even supervisor-mode writes
	are disallowed.
15..16: X, unused.
17: INT, INTerrupt - the value of the INT input pin.  It looks like there
	is only one, and this bit is unqualified.  Writes are ignored.
18..21: DCS, Data Cache Size - the size of the data cache.  2^(DCS+12)
	bytes.  Currently 1, meaning 8Kbytes.  Hardwired.
22: PBM, Page-table Bit Mode - determines which of two bits in the page table
	entry is reflected on the PTB pin.  If 0, the CD bit; if 1, the WT bit.
23: BE, Big-Endian - set if the processor is in big-endian mode.  Causes the
	low 3 bits of the address bus to be complemented.
24: OF, OverFlow - set or cleared by the add and subtract instructions
	according to whether signed or unsigned (depending on the instruction)
	overflow occurs.  There is an instruction analogous to the 68000's
	TRAPV which traps if this bit is set.
25..31: X, unused.

OF is user-writable; the other fields are only writeable from supervisor mode.

The db, Data Breakpoint register contains a byte address which is watched
for accesses.  If any access touches this byte and the corresponding
bit in the psr is set, a data access trap occurs.
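
The "any access touching this byte" rule is easy to get off by one, so here
is a small Python sketch of it.  The function names are hypothetical, and I
am assuming an access of n bytes covers addresses addr through addr+n-1.

```python
def touches_breakpoint(access_addr, access_size, db):
    """True if an access of access_size bytes at access_addr covers byte db."""
    return access_addr <= db < access_addr + access_size


def data_breakpoint_trap(access_addr, access_size, is_write, db, psr):
    """Model of the db check: BR (psr bit 0) gates reads, BW (bit 1) writes."""
    br = psr & 1
    bw = (psr >> 1) & 1
    enabled = bw if is_write else br
    return bool(enabled) and touches_breakpoint(access_addr, access_size, db)
```

So a 32-bit read at 0x1000 with db = 0x1003 and only BR set traps, while the
same access as a write does not.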

The dirbase, directory base register points to the root of the page table
tree.  Standard two-level page table with 4K pages.

The dirbase bits, lowest to highest, are:
 0: ATE, Address Translation Enable - if set, address translation is enabled.
	You must flush the data cache before fiddling with this bit.
 1..3: DPS, DRAM Page Size - the i860 has support for page-mode or
	static-column DRAMs.  If two accesses differ only in the low-order
	12+DPS bits, the NENE# pin is asserted.  Zero is used for one bank
	of 256Kxn DRAMs.
 4: BL, Bus Lock - echoed to the outside world on the LOCK# pin, after one
	cycle of latency.  Controlled by the lock and unlock instructions.
	Copied to the IL bit of the epsr and cleared on a trap.
 5: ITI, Instruction cache and TLB Invalidate - when a 1 is written, the
	instruction cache and page table cache are invalidated.  Always
	reads as zero.
 6: X, unused.
 7: CS8, Code Size 8 bits - when set, instruction fetches are done from
	8-bit-wide memory instead of 64.  Used for bootstrapping from
	a ROM.  Once cleared, cannot be reset.
 8..9: RB, Replacement Block - can control which block (set) of a cache
	is replaced on a miss.  Used by the data-cache flush instruction,
	and for testing in conjunction with the next field.  For the data
	and instruction caches, which are 2-way set-associative, only the
	low bit is used.  For the TLB, both bits are used.
10..11: RC, Replacement Control - the i860 normally uses random replacement
	on all of its caches.  For testing, you can replace this with a
	deterministic algorithm.  00 is normal, 01 causes all cache
	replacements to use the set specified in the RB field, 10 causes
	the data cache to obey the RB field, and 11 disables data cache
	replacement.  I think this means hits can still occur, but no
	new information will be added to the cache.
12..31: DTB, Directory Table Base - this, with 12 low bits of 0, is the address
	of the first-level page table.
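
As a sanity check on the layout, a rough Python sketch that pulls the fields
out of a dirbase value and applies the "differ only in the low-order 12+DPS
bits" test.  decode_dirbase and same_dram_page are names I made up; nothing
here is from the manual beyond the bit positions above.

```python
def decode_dirbase(value):
    """Split a 32-bit dirbase value into the fields listed above."""
    return {
        "ATE": value & 1,
        "DPS": (value >> 1) & 0x7,
        "BL": (value >> 4) & 1,
        "ITI": (value >> 5) & 1,
        "CS8": (value >> 7) & 1,
        "RB": (value >> 8) & 0x3,
        "RC": (value >> 10) & 0x3,
        # DTB with its 12 low-order zero bits restored
        "page_directory": value & 0xFFFFF000,
    }


def same_dram_page(addr_a, addr_b, dps):
    """True if the two addresses differ only in their low 12+dps bits."""
    return (addr_a >> (12 + dps)) == (addr_b >> (12 + dps))
```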

The fir (Fault Instruction Register) holds the (virtual, I think) address of
the instruction that caused the trap.  The first time it is read, it is
unfrozen, and subsequent reads will just get the address of the ld.c
instruction doing the read.

The fsr contains all the floating-point flags:
 0: FZ, Flush Zero - if set, underflow is flushed to zero instead of raising
	a result-exception trap.
 1: TI, Trap Inexact - if set, inexact results cause a trap.
 2..3: RM, Rounding Mode - 0 through 3 mean round towards nearest, -inf, +inf,
	and 0.
 4: U, Update - always reads as zero; if set on a write of this register,
	bits 9 through 15 and 22 through 24 are written.  If clear, the
	data written to them is ignored and they are unchanged.
 5: FTE, Floating-point Trap Enable - if clear, floating-point traps are
	never reported.  Used when mucking with the pipeline in various
	ways, and when sticking software-emulated values into the fp unit.
 6: X, unused.
 7: SI, Sticky Inexact - set whenever an inexact result is generated,
	regardless of the state of the TI bit.  Cleared only by explicit
	write.
 8: SE, Source Exception - set when one of the inputs to a FLOP is invalid
	(infinity, denormal, or NaN).
 9: MU, Multiplier Underflow - on read, indicates the last multiply operation
	to come out of the pipeline underflowed.  On write (note only written
	if the U bit is set), this forces the flag on the operation in the
	first stage of the multiply pipeline.  When that operation reaches
	the end of the multiply pipeline, this is the MU bit that will
	come out.  Used for reloading the pipeline.
10: MO, Multiplier Overflow - similar to above.
11: MI, Multiplier Inexact - similar to above.
12: MA, Multiplier Add-one - similar to above, but indicates that the
	multiplier rounded up instead of down.  I'm not sure of the effects in the
	presence of sign bits.
13: AU - Adder Underflow - similar to MU.
14: AO - Adder Overflow - similar to MO.
15: AI - Adder Inexact - similar to MI.
16: AA - Adder Add-one - similar to MA.
17..21: RR, Result Register - holds the destination of a FLOP when an exception
	occurs due to a scalar op.
22..24: AE, Adder Exponent - holds the high 3 bits of the exponent coming out
	of the adder.  Used to handle exceptions involving double-precision
	inputs and single-precision outputs properly.
25: X, unused.
26: LRP, Load pipe Result Precision - holds the precision of the value in
	the last stage of the load pipeline (more about this later); set on
	dp, clear on sp.  It cannot be set by software (except by stuffing
	things into the pipe) and is provided to help save the state of the
	pipe.
27: IRP, Integer pipe Result Precision - holds the precision of the
	value in the graphics pipeline.
28: MRP, Multiplier Result Precision - holds the precision of the value
	in the last stage of the multiplier pipeline.
29: ARP, Adder Result Precision - holds the precision of the value in the
	last stage of the adder pipeline.
30..31: X, unused.

The KI and KR registers hold constant inputs to the multiplier for use in
the multiply-accumulate instructions.  The T register is between the
multiplier and the adder, again for multiply-accumulate instructions.

The MERGE register is used by graphics operations.

The bits in a page table entry (on either level) are:
 0: P, Present - if clear, the entry is invalid and the high 31 bits are
	available to the programmer.
 1: W, Writeable - if set, the page is writeable.  This is always enforced
	in user mode, and enforcement in supervisor mode is controlled
	by the WP bit of the epsr.  The effective value of this bit
	for any page table entry is the AND of the bits at the two
	levels of page tables.
 2: U, User - if set, this page is accessible at user level.  Must be
	set at both levels of page table for the page to be accessible.
 3: WT, Write-Through - if set, this page will not be cached by the internal
	data cache.  This bit can also be echoed externally, if the PBM bit
	of the epsr is set.  This bit is only used on the second (lower)
	level of page tables.  On the first level, it is reserved.
 4: CD, Cache Disable - if set, this page will not be cached by either internal
	cache.  This bit can also be echoed externally, if the PBM bit of the
	epsr is clear.  This bit is only used on the second (lower) level of
	page tables.  On the first level, it is reserved.
 5: A, Accessed - only used in the second level of page tables, and is set
	whenever the page is loaded into the TLB.
 6: D, Dirty - only used on the second level of page tables.  If clear, and
	the page is being written to, a data access fault is generated.
	(I.e. must be maintained by software.)
 7..8: X, unused.
 9..11: Reserved for use by the OS.
12..31: high-order bits of the physical address of the page or next level of
	page tables.
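
To make the permission rules concrete, here is a rough Python sketch of a
two-level walk.  Big assumptions, loudly: I am guessing the usual 10/10/12
split of the virtual address and 4-byte entries (the text above only says
"standard two-level page table with 4K pages"), memory is faked with a
read_word callback, and the A and D bits are left out entirely.  translate
and PageFault are my names.

```python
P, W, U = 1 << 0, 1 << 1, 1 << 2   # Present, Writeable, User bits


class PageFault(Exception):
    pass


def translate(read_word, dirbase, vaddr, user, write, wp=False):
    """Walk a two-level table; read_word(paddr) returns a 32-bit entry."""
    dir_index = (vaddr >> 22) & 0x3FF     # assumed: top 10 bits
    table_index = (vaddr >> 12) & 0x3FF   # assumed: next 10 bits
    pde = read_word((dirbase & 0xFFFFF000) + 4 * dir_index)
    if not pde & P:
        raise PageFault("first-level entry not present")
    pte = read_word((pde & 0xFFFFF000) + 4 * table_index)
    if not pte & P:
        raise PageFault("second-level entry not present")
    if user and not (pde & pte & U):      # U must be set at both levels
        raise PageFault("not user-accessible")
    writable = bool(pde & pte & W)        # effective W is the AND of both
    if write and not writable and (user or wp):
        raise PageFault("write-protected")
    return (pte & 0xFFFFF000) | (vaddr & 0xFFF)
```

With W clear in the second-level entry, a user write faults, but a supervisor
write goes through unless the WP bit of the epsr is set.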

There is an external pin, KEN#, that must be asserted to enable caching
of instructions and/or data.  If not asserted, the data is not put into the
cache.  The data cache must be explicitly flushed by software before you
may change page tables.

>> Integer ("core") instructions <<

There are a few instruction formats.  The most common has, high to low,
6 bits of opcode (of which the low bit, bit 26, is usually clear), 5
bits of src2, 5 bits of dest, 5 bits of src1, and 11 bits of immediate
offset.  Call this format A.

Next most common has 6 bits of opcode (low bit usually set), 5 bits of
src2, 5 bits of dest, and 16 bits of immediate constant (frequently
ignored; set to 0).  Call this format B.

A few instructions have the high 5 bits of a 16-bit immediate offset in
the dest field (bits 16..20) rather than the src1 field.  These are store
and a few branch instructions.  Call this format C.

A few instructions are like the above, and also have a 5-bit immediate
constant in the src1 field.  Call this format D.

The largest branch offsets are handled by instructions having 6 bits of
opcode and 26 bits of (signed) offset.  Call this format E.

Floating-point instructions have 010010 in the high 6 bits, 5 bits of
src2, 5 bits of dest, 5 bits of src1, 4 magic bits (P - pipeline, D -
dual-instruction mode, S - source precision, and R - result precision),
and 7 bits of opcode.  Call this format F.

Some special operations have 010011 in the 6-bit opcode field, 5 bits of
src1 in bits 11..15, and 5 bits of opcode in bits 0..4.  The other bits
are unused.  Call this format G.

> Load
The load instruction (ld) has 3 variants: ld.b, ld.s, and ld.l.
dest = mem[src1 + src2].  The opcode is 000L0I.  I controls src1.
0 if it's a register (format A), 1 if it's a 16-bit signed offset
(format B).  L is 0 if it's a byte load, 1 if it's a 16 or 32-bit
load.  In the latter case, bit 0 of the instruction is stolen from
the immediate offset and indicates 16 (0) or 32 (1) bit loads.
This bit is considered to be 0 for the addition.  Loads are always
sign-extended.
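
Putting the format A/B descriptions and the 000L0I opcode together, here is
a Python sketch of an ld encoder.  It is my reading of the layout above
(opcode in bits 26..31, src2 in 21..25, dest in 16..20, src1 in 11..15), not
something checked against an assembler; encode_ld is a made-up name, and I
am assuming bit 0 also carries the 16/32-bit flag in the register form.

```python
def encode_ld(size, src2, dest, src1_reg=None, offset=None):
    """Encode ld.b/ld.s/ld.l; size is 1, 2 or 4 bytes.

    Pass src1_reg for the register form (format A) or offset for the
    16-bit signed immediate form (format B), but not both.
    """
    if (src1_reg is None) == (offset is None):
        raise ValueError("give exactly one of src1_reg or offset")
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2 or 4")
    l_bit = 0 if size == 1 else 1
    i_bit = 0 if offset is None else 1
    opcode = (l_bit << 2) | i_bit                  # 000L0I
    word = (opcode << 26) | (src2 << 21) | (dest << 16)
    if offset is None:
        word |= src1_reg << 11                     # format A
    else:
        if not -0x8000 <= offset <= 0x7FFF:
            raise ValueError("offset does not fit in 16 signed bits")
        if size > 1 and offset & 1:
            raise ValueError("bit 0 of the offset is stolen for the size")
        word |= offset & 0xFFFF                    # format B
    if size == 4:
        word |= 1                                  # stolen bit 0: 1 = 32-bit
    return word
```

Under those assumptions, ld.l 8(r4), r5 comes out as 0x14850009.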

BTW, the Intel-suggested format is ld.x src1(src2), dest.  No more
mov dest, src nonsense!

There is 1 cycle of latency on loads, even if they hit the cache.  This
is interlocked.

> Store
Store is similar (st.b, st.s, or st.l), but it must use a 16-bit
immediate offset, and uses format C.  mem[src2 + immediate] = src1.
The opcode is 000L11.  (Again, st.x src1, imm(src2).)

Note that r0 hardwired to zero comes in handy here.  32-bit absolute
addresses must be formed by loading the high 16 bits into a register and
taking an offset from there.  Intel suggests r31 for this purpose.
Because the offsets are signed, you may have to diddle the high 16 bits
a bit to make things work properly.
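
The "diddling" is the usual trick: because the low half is sign-extended,
the high half has to be bumped by one whenever bit 15 of the address is set.
A minimal Python sketch (split_address is a hypothetical helper, roughly what
an assembler would have to do):

```python
def split_address(addr):
    """Return (high16, low16) where low16 is a signed 16-bit offset."""
    low = addr & 0xFFFF
    if low >= 0x8000:
        low -= 0x10000          # the hardware will sign-extend this offset
    # high is one more than addr >> 16 whenever low came out negative
    high = ((addr - low) >> 16) & 0xFFFF
    return high, low
```

For 0x12348000 this gives high = 0x1235 and low = -0x8000, and
(high << 16) + low gets you back to the original address.  Addresses in the
top 32K come out with high = 0, so they can be reached straight off r0.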

> Move int to fp
ixfr src, fdest moves 32 bits (no format conversion) from an integer
to an fp register.  The opcode is 000010.  There are two cycles of latency
(interlocked) until the data appears in the destination.  Format A.

000110 is reserved.

> Fp load
fld.y src1(src2), fdest and fld.y src1(src2)++, fdest are floating-point
load instructions.  They are similar to the integer load instructions,
except the data sizes are .l (32 bits), .d (64), or .q (128 bits).
The ++ autoincrement mode stores src1+src2 back into src2.  Again, the
low-order bits of the immediate offset (formats A or B) are used in
the instruction.  Bit 0 is set for autoincrement addressing, bit 1 is
set for 32-bit loads, and bit 2 is set for 128-bit loads.  Bits 1 and
2 are zero for 64-bit loads.  The opcode is 00100I, I selecting format
A or B.  For 64 or 128-bit loads, the destination register must be
even or a multiple of 4.

> Fp store
fst.y fdest, src1(src2) and fst.y fdest, src1(src2)++ are similar.
src1 can be a register here.  The opcode is 00101I.

> Pipelined fp load
pfld.z fdest, src1(src2)[++], pipelined load, has the same addressing
modes, except they use the 3-deep load pipeline (which operates
independently of scalar loads; the two may be arbitrarily interleaved).
The destination register specifies where to put the result of the 3rd
previous pfld instruction.  128-bit pipelined loads are not allowed.
Pipelined accesses do not place the data in the cache, although they
do handle cache hits properly.  The opcode is 01100I, I selecting format
A or B.

> Pixel store
pst.d freg, imm(src2)[++] stores the pixels selected by the PM field
of the psr from the 64-bit freg into memory.  If you have 16-bit pixels
and the low bits of the PM field are 0110, only the middle 4 byte strobes
will be asserted on write.  Bit 0 corresponds to the lowest address.
See the fzchks and fzchkl instructions for uses for this.  The opcode
is 001111, and it uses format B.  After the store, the PM field is shifted
right by the number of bits used, so the next pst instruction will
have access to the next bits.
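
A quick Python model of the masking, as I understand it: one PM bit per
pixel, expanded to byte strobes according to the pixel size, then PM shifted
right by the number of pixels in a 64-bit word.  The function name and the
PS-to-bytes table are mine; treat this as a sketch of the description above,
not of the silicon.

```python
PIXEL_BYTES = {0: 1, 1: 2, 2: 4}   # PS encoding -> bytes per pixel


def pst(pm, ps):
    """Return (byte_enables, new_pm) for one pst.d with the given PM and PS."""
    size = PIXEL_BYTES[ps]
    pixels = 8 // size                       # pixels per 64-bit word
    enables = 0
    for i in range(pixels):
        if (pm >> i) & 1:                    # PM bit 0 is the lowest address
            enables |= ((1 << size) - 1) << (i * size)
    return enables, pm >> pixels
```

With 16-bit pixels and PM = 0110, pst(0b0110, 1) asserts byte strobes
00111100, the middle four.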

> Add, Subtract, Compare
The add/subtract instructions also double as compare instructions.
There's addu src1, src2, dest, adds, subu, and subs.  They do the
obvious things, also setting the OF flag on overflow, and setting
the CC flag as follows:
addu: CC gets the carry from bit 31.
adds: CC gets (src2 < -src1)
subu: CC gets the carry from bit 31 (src2 <= src1)
subs: CC gets (src2 > src1)

This uses formats A and B.  If the 16-bit immediate is used, it is sign-
extended.  To get the one's complement, use subs -1, src2, dest.

The opcode is 100UAI, where U is 0 for unsigned, 1 for signed; A is
0 for add, 1 for subtract, and I selects format A or B.  (0 for A, 1 for B.)

> Shift, Rotate
The shift instructions are shl, shr, shra, and shrd.  The first three do
the obvious things, shr zero-filling high-order bits and shra sign-extending.
Of these, *only* shr copies its src1 (register or 16-bit immediate, with the
high 11 bits ignored) to the SC field of the psr.  shrd uses this to compute
dest = (((src1 << 32) | src2) >> SC) & 0xFFFFFFFF.

The opcodes are:
10100I for shl, I selects format A or B.
10101I for shr, I selects format A or B.
10111I for shra, I selects format A or B.
101100 for shrd, format A only.

None of the shifts set the condition code bit, so the assembler uses the
following macros:

mov src2, dest == shl r0, src2, dest
nop == shl r0, r0, r0
fnop == shrd r0, r0, r0

To do a rotate, shr count, r0, r0 and shrd src, src, dest.
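
In case the rotate trick isn't obvious, here it is modelled in Python.  The
shr to r0 is there only for its side effect on SC; feeding the same register
to both halves of shrd then wraps the bits around.  This is my own model of
the description above (function names included), nothing more.

```python
MASK32 = 0xFFFFFFFF


def shrd(src1, src2, sc):
    """Low 32 bits of the 64-bit value src1:src2 shifted right by sc."""
    wide = ((src1 & MASK32) << 32) | (src2 & MASK32)
    return (wide >> (sc & 31)) & MASK32


def rotate_right(value, count):
    sc = count & 31                 # shr count, r0, r0 : only sets SC
    return shrd(value, value, sc)   # shrd src, src, dest
```

rotate_right(0x12345678, 8) gives 0x78123456.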

> Trap
There is a trap instruction, which uses format B, although the source
operands are not interpreted.  The destination register is "undefined,"
so it's a good idea to use register 0.  The opcode is 010001.  This
causes an IT trap (see the psr).  The source bits can be used for
whatever.

> And, Or, Xor, Andnot
There are 4 logical instructions, and, or, xor, and andnot.  They do the
obvious things, dest = src1 & src2, dest = src1 | src2, dest = src1 ^ src2,
dest = ~src1 & src2.  The opcodes are of the form 11OPHI, where OP
specifies one of {and, andnot, or, xor}.  I specifies A or B format,
and H can be set in B format to indicate that the immediate constant
should be shifted up 16 bits before use.  Thus, to load the high 16 bits
of a register, orh immediate, r0, dest will do the trick.  Or xor.
H bit set and I bit clear is reserved.  16-bit immediate values are
zero-extended, if used.

The CC flag is set if the result is zero, otherwise it is cleared.  The
opcodes with the H bit set are andh, andnoth, orh, and xorh.

> Control register modification
ld.c and st.c are used to modify control registers.  Format A is used,
although only src2 and one of src1 (for st.c) or dest (for ld.c)
are interpreted.  The opcode is 0011L0, where L is 0 for load
(dest = special[src2]) and 1 for store (special[src2] = src1).
The src2 field holds 0 through 5 for the fir, psr, dirbase, db, fsr,
and epsr registers.  These instructions are legal in user mode, although
many writes will be ignored.

> Branches
Most of the branch instructions use format E, taking a 26-bit offset.  I assume
the offset is a word offset (in case I forgot to mention it, instructions
must be 32-bit word-aligned), but I can't find it explicitly stated.  br
(opcode 011010) is a straight branch, with 1 delay slot.

> Call
call (opcode 011011) is similar, but also puts a return address in
register r1.
bc and bnc branch if the CC flag is set or clear, respectively.  They
come in non-delayed (bc, bnc) and delayed (bc.t, bnc.t) versions.  The
opcodes are 01110T and 01111T, respectively.  T is set in the .t (delayed)
forms.

bri is an indirect branch, delayed, using opcode 010000 and format A,
I think, although only src1 is used.  bri [src1] branches to the address
specified in src1.  The low two bits of src1 are ignored.  If any of the
trap bits are set when this instruction is executed (see the psr), this
also performs an interrupt return, clearing the trap bits, copying PU
to U and PIM to IM, and doing strange things with DS and DIM.

> Loop
There's also bla, a looping-type instruction.  It's a bit weird.
First of all, if the LCC flag is set, it does a delayed branch
with a 16-bit offset, then it computes "adds src1, src2, src2" and
sets the LCC flag to what the complement of the CC flag would be for
a real adds.  This uses format C, with opcode 101101.

Intel gives the example of clearing an array of 16 single-precision
numbers to zero, starting at the address in r4:

	adds	-1, r0, r5	// r5 holds loop increment
	or	15, r0, r6	// r6 holds loop count
	bla	r5, r6, CLEAR_LOOP	// clear LCC; it doesn't matter
					// if we jump or not
	addu	-4, r4, r4	// compensate for preincrement (delay slot)
CLEAR_LOOP:
	bla	r5, r6, CLEAR_LOOP
	fst.l	f0, 4(r4)++	// delay slot

I've never seen a looping instruction quite like it.  Be careful not to
trash LCC during the loop (shades of jcxz!).
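
To convince myself the loop really stores 16 times, I find it helps to trace
it.  Here is a Python simulation of the bla semantics as described above:
branch on the old LCC, then add, then set LCC to the complement of what adds
would put in CC (src2 < -src1).  I am assuming the delay slot executes
whether or not the branch is taken; run_clear_loop is my own name.

```python
def to_signed32(x):
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x & 0x80000000 else x


def run_clear_loop(count):
    """Return the byte offsets (relative to the original r4) that get stored."""
    r4 = 0
    r5 = -1                  # adds -1, r0, r5 : loop increment
    r6 = count - 1           # or 15, r0, r6   : loop count
    lcc = False              # arbitrary; the first bla's branch doesn't matter
    stores = []

    def bla():
        """One bla r5, r6: returns True if the branch is taken."""
        nonlocal r6, lcc
        taken = lcc
        new_lcc = not (to_signed32(r6) < -r5)   # complement of the adds CC
        r6 = to_signed32(r6 + r5)
        lcc = new_lcc
        return taken

    bla()                    # set-up bla; its target is the next instruction
    r4 -= 4                  # delay slot: compensate for preincrement
    while True:
        taken = bla()
        r4 += 4              # delay slot: fst.l f0, 4(r4)++
        stores.append(r4)
        if not taken:
            break
    return stores
```

run_clear_loop(16) returns the offsets 0, 4, ..., 60, i.e. exactly 16 stores.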

Other core instructions:

> Compare-and-branch
bte src1, src2, offset and btne src1, src2, offset branch (no delay)
if src1 == src2 or src1 != src2, respectively.  They have opcode 0101EI,
where E = 1 branches on equal, and I selects format C or D.  I.e. src1
can be an immediate value, but only in the range 0..31.

> Flush
flush - flush the data cache. "In user mode, execution of flush is suppressed,"
whatever that means.  What it seems to do is force a fake load, one that fills
the data cache with garbage.  "When flushing the cache before a task switch,
the addresses used by the flush instruction should reference non-user-
accessible memory to ensure [Will wonders never cease?  A book written in the
U.S. actually got ensure/insure straight!] that cached data from the old task
is [oh, well... can't win them all] not transferred to the new task.  These
addresses must be valid and writeable in both the old and the new task's
space."

The sample code reserves a 4K hunk of memory, and does this:

// Rw, Rx, Ry, and Rz are registers
// FLUSH_P_H and FLUSH_P_L are two halves of the address of the 4K hunk,
// less 32.

	ld.c	dirbase, Rz	// assuming RB and RC fields clear
	or	0x800, Rz, Rz	// Set RC field to 2 (obey RB for data cache)
	adds	-1, r0, Rx	// Loop increment
	call	D_FLUSH	
	st.c	Rz, dirbase	// Store new RC field (in delay slot of call!)

	or	0x900, Rz, Rz	// Set RB field to 1 (was assumed 0)
	call	D_FLUSH
	st.c	Rz, dirbase	// Store new RB field (in delay slot of call!)
	xor	0x900, Rz, Rz	// Clear RB and RC fields
// Pound on DTB, ATE, or ITI fields here
	st.c	Rz, dirbase	// Store cleared values
// continue...

D_FLUSH:
	orh	FLUSH_P_H, r0, Rw
	or	FLUSH_P_L, r0, Rw	// Rw gets address of flush area
	or	127, r0, Ry		// loop counter
	bla	Rx, Ry, D_FLUSH_LOOP	// set up LCC
	ld.l	32(Rw), r0		// clear pending bus writes
D_FLUSH_LOOP:
	bla	Rx, Ry, D_FLUSH_LOOP	// Loop
	flush	32(Rw)++		// Hit every 32 bytes (cache line size)
	bri	r1			// Return - branch to (r1)
	ld.l	-512(Rw), r0		// Load from flush area to clear pending
					// writes (guaranteed cache hit).
	
I don't quite understand the bit about clearing pending writes.  I guess
it puts off address translation until the last possible moment (the write
queue uses virtual addresses), and a load to r0 is an idiom which always
generates an interlock.

The flush instruction uses opcode 001101; format B.  Bit 0 of the immediate
field selects autoincrement mode.
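
The constants in the sample are worth checking by hand.  A few lines of
Python arithmetic, using the RB/RC bit positions from the dirbase
description and the 32-byte line size from the code comment (rb_rc is my
helper; the 128-iteration count assumes bla runs its delay slot count+1
times, as in the array-clearing example):

```python
LINE_SIZE = 32                      # bytes per cache line
DCS = 1                             # epsr Data Cache Size field
cache_bytes = 2 ** (DCS + 12)       # 8192


def rb_rc(dirbase):
    """Extract (RB, RC) from a dirbase value: bits 8..9 and 10..11."""
    return (dirbase >> 8) & 3, (dirbase >> 10) & 3


iterations = 127 + 1                # loop counter 127 -> 128 flushes per call
bytes_per_call = iterations * LINE_SIZE
```

Each call to D_FLUSH walks 4096 bytes, one block of the 2-way cache, and the
two calls together cover all 8K.  0x800 sets RC to 2 with RB still 0; or-ing
in 0x900 leaves RC at 2 and sets the low bit of RB, selecting the other
block.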

That's everything in formats A through E; now for format G.  (High 6 bits
opcode = 010011, low 5 bits give secondary opcode; only one 5-bit register
field defined.)

The defined operations are:
calli: opcode 00010, performs an indirect (delayed) call via the address 
specified in the register operand.  I don't know if it reads the source
register before or after storing the return address in r1.  Could be a
way to play with coroutines.

intovr: opcode 00100, traps if the OF flag in the epsr is set.  trapv.

> Lock
lock: opcode 00001.  This is interesting.  This begins an interlocked
sequence on the next data access that misses the cache, setting the BL bit.
Interrupts are disabled and the bus is locked until explicitly unlocked.
The sequence must be restartable from the lock instruction in case a
trap occurs.  If there is more than one store, you must ensure there are
no traps after the first non-idempotent store.  I.e. keep the code on one
page and make sure all the data addresses are valid.

There is a similar unlock instruction (opcode 00111), that unlocks the bus
on the first data access that misses the cache after it.

These instructions *are* executable from user mode, but there is a 32-
instruction counter that traps if you spend too long with the bus locked.

I like those instructions.  A RTOS might like to be able to set the timeout,
but 32 instructions is a reasonable value.

Now, for the interesting part:

>> Floating-Point <<

These are all in the F format, with a 010010 opcode in the high 6 bits,
then 5 bits of src2, dest, and src1, then 4 magic bits, then 7 bits
of fp opcode.  Two of the magic bits control the source and destination
precisions.  S=0 for single and S=1 for double sources.  R=0 for
single and R=1 for double results.
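
For the curious, a Python sketch of packing an F-format word.  The register
field positions follow the format description; the positions of the four
magic bits (P in bit 10, then D, S, R down to bit 7) are my assumption from
the order they are listed in, and encode_fp is a made-up name.

```python
FP_ESCAPE = 0b010010


def encode_fp(fp_opcode, src1, src2, dest, pipelined=False, dual=False,
              double_source=False, double_result=False):
    """Pack an F-format instruction word (bit positions partly assumed)."""
    word = FP_ESCAPE << 26
    word |= (src2 & 31) << 21
    word |= (dest & 31) << 16
    word |= (src1 & 31) << 11
    word |= int(pipelined) << 10       # P (assumed position)
    word |= int(dual) << 9             # D (assumed position)
    word |= int(double_source) << 8    # S (assumed position)
    word |= int(double_result) << 7    # R (assumed position)
    word |= fp_opcode & 0x7F
    return word
```

Under those assumptions, a double-precision add with fp opcode 0110000,
src1 = f2, src2 = f4 and dest = f6 would encode as 0x488611B0.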

> Pipelines
Here comes time to explain the pipeline concepts used by the 80860.

There are 4 pipelines on the i860: multiplier, adder, graphics unit,
and floating-point loads.  These are 2/3, 3, 1, and 3 stages deep.
The multiplier is 2 stages deep for double-precision sources and 3
stages [sic] for single.  The destination format is unimportant.

Changing the FZ (flush zero), RM (rounding mode) and/or RR (result register)
bits of the fsr while there are results in the adder or multiplier pipelines
is a bad idea.

One of the magic bits in each fp instruction is the P, pipeline bit.  If
this bit is clear, the operation goes straight through the floating-point
unit.  Any results in the pipeline are lost, but the result is available by
the next instruction.  This is *not* the next cycle, but it's scoreboarded.
(This doesn't apply to the load pipeline, which is not used by scalar load
instructions.)

If the pipeline bit is set, though, then the specified dest is for the
result at the end of the pipeline and the requested operation goes in the
front.  The store is completed before the load of the source operands.
(At least conceptually.)

So initially, you must stick a few operations into the pipeline, throwing
away whatever was there (writing it to f0), then you can pump through
lots of data, then you have to stick in a few junk computations to get the
last few results.
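
The prime/pump/drain pattern is easier to see in code.  Here is a toy Python
model of a fixed-depth pipeline where each push names the destination for
whatever falls out the far end.  It captures only the ordering, with no
timing or precision, and the class and function names are mine.

```python
from collections import deque


class Pipeline:
    def __init__(self, depth):
        self.stages = deque([None] * depth)   # newest first, oldest last

    def push(self, value):
        """Insert value at the front; return what falls out of the back."""
        self.stages.appendleft(value)
        return self.stages.pop()


def squares(inputs, depth=3):
    """Square each input through a `depth`-stage pipeline."""
    pipe = Pipeline(depth)
    results = []
    for k in range(len(inputs) + depth):
        # After the real data runs out, push junk operations to drain.
        value = inputs[k] * inputs[k] if k < len(inputs) else None
        out = pipe.push(value)
        if k >= depth:          # the first `depth` outputs are stale ("f0")
            results.append(out)
    return results
```

squares([1, 2, 3, 4, 5]) returns [1, 4, 9, 16, 25]: three outputs thrown away
at the start, three junk operations at the end.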

The load pipeline, the pfld instruction, is the most straightforward,
and works as described above.

On the multiply pipeline, odd things happen if you switch source precisions
with the pipeline half-full.  If you started out in double (2-stage) mode
with B and A in the pipeline (A one stage from completion, B two), and added
single-precision computation C, you'd store A and end up with C, B and 0.0
in the pipeline.

If you started out with C, B, A, and added double-precision computation D,
you'd end up with A stored and D, C in the pipeline.  B would get lost.
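
One way to make sense of both cases is to imagine three slots, with
double-precision operations skipping the middle one.  That is purely my
guess at a mental model (the manual doesn't say how it works inside), but
this Python sketch of it does reproduce the two behaviours just described.

```python
class MultiplierPipe:
    """Toy three-slot model; slot 0 is the entry, slot 2 the exit."""

    def __init__(self):
        self.slots = [0.0, 0.0, 0.0]

    def push(self, value, double):
        """Advance the pipe by one operation and return the stored result."""
        out = self.slots[2]
        if double:
            # assumed: a double-precision op moves slot 0 straight to slot 2
            self.slots = [value, 0.0, self.slots[0]]
        else:
            self.slots = [value, self.slots[0], self.slots[1]]
        return out
```

Push A and B as doubles, then C as single: A comes out and the slots hold
C, B, 0.0.  Push A, B, C as singles, then D as double: A comes out, the
slots hold D and C, and B is gone.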

Both inputs to an operation must be of the same precision.  There are odd,
not fully explained problems with taking double source operands and
returning a single result, so the precision suffixes on floating-point
operations should generally be restricted to .ss, .sd, and .dd.

> Fmul, Fadd, Fsub
Anyway, here's a list of the simple floating-point operations:
[p]fmul.p	src1, src2, dest (opcode 0100000)
[p]fadd.p	src1, src2, dest (opcode 0110000)
[p]fsub.p	src1, src2, dest (opcode 0110001) // result = src1 - src2

The fadd or pfadd instruction may have a .ds precision suffix, as long as
one of the sources is f0.  This is used for format conversion.  The [p]fadd
instructions are used in the [p]fmov macros.

> Float to integer
[p]ftrunc.p	src1, dest (opcode 0111010)
The result of this operation is 64 bits, whose low 32 bits are the integer
(truncated) part of the floating-point src1.  It uses the adder.
[p]fix.p	src1, dest (opcode 0110010)
Same as above, but the integer part is rounded.  For both of these, the
integer is two's complement, signed.
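
A small Python sketch of the difference.  The function names mirror the
mnemonics, but the bodies are mine; in particular I'm assuming "rounded"
means round-to-nearest-even (which is what Python's round() does), and I
only model the low 32 bits of the result.

```python
def to_int32_bits(n):
    """Two's complement bit pattern of n in the low 32 bits."""
    return n & 0xFFFFFFFF

def ftrunc(x):
    return to_int32_bits(int(x))    # int() truncates toward zero

def fix(x):
    return to_int32_bits(round(x))  # assumed: round to nearest, ties to even
```

So 2.7 gives 2 from ftrunc and 3 from fix, and -2.7 gives the bit patterns
for -2 and -3 respectively.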

pfmul3.dd	src1, src2, dest (opcode 0100100)
This forces a dp multiply to use the 3-stage pipeline.  It's only intended
for reloading a pipeline.

> Multiply (integer)
fmlow.dd	src1, src2, dest (opcode 0100001)
This multiplies only the low-order bits of its operands.  dest gets the
low-order 53 bits of the product of the significands of src1 and src2.
Bit 53 of dest gets the MSB of the product.  This instruction cannot be
used in pipelined mode, does not affect the result-status bits in the fpsr,
and does not cause any traps.
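
Here is a hedged Python sketch of why this is handy for integer multiply.
I'm assuming the integers simply sit in the low bits of the operands, and
I don't model bit 53 at all.  The function names are made up.

```python
MASK53 = (1 << 53) - 1
MASK32 = 0xFFFFFFFF

def fmlow_low53(a, b):
    """Low-order 53 bits of the product (bit 53 is not modelled)."""
    return (a * b) & MASK53

def imul32(a, b):
    """32-bit integer multiply built on the low bits of the product."""
    return fmlow_low53(a & MASK32, b & MASK32) & MASK32
```

Since only the low 32 bits are kept, the same routine works for signed and
unsigned operands: 0xFFFFFFFF times 2 comes out as 0xFFFFFFFE, i.e. -2.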

> Divide, Reciprocal
frcp.p		src2, dest (opcode 0100010)
dest = 1/src2, approximately.  Absolute significand error < 2^-7.
src1 must be zero.  Use as a starting point for Newton-Raphson.
This instruction may not be pipelined.  It causes a source-exception
trap if src2 is zero.  It uses the multiplier.
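
For the curious, the Newton-Raphson step for a reciprocal is
x' = x * (2 - b*x), which roughly squares the relative error each time.
The Python below is only a sketch: frcp() here is my stand-in that fakes a
crude seed by keeping about 8 bits of the true reciprocal, not the real
instruction.

```python
import math

def frcp(b):
    """Stand-in for the hardware seed: 1/b, good to roughly 8 bits."""
    m, e = math.frexp(b)                   # b = m * 2**e, 0.5 <= |m| < 1
    approx = round((1.0 / m) * 256) / 256  # chop the significand
    return math.ldexp(approx, -e)

def reciprocal(b, iterations=3):
    if b == 0:
        raise ZeroDivisionError("source exception: src2 is zero")
    x = frcp(b)
    for _ in range(iterations):
        x = x * (2.0 - b * x)              # Newton-Raphson refinement
    return x
```

Starting from about 2^-8 relative error, two iterations get you past
single precision and three get you to the limit of double.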

> Square root
frsqr.p		src2, dest (opcode 0100011)
As above, but dest = 1/sqrt(src2), approximately, and it also traps if
src2 < 0.
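
The matching Newton-Raphson step for a reciprocal square root is
y' = y * (1.5 - 0.5*x*y*y).  Again a Python sketch only; frsqr() below is
my stand-in that fakes a crude seed from the library square root, and the
names are invented.

```python
import math

def frsqr(x):
    """Stand-in for the hardware seed: 1/sqrt(x), good to roughly 8 bits."""
    if x <= 0:
        raise ValueError("source exception: src2 is zero or negative")
    m, e = math.frexp(1.0 / math.sqrt(x))
    return math.ldexp(round(m * 256) / 256, e)

def rsqrt(x, iterations=3):
    y = frsqr(x)
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)    # Newton-Raphson refinement
    return y
```

If you want sqrt(x) itself, multiply the result by x.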

> Fcmp
pfgt.p		src1, src2, dest (opcode 0110100, R bit clear)
pfle.p		src1, src2, dest (opcode 0110100, R bit set)
pfeq.p		src1, src2, dest (opcode 0110101)
These instructions perform floating-point comparison using the adder.
They begin with "p" because they advance the pipeline one stage
(the value they insert is undefined, but not an error), but they
place the result of the comparison (src1 > src2, src1 <= src2,
src1 = src2) in the CC bit immediately.  There is no pipeline delay.
(Actually, there is one cycle of latency, but it's scoreboarded.)
They do trap on invalid inputs.

> Multiply-accumulate
The following instructions are called dual-operation instructions, since
they use both the adder and multiplier.  Not to be confused with dual-
instruction mode.  Combining both of these gives the claimed 150 MOPS.

pfam.p	 	src1, src2, dest (opcode 000xxxx)
pfmam.p	 	src1, src2, dest (opcode 000xxxx)
pfsm.p	 	src1, src2, dest (opcode 001xxxx)
pfmsm.p	 	src1, src2, dest (opcode 001xxxx)

These instructions are really complex families of instructions.
They perform variations on multiply-accumulate.  The xxxx is the DPC
(Data-Path Control) field.

The precision specifies the input and output precisions of the multiplier;
the adder takes inputs and outputs of the destination precision.

Here is where the KI, KR, and T registers come in.  The possible data flows
are complex, but:

The value written into dest can be the result of either the adder or
multiplier pipeline.

The multiplier's src1 can be the instruction's src1, KI, or KR.  If it is one
of the K registers, the instruction's src1 can be copied into it prior to
use, or you can use its current value.

The multiplier's src2 can be the given src2 or the value written into
dest.

The multiplier's result can be written into the T register as well as sent
to the destination register.

The adder's src1 can be the instruction's src1 (if the multiplier hasn't
usurped it), the value written into dest (again, if nobody else has it),
or the value in the T register (which can be whatever it used to be or the
value written by the multiplier).

The adder's src2 can be the result of the multiplier, the value written into
the dest register, or the given src2 (assuming the multiplier hasn't
stolen it).

When you add the fact that the adder can compute src1+src2 or src1-src2,
you have a total of 64 possibilities.

A bit in the opcode specifies whether the adder adds or subtracts, and the
P bit is used to specify which output goes to the dest register (0 = adder,
1 = multiplier (and the adder's result is thrown away)).

After this factoring, there are 16 cases.  Eight can be represented by the
DPC field values 0XYZ, where:
X controls whether "K" means KR (X=0) or KI (X=1),
Y controls whether the adder's src2 is the result of the multiplier (Y=0)
or the result of the multiplier goes into T and the adder's src2 is
the result that gets written into the dest register (Y=1), and
Z controls whether the instruction's src1 goes to the adder's src1
(Z=0) or the instruction's src1 goes to K (and thence to the multiplier)
and the adder's src1 comes from T (which may have come from the
multiplier).

DPC values of the form 1XY0 cover cases where the multiplier's inputs are
K and the result written to dest (K is controlled by X, as above) and the
adder's inputs are the instruction's src1 and src2.  Y controls whether
T is loaded with the result of the multiplier (Y=0) or not (Y=1).

DPC values of the form 1XY1 cover cases where the multiplier's inputs are
the instruction's src1 and src2.  If X is 1, then the adder's src1
is T (which is not loaded from the multiplier's result) and Y controls
whether the adder's src2 is the multiplier's result or the value written
to the dest register.  (Note that these may be the same value.)

If X is 0, then the adder's second input is the result of the multiplier
(which is not written into T), and its first input is controlled by Y.
If 0, it's the value written into the dest register; if 1, it's the
T register.

Are you suitably confused?  Pictures do help somewhat.  Intel supplies
transliteration rules for producing mnemonics from these various
connections, but I won't go into them here.

Scoreboard alert: when the multiplier's src1 is the instruction's src1,
this must not be the same as rdest.  Something screws up.

>> Graphics operations <<

These also use the fp instruction encoding and register set.  But they use
a separate graphics pipeline which is only one stage deep - i.e. when you
start one instruction, you get the result of the previous one out.
As with the floating-point instructions, most have pipelined and non-
pipelined versions, which behave analogously.
(The graphics operations use fp opcodes 1xxxxxx; I've already covered
everything of the form 0xxxxxx.)

> Long long
The basic ones are long-integer operations:
[p]fiadd.w	src1, src2, dest (opcode 1001001)
.w is .ss or .dd for 32 or 64-bit adds.  The CC is not set, and no traps
are signalled.

[p]fisub.w	src1, src2, dest (opcode 1001101)
dest = src1-src2

There are move macros that use these instructions with f0.

> Z-buffer
[p]fzchks	src1, src2, dest (opcode 1011111)
[p]fzchkl	src1, src2, dest (opcode 1011011)
These instructions do z-buffer operations.  The short form takes the sources
as 4 fields of 16 bits each, and does 4 simultaneous compares, with the
results written to the PM (Pixel Mask) field of the psr.  In fact, what
happens is that the PM is shifted right 4 bits and the most significant 4
bits are set with the results of (src2 <= src1), for each of the 4 fields.
The value produced by the operation is the result of 4 parallel minimum
operations, i.e. the updated z-buffer.

The long form, [p]fzchkl, does the same, except it uses 2 32-bit wide
fields, shifts PM by 2 bits, and updates the high 2 bits.  The shift
allows you to rapidly compute 8 bits worth of z-buffer values.
The size of the z-buffer is independent of the pixel size set in the PS
field of the psr.
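
A Python sketch of the short form as I read it.  My assumptions: the
compares are unsigned, PM is 8 bits wide, and field i lands in PM bit 4+i.
None of that is spelled out above, and pack() is just a helper I made up.

```python
def pack(fields):
    """Pack four 16-bit fields (lowest first) into one 64-bit value."""
    return sum((f & 0xFFFF) << (16 * i) for i, f in enumerate(fields))

def fzchks(src1, src2, pm):
    result, mask = 0, 0
    for i in range(4):
        a = (src1 >> (16 * i)) & 0xFFFF
        b = (src2 >> (16 * i)) & 0xFFFF
        if b <= a:                       # the (src2 <= src1) test
            mask |= 1 << i
        result |= min(a, b) << (16 * i)  # updated z value for this field
    # Shift PM right by 4 and drop the new results into the top 4 bits.
    new_pm = ((pm >> 4) | (mask << 4)) & 0xFF
    return result, new_pm
```

Two of these in a row leave eight fresh mask bits in PM, ready for a
pixel store.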

> Phong shading
[p]faddp	src1, src2, rdest (opcode 1010000)
This instruction does pixel interpolation into the MERGE register.
I don't quite understand how this instruction is useful, but it
does something unusual.  Assume 8-bit pixels specified in the PS field
of the PSR.

faddp takes src1 and src2 as consisting of 4 16-bit words, adds each
field together, and writes the high bytes of each word (if you consider
the words to be fixed-point 8.8 bit numbers, it writes the integer
parts) to the MERGE register.  The MERGE register has been shifted
down 8 bits at the same time, so two of these instructions will fill
it with pixel values.
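
Here is how I picture the 8-bit case, as a Python sketch.  Assumptions of
mine: the four fields are added independently with no carry between them,
rdest receives the field-wise sums, and each field's integer byte lands in
the high byte of the matching 16-bit slot of a 64-bit MERGE.

```python
def pack(fields):
    """Pack four 16-bit fields (lowest first) into one 64-bit value."""
    return sum((f & 0xFFFF) << (16 * i) for i, f in enumerate(fields))

def faddp8(src1, src2, merge):
    total = 0
    for i in range(4):
        a = (src1 >> (16 * i)) & 0xFFFF
        b = (src2 >> (16 * i)) & 0xFFFF
        total |= ((a + b) & 0xFFFF) << (16 * i)  # 8.8 fixed-point add
    # Old pixels move down a byte; the new integer parts take the high bytes.
    merge = ((merge >> 8) & 0x00FF00FF00FF00FF) | (total & 0xFF00FF00FF00FF00)
    return total, merge
```

Call it twice, feeding the returned sums back in as src1 with the same
per-pixel step in src2, and MERGE holds eight interpolated pixels.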

If the pixels are 16 bits wide, it will do the same, except the fields
are considered to be 6.10 bit fixed-point numbers, with the high 6 bits
loaded into the MERGE register, which has been shifted down 6 bits.
(After two shifts, two bits won't fit and get truncated from one of
the fields - thus the 6/6/4 RGB format you see flying around.  This is
the only place it appears.)

If the pixels are 32 bits wide, the fields are taken to be 32 bits wide,
with the high bytes of each of the two copied to the MERGE register, which
has been shifted down 8 bits.

There is also a similar [p]faddz instruction (opcode 1010001), which
does the same thing with 16.16 bit fields, shifting the MERGE register
16 bits at a time.  Intel seems to be really keen on this sort of
operation.  I wish I knew what it was good for.

You can do the same thing with 32.32 bit fields, by doing two long adds
on the corresponding parts of src1 and src2, then using a single-precision
move to copy the destination parts into a register pair.

[p]form		src1, dest (opcode 1011010)
dest = src1 | MERGE
MERGE = 0
This instruction lets you read the MERGE register after you've pounded on
it a while, setting any last bits you need to tweak and clearing it
for future action.

> Move fp to int
fxfr		src1, dest (opcode 1000000)
This moves single-precision floating-point register src1 to integer
register dest.  The opposite of ixfr.  [These mnemonics aren't very mnemonic.]

>> Dual-Instruction Mode <<
One of the magic bits in each fp instruction is the D, dual-instruction
bit.  Intel suggests using either a d. prefix to the mnemonic or
assembler directives .dual and .enddual.

If the processor comes across an instruction (which must be aligned on a
64-bit boundary) with the D bit set, then it executes the next instruction
(integer ("core") op or fp op with D bit set) and starts reading instructions
64 bits at a time.  The low-order instruction must be an FP op, and the high-
order must be an integer ("core") op.  Exception: the fnop (lsrd r0, r0, r0)
instruction is allowed in the fp slot.  Both these instructions are
executed simultaneously.

To get out of dual-instruction mode, have an fp op (FLOP) without the D
bit set.  This pair, and the next, will still be executed in dual-instruction
mode, but after that you're back to single.  A degenerate case is a single
FLOP in a stream with the D bit set, followed by one with it clear.
The next two instructions will be executed as a pair, and then back to single
mode.

Executing two instructions at once requires some extra rules:

- If a branch on CC is paired with a floating-point compare, the branch tests
  CC before the compare sets it.
- If an ixfr, fld, or pfld instruction is paired with a FLOP, the FLOP
  gets the register value before the other instruction updates it (or
  marks it as pending in the scoreboard, really).
- If an fst or pst operation stores a register which is written to by
  the instruction it's paired with, the new value is stored to memory.
- An fxfr instruction that conflicts with a source operand in the
  core operation paired with it will store after the core op has read
  the register.  "The destination of the core operation will not be
  updated if it is any if the integer register.  Likewise, if the core
  instruction uses autoincrement addressing, the index register will not
  be updated."  Typo?  I think this means the fxfr steals the write bus
  from the core processor, and the core processor's write goes to the
  bit bucket.
- If both instructions set the CC, the FLOP will win.

- If the FLOP is scalar and the core operation is fst or pst, it should
  not store the result of the FLOP.  When the core OP is pst, the FLOP
  must not be [p]fzchks or [p]fzchkl.  Conflict over the PM field, y'know.
- When the core op is ld.c or st.c (diddles control registers), it must
  be paired with fnop.
- You cannot use the return-from-interrupt functionality of bri in dual-
  instruction mode.
- A FLOP which sets CC cannot be paired with a compare-and-branch core
  instruction.  I.e. pfeq and pfgt conflict with b[n]c.t.  b[n]c.t
  also conflict with a pfeq or pfgt instruction in the next pair, too.
- "When the FLOP is fxfr, the core operation cannot be ld, ld.c, st, st.c,
  call, ixfr, or any instruction that updates an integer register
  (including autoincrement indexing)."

- You can't start to exit from dual-instruction mode on an instruction paired
  with a control-transfer instruction.  I.e. if the FLOP before had D set,
  so must the FLOP paired with the branch.
- You can't start to switch to or from dual-instruction mode on the instruction
  following a bri (in its delay slot).

Enough rules?  Well, you should have known it was gonna be a bit ugly.
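
If you were writing an assembler, some of the "must not" rules above could
be checked mechanically.  A rough Python sketch follows; the function and
the sets are mine, it covers only four of the rules, and it expects core
mnemonics spelled exactly as in the list (I assume pfle counts along with
pfgt since it shares the opcode).

```python
CC_SETTING_FLOPS = {"pfeq", "pfgt", "pfle"}
ZCHECK_FLOPS = {"fzchks", "pfzchks", "fzchkl", "pfzchkl"}
FXFR_BAD_CORE = {"ld", "ld.c", "st", "st.c", "call", "ixfr"}

def pairing_problems(core, flop):
    """Return a list of complaints about one core/FLOP pair."""
    base = flop.split(".")[0]      # drop any precision suffix
    problems = []
    if core in ("ld.c", "st.c") and base != "fnop":
        problems.append("ld.c/st.c must be paired with fnop")
    if core == "pst" and base in ZCHECK_FLOPS:
        problems.append("pst and fzchk both want the PM field")
    if base == "fxfr" and core in FXFR_BAD_CORE:
        problems.append("fxfr cannot pair with this core op")
    if base in CC_SETTING_FLOPS and core in ("bc.t", "bnc.t"):
        problems.append("CC-setting FLOP paired with b[n]c.t")
    return problems
```

An empty list means none of these four rules object; it says nothing about
the rest (autoincrement, the next-pair conflict, mode switching, and so on).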

>> Traps, Interrupts, Exceptions, etc. <<

As I mentioned, this is not well done.

When a trap occurs, bits are set in the psr (and maybe fpsr, if the FT bit
in the psr is set) to indicate contributing factors, and then the U and IM
bits are copied to the PU and PIM bits, then cleared (disabling interrupts
and switching to supervisor mode), the DIM and DS flags are set as needed,
and the fir is set up.
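
The U/IM shuffle is easy to show in a few lines of Python.  This is only a
sketch: the psr is a plain dictionary with field names as keys, which is
my invention, and the trap bits, DIM, DS, and fir are left out.

```python
def enter_trap(psr):
    after = dict(psr)
    # Remember the interrupted mode and interrupt-enable state...
    after["PU"], after["PIM"] = psr["U"], psr["IM"]
    # ...then drop into supervisor mode with interrupts disabled.
    after["U"] = 0
    after["IM"] = 0
    return after
```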

(In dual-instruction mode, the fir will point to the FLOP in the low-order
half of the pair.  If the problem was just a data-access fault, the FLOP
(unless it was fxfr) completes, and you should not reexecute it on
interrupt return.  Instruction and data-access faults are always the fault
of the core instruction.)

After this setup, the processor jumps to virtual address 0xFFFFFF00.
Then you have to figure out what's going on and fix it.  The state of
the processor consists of:

- The register files
- The four pipelines
- The KI, KR, and T registers
- The MERGE register
- The psr, epsr, and fsr.
- The fir, and
- The dirbase register (with its dependencies on the data cache)

A simple interrupt return consists of
- Restoring the register files, pipelines, KI, KR, T, and MERGE registers
  (not necessary for simple interrupt handlers), except for one register
  which holds the return address from the fir.
- Undoing the effect of an autoincrement instruction which must be
  reexecuted (parse the instruction at [fir] to figure this out)
- See if you need to back up the return address by one instruction
- Set up the psr, possibly setting the KNF bit, and definitely setting
  at least one trap bit.
- Execute an indirect branch (bri) to do the interrupt return, and in its
  delay slot,
- Restore the register that holds the resumption address.  The processor
  is still in supervisor mode here, so you don't need to pollute the
  user's address space.

> Backing up the return address
If the instruction before the one pointed to by the fir is a delayed branch,
you should back up and re-execute it.  If it is a bla, you need to undo its
add instruction.

There is an exception to this where you bombed out on a floating-point
compare instruction you need to emulate and the instruction before is
a conditional delayed branch.  Here, you need to leave the CC alone so
the branch will do the right thing, and set it so the fp compare
will seem to have done the right thing.  You need to compute where the
conditional branch would put you and resume there.

If you are backing up, and in dual-instruction mode, you should set
things up (DS set, DIM clear) so the core instruction will be executed
in single-instruction mode, then DIM will be re-entered.  If DS was
originally set, clear it.

Plus, you have to worry about the case that the instruction at fir-4
might not exist.  Intel suggests that you begin each code segment with
a nop instruction to avoid this problem.

> Setting KNF
KNF should be set if you have emulated a floating-point instruction that
trapped, or if you got only a data-access fault in dual-instruction mode
and the FLOP was not fxfr.  [Is that perfectly clear?]
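
Written as a predicate, in Python, with argument names I made up (the
handler would have to work each of them out from the psr and the
instruction at fir):

```python
def should_set_knf(emulated_fp, only_data_access_fault, dual_mode, flop):
    """True if KNF should be set before the interrupt return."""
    if emulated_fp:
        return True      # the trapped FLOP has already been emulated
    # Data-access fault only, in dual mode: the FLOP already completed,
    # unless it was fxfr.
    return only_data_access_fault and dual_mode and flop != "fxfr"
```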

> Saving the pipeline
Doing this is messy.  Basically, you need to read out all the results
(and the associated error codes for the adder and multiplier pipelines)
to store them, and then push operations with the equivalent answers back
on restore.  For the load pipeline, store the values read in memory
somewhere and reload it from there afterwards.  For the graphics pipeline,
you can just read it with a pfiadd, and restore it the same way (add 0
to the recalled value).  The MERGE register also needs to be stored.

For the floating-point pipelines, you need to get all the values out,
including error conditions, and the KI, KR, and T registers.  To put
them back, first stuff the KR, KI, and T registers, then place value+0
and value*1 computations into the various pipelines, along with the
proper error bits.  There's sample code to do this in the data book,
and it's not particularly pretty.
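
The shape of the save/restore dance, in toy Python.  This is not the data
book's code: Pipe is a made-up model, the error bits and the K and T
registers are ignored, and "value + 0.0" stands in for whichever identity
operation the real pipeline needs.

```python
from collections import deque

class Pipe:
    """Toy fixed-depth pipeline: each push returns the oldest entry."""
    def __init__(self, depth):
        self.depth = depth
        self.stages = deque([0.0] * depth)

    def push(self, value):
        out = self.stages.pop()
        self.stages.appendleft(value)
        return out

def save_pipe(pipe):
    # Push junk until every pending result has fallen out, oldest first.
    return [pipe.push(0.0) for _ in range(pipe.depth)]

def restore_pipe(pipe, saved):
    # Re-insert each value via an operation whose answer is that value.
    for value in saved:
        pipe.push(value + 0.0)
```

After restore_pipe() the stages hold the same values in the same order as
before save_pipe(), so the interrupted code sees its results come out when
it expects them.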

>> Calling Conventions <<

Intel has a suggested calling convention.  Although the border is still
fuzzy, the manual suggests r0-r15 and f0-f15 as callee-saves, and the
other half as caller-saves.  r1 is the return address, r2 is the stack
pointer, and r3 is the frame pointer.  Parameters are passed in r16 through
r27 and f16 through f27, and the others are used for scratch.  r31 is reserved
for address computations.

They suggest that even single-precision float arguments be passed in a
register pair, and anything that won't fit into registers be passed
on the stack C-style.  The stack pointer should always be 16-byte
aligned so the 128-bit loads can be used easily.
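
The alignment rule amounts to a mask.  A quick Python sketch, assuming a
stack that grows toward lower addresses (my assumption; the function names
are made up):

```python
def align_down(addr, alignment=16):
    """Round addr down to a multiple of alignment (a power of two)."""
    return addr & ~(alignment - 1)

def alloc_frame(sp, size):
    """New stack pointer after reserving at least `size` bytes."""
    return align_down(sp - size)
```

So reserving 20 bytes below 0x1000 actually moves the stack pointer to
0xFE0, wasting 12 bytes to keep 128-bit loads happy.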

> Memory map
They also suggest a memory map.
It starts with 4K of unreadable memory (NULL-catcher), then user data,
and heap.  Then empty space until you hit the stack, then shared-memory
frames, and OS data, topping out at 0xF0000000.  Then comes a jump table to
standard library routines until 0xF0400000, then user code (text), blank
space, and then the OS up at the top of memory.

>> Sample code <<

The manual gives a bunch of sample code.  I won't reproduce it, but will
list what's there:

- Sign-extending a value in a register (shl, shra)
- Loading unsigned integers (ld, and)
- single-precision FP divide (approximate, two iterations Newton-Raphson
  unpipelined, 22 cycles, 2 ulp worst-case error)
- DP fp divide (three iterations Newton-Raphson, also 2 ulp, 38 cycles)
- Integer multiply (move to fp, use fmlow, move back; 9 clocks, five
  of which can be overlapped)
- Signed int to double (7 cycles; 3 can be overlapped)
- Signed integer divide (62 cycles, 59 without remainder)
- Null-terminated string copy (byte-at-a-time, simple)
- Example of pipelined adds
- Example of pipelined multiply-accumulate
- Example of dual-instruction mode
- Cache strategies for matrix dot product (e.g. keep both matrices in
  cache; keep one and use pipelined loads on the other)
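
The first item is the usual shift-up, arithmetic-shift-down trick.  In
Python, emulating 32-bit registers (the helper bodies are mine, named after
the shl and shra mnemonics):

```python
MASK32 = 0xFFFFFFFF

def shl(x, n):
    """Logical shift left in a 32-bit register."""
    return (x << n) & MASK32

def shra(x, n):
    """Arithmetic shift right in a 32-bit register."""
    if x & 0x80000000:
        x -= 1 << 32      # reinterpret the bit pattern as negative
    return (x >> n) & MASK32

def sign_extend(x, bits):
    """Sign-extend the low `bits` bits of x to 32 bits."""
    n = 32 - bits
    return shra(shl(x, n), n)
```

For example, the byte 0x80 becomes 0xFFFFFF80 and 0x7F stays 0x7F.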

>> Pipeline Interlocks <<

Everything's single-cycle, but here's what can interlock:

i-cache miss: given in terms of pin timing, plus two cycles if d-cache miss
in progress simultaneously.

d-cache miss (on load): again, pin timing, but it seems to be "clocks from
ADS asserted to READY# asserted"

fld miss: d-cache miss plus one clock

call,calli,ixfr,fxfr,ld,ld.c,st,st.c,pfld,fld or fst
with data cache miss in progress - stalls until miss satisfied, plus one cycle

ld, call, calli, fxfr and ld.c have 1 cycle of latency (next instruction
will stall if scoreboard hits)

fld, pfld and ixfr have 2 cycles of latency.

addu, adds, subu, subs, pfeq, pfgt, and pfle have 1 cycle of latency to
update the CC bit.  A branch on that bit will stall.

The multiplier's src1 must be in the register file; if it is the result of
the previous instr, you get a 1-cycle stall.

Scalar FLOPS fadd, fix, fmlow, fmul.ss, fmul.sd, ftrunc and fsub have 3
cycles of latency.  fmul.dd has four.  If the input and output precisions
differ (e.g. fmul.sd), add one cycle.  Plus one if the following FLOP is
pipelined and has dest <> f0.
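
My reading of those rules as a Python function.  Treat it as a guess: I
take the base to be 3 cycles (4 for fmul.dd) and stack the two "plus one"
penalties on top, so fmul.sd comes out as 4.  The function and argument
names are invented.

```python
def scalar_flop_latency(op, next_flop_pipelined_to_real_dest=False):
    """Cycles of latency for a scalar FLOP such as 'fadd.ss'."""
    _, _, precision = op.partition(".")
    cycles = 4 if op == "fmul.dd" else 3
    # Input and output precisions differ (e.g. the "sd" in fmul.sd).
    if len(precision) == 2 and precision[0] != precision[1]:
        cycles += 1
    # Following FLOP is pipelined and writes something other than f0.
    if next_flop_pipelined_to_real_dest:
        cycles += 1
    return cycles
```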

TLB miss takes 5 cycles plus two reads, plus setting the A bit (if necessary).

if three pfld's are outstanding and you execute one more, you will
stall until the first completes, plus one cycle

a pfld data-cache hit costs two clocks

if the store pipe is full (one on bus plus two pending internally), another
access will delay until the current access completes, plus one cycle

a load (or fld) following a store cache hit - one clock

delayed branch not taken - costs one clock

nondelayed branch taken - one clock for bc, bnc; two for bte, btne.

bri - one clock

st.c - two clocks

there is no forwarding from the graphics unit to the adder, multiplier,
or itself, so there is one cycle of latency there

a flush has two cycles of latency

an fst takes one cycle to get the value out of the register, so if the
next instruction overwrites the register being stored, it will stall

>> The End <<

And that, boys and girls, is basically the complete contents of the
programmer's reference manual.  Enjoy!

(52K ug... let's see if we can bomb any mailers!)
-- 
	-Colin (uunet!microsoft!w-colinp)

"Don't listen to me.  I never do." - The Doctor