Xref: utzoo comp.arch:8629 comp.sys.intel:733
Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!decwrl!sun!pitstop!sundc!seismo!uunet!microsoft!w-colinp
From: w-colinp@microsoft.UUCP (Colin Plumb)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: i860 overview (very very long)
Keywords: i860, 80860, iapx860, timing, opcodes, instructions, menmonics, pipelines
Message-ID: <808@microsoft.UUCP>
Date: 6 Mar 89 14:00:56 GMT
References: <807@microsoft.UUCP>
Reply-To: w-colinp@microsoft.uucp (Colin Plumb)
Organization: very little
Lines: 1121

Well, I just got Jan's copy of the "i860 64-bit Microprocessor Programmer's Reference Manual" and am going to post an even longer summary. I'll try to avoid too much duplication.

Personal flames:

The exception handling is a disaster. On any sort of exception, the processor switches to supervisor mode and jumps to virtual address 0xFFFFFF00. Then you have to stare at the bits in the status register to figure out what happened, handle it, and do arcane things to get the processor in a state such that it can restart. This involves looking at the instruction that faulted and the one just before and parsing them a bit. Bleah!

Since the floating-point unit doesn't handle denormalised, infinity, or NaN values (it traps on them instead), and the business of sticking the right value into the pipeline is so tedious, you basically need to avoid these things altogether.

Also, interrupt return is weird. It's overloaded onto the branch indirect instruction. If the status register indicates that you're inside a trap handler, it does some interrupt-return things in addition to branching to an address specified in a register.

Integer divide is done by converting to floating point, doing the Newton-Raphson bit, and converting back. The sample code they give requires 62 clocks (59 without remainder). Can you say "divide step" boys and girls?
The instruction in the delay slot of a control transfer must not be another control transfer instruction, including a trap. This makes me wonder if putting a trap there could sufficiently confuse the processor that I'd end up in my code in supervisor mode. No reason to believe so, just a nasty idea that popped into my head. (The first rule of root-crackers: look for something which says "do not do x." Try as many variations of x as possible.)

System registers may be read in user mode, and writes are simply ignored. So much for virtual machines!

Some floating-point instructions can't be pipelined; others must be. Annoying.

This is Intel order number 240329-001, copyright Intel 1989. Other related documents:

i860 64-bit Microprocessor (data sheet), order number 240296
i860 Microprocessor Assembler and Linker Reference Manual, 240436
i860 Microprocessor Simulator-Debugger Reference Manual, 240437

The manual I have says absolutely nothing about pinout, timing, or any such electrical thing.

Anyway, on to the meat:

>> Introduction and Register Summary <<

There are 32 32-bit integer registers and 32 32-bit fp registers. The fp registers are used in even/odd pairs for double-precision operations. The even register appears in memory at the lower address.

Other registers are the psr (processor status register, 32 bits), the epsr (32 more bits), the db register (debugging, specifies an address to breakpoint; reads and writes can be trapped), the dirbase register (root pointer for page tables), fir (fault instruction register, saved PC on a fault), fsr (floating-point status register), three special-purpose registers (64 bits) for use in pipelined floating-point mode: KR, KI, and T, and a 64-bit MERGE register used in pixel operations.

The psr bits, lowest to highest, are:

0: BR, Break Read
1: BW, Break Write - these bits control breakpoints used with the db register. When one is set, the corresponding access to the address specified in the db register causes a trap.
The db register specifies a byte address; any access touching that byte will be trapped.
2: CC, Condition Code - there is only one CC bit, set by the add and subtract instructions as a greater-than/less-than flag, and by the logical (and, or, xor, andnot) instructions as a zero flag.
3: LCC, Loop Condition Code - this is used by the bla instruction only to do add-compare-and-branch type things.
4: IM, Interrupt Mode - external interrupt enable bit.
5: PIM, Previous Interrupt Mode - state of the IM bit before the last trap.
6: U, User - set if the processor is in user mode.
7: PU, Previous User - copy of the U bit as of before the last trap.
8: IT, Instruction Trap - set by the processor when a trap occurs if the current instruction caused a trap. Breakpoints and the like.
9: IN, INterrupt - set when a trap occurs if an external interrupt is a contributing factor.
10: IAT, Instruction Access Trap - as above, set if there was an address translation problem during instruction fetch. There is no mention of a BERR-like pin.
11: DAT, Data Access Trap - as above, but for data accesses. This bit is also set by unaligned loads/stores and BR/BW exceptions.
12: FT, Floating-point Trap - set if a floating-point error contributed to a trap.
Note that any combination of the trap bits may be set on entry to the interrupt handler at 0xFFFFFF00. No bits set indicates reset/power-up. Multiple bits set indicates multiple simultaneous exceptions.
13: DS, Delayed Switch - there is a 2-cycle latency between the first instruction in the stream with the dual-instruction mode bit set and the processor starting to execute two instructions per cycle, or between the first instruction with the bit clear and the cessation of dual-instruction mode. This bit is set when the first cycle of latency has passed, but not the second. Note that it is set for both switching to dual-instruction mode and away. The direction is given by the DIM bit.
14: DIM, Dual-Instruction Mode - set if the processor is in dual-instruction mode, executing an integer ("core") instruction and a floating-point one in a single cycle. It is only set when a trap occurs; it does not reflect the current state of the processor, but the one before the trap. The same goes for the DS bit.
15: KNF, Kill Next Floating - on trap return, if this bit is set, the next floating-point instruction is ignored (except for its dual-instruction bit). Useful when emulating a floating-point instruction that trapped in dual-instruction mode, when you want to retry the "core" integer instruction but not the fp one.
16: X - Unused. Undefined when read, write with 0 or saved value.
17..21: SC, Shift Count - remembers the shift used in the last SHR (shift right logical) instruction. Specifies the shift count for the SHRD (shift right double - extract 32 bits from 64) instruction. Equivalent to the 29000's FC register.
22..23: PS, Pixel Size - specifies the size of a pixel for graphics operations. 0 through 3 mean 8, 16, 32, or bit pixels.
24..31: PM, Pixel Mask - the pixel store instruction stores those pixels in the current 64-bit word specified by the low-order bits of this field. These bits can be set by various z-buffer instructions.

The PM, PS, SC, CC, and LCC fields can be set from user mode; writes to the other fields are ignored in user mode.

The epsr bits, lowest to highest, are:

0..7: Processor type - specifies the type of the current processor. 1 for the i860. (Hardwired, may not be changed even by supervisor.)
8..12: Stepping number - specifies the revision. (Also hardwired.)
13: IL, InterLock - set if a trap occurs in the middle of a lock/unlock sequence.
14: WP, Write Protect - if clear, supervisor-mode accesses ignore the write protect bit of a TLB entry. If set, even supervisor-mode writes are disallowed.
15..16: X, unused.
17: INT, INTerrupt - the value of the INT input pin. It looks like there is only one, and this bit is unqualified.
Writes are ignored.
18..21: DCS, Data Cache Size - the size of the data cache. 2^(DCS+12) bytes. Currently 1, meaning 8Kbytes. Hardwired.
22: PBM, Page-table Bit Mode - determines which of two bits in the page table entry is reflected on the PTB pin. If 0, the CD bit; if 1, the WT bit.
23: BE, Big-Endian - set if the processor is in big-endian mode. Causes the low 3 bits of the address bus to be complemented.
24: OF, OverFlow - set or cleared by the add and subtract instructions according to whether signed or unsigned (depending on the instruction) overflow occurs. There is an instruction analogous to the 68000's TRAPV which traps if this bit is set.
25..31: X, unused.

OF is user-writable; the other fields are only writable from supervisor mode.

The db (Data Breakpoint) register contains a byte address which is watched for accesses. If any access touches this byte and the corresponding bit in the psr is set, a data access trap occurs.

The dirbase (directory base) register points to the root of the page table tree. Standard two-level page table with 4K pages. The bits, lowest to highest, are:

0: ATE, Address Translation Enable - if set, address translation is enabled. You must flush the data cache before fiddling with this bit.
1..3: DPS, DRAM Page Size - the i860 has support for page-mode or static-column DRAMs. If two accesses differ only in the low-order 12+DPS bits, the NENE# pin is asserted. Zero is used for one bank of 256Kxn DRAMs.
4: BL, Bus Lock - echoed to the outside world on the LOCK# pin, after one cycle of latency. Controlled by the lock and unlock instructions. Copied to the IL bit of the epsr and cleared on a trap.
5: ITI, Instruction cache and TLB Invalidate - when a 1 is written, the instruction cache and page table cache are invalidated. Always reads as zero.
6: X, unused.
7: CS8, Code Size 8 bits - when set, instruction fetches are done from 8-bit-wide memory instead of 64. Used for bootstrapping from a ROM. Once cleared, it cannot be set again.
8..9: RB, Replacement Block - can control which block (set) of a cache is replaced on a miss. Used by the data-cache flush instruction, and for testing in conjunction with the next field. For the data and instruction caches, which are 2-way set-associative, only the low bit is used. For the TLB, both bits are used.
10..11: RC, Replacement Control - the i860 normally uses random replacement on all of its caches. For testing, you can replace this with a deterministic algorithm. 00 is normal, 01 causes all cache replacements to use the set specified in the RB field, 10 causes the data cache to obey the RB field, and 11 disables data cache replacement. I think this means hits can still occur, but no new information will be added to the cache.
12..31: DTB, Directory Table Base - this, with 12 low bits of 0, is the address of the first-level page table.

The fir (Fault Instruction Register) holds the (virtual, I think) address of the instruction that caused the trap. The first time it is read, it is unfrozen, and subsequent reads will just get the address of the load instruction.

The fsr contains all the floating-point flags:

0: FZ, Flush Zero - if set, underflow is flushed to zero instead of raising a result-exception trap.
1: TI, Trap Inexact - if set, inexact results cause a trap.
2..3: RM, Rounding Mode - 0 through 3 mean round towards nearest, -inf, +inf, and 0.
4: U, Update - always reads as zero; if set on a write of this register, bits 9 through 15 and 22 through 24 are written. If clear, the data written to them is ignored and they are unchanged.
5: FTE, Floating-point Trap Enable - if clear, floating-point traps are never reported. Used when mucking with the pipeline in various ways, and when sticking software-emulated values into the fp unit.
6: X, unused.
7: SI, Sticky Inexact - set whenever an inexact result is generated, regardless of the state of the TI bit. Cleared only by explicit write.
8: SE, Source Exception - set when one of the inputs to a FLOP is invalid (infinity, denormal, or NaN).
9: MU, Multiplier Underflow - on read, indicates the last multiply operation to come out of the pipeline underflowed. On write (note only written if the U bit is set), this forces the flag on the operation in the first stage of the multiply pipeline. When that operation reaches the end of the multiply pipeline, this is the MU bit that will come out. Used for reloading the pipeline.
10: MO, Multiplier Overflow - similar to above.
11: MI, Multiplier Inexact - similar to above.
12: MA, Multiplier Add-one - similar to above, but indicates that the multiplier rounded up instead of down. I'm not sure of the effects in the presence of sign bits.
13: AU, Adder Underflow - similar to MU.
14: AO, Adder Overflow - similar to MO.
15: AI, Adder Inexact - similar to MI.
16: AA, Adder Add-one - similar to MA.
17..21: RR, Result Register - holds the destination of a FLOP when an exception occurs due to a scalar op.
22..24: AE, Adder Exponent - holds the high 3 bits of the exponent coming out of the adder. Used to handle exceptions involving double-precision inputs and single-precision outputs properly.
25: X, unused.
26: LRP, Load pipe Result Precision - holds the precision of the value in the last stage of the load pipeline (more about this later); set on dp, clear on sp. It cannot be set by software (except by stuffing things into the pipe) and is provided to help save the state of the pipe.
27: IRP, Integer pipe Result Precision - holds the precision of the value in the graphics pipeline.
28: MRP, Multiplier Result Precision - holds the precision of the value in the last stage of the multiplier pipeline.
29: ARP, Adder Result Precision - holds the precision of the value in the last stage of the adder pipeline.
30..31: X, unused.

The KI and KR registers hold constant inputs to the multiplier for use in the multiply-accumulate instructions.
The T register is between the multiplier and the adder, again for multiply-accumulate instructions.

The MERGE register is used by graphics operations.

The bits in a page table entry (on either level) are:

0: P, Present - if clear, the entry is invalid and the high 31 bits are available to the programmer.
1: W, Writeable - if set, the page is writeable. This is always enforced in user mode, and enforcement in supervisor mode is controlled by the WP bit of the epsr. The effective value of this bit for any page table entry is the AND of the bits at the two levels of page tables.
2: U, User - if set, this page is accessible at user level. Must be set at both levels of page table for the page to be accessible.
3: WT, Write-Through - if set, this page will not be cached by the internal data cache. This bit can also be echoed externally, if the PBM bit of the epsr is set. This bit is only used on the second (lower) level of page tables. On the first level, it is reserved.
4: CD, Cache Disable - if set, this page will not be cached by either internal cache. This bit can also be echoed externally, if the PBM bit of the epsr is clear. This bit is only used on the second (lower) level of page tables. On the first level, it is reserved.
5: A, Accessed - only used in the second level of page tables, and is set whenever the page is loaded into the TLB.
6: D, Dirty - only used on the second level of page tables. If clear, and the page is being written to, a data access fault is generated. (I.e. it must be maintained by software.)
7..8: X, unused.
9..11: Reserved for use by the OS.
12..31: high-order bits of the physical address of the page or next level of page tables.

There is an external pin, KEN#, that must be asserted to enable caching of instructions and/or data. If not asserted, the data is not put into the cache.

The data cache must be explicitly flushed by software before you may change page tables.

>> Integer ("core") instructions <<

There are a few instruction formats.
The most common has, high to low, 6 bits of opcode (of which the low bit, bit 26, is usually clear), 5 bits of src2, 5 bits of dest, 5 bits of src1, and 11 bits of immediate offset. Call this format A.

Next most common has 6 bits of opcode (low bit usually set), 5 bits of src2, 5 bits of dest, and 16 bits of immediate constant (frequently ignored; set to 0). Call this format B.

A few instructions have the high 5 bits of a 16-bit immediate offset in the dest field (bits 16..20) rather than the src1 field. Store and a few branch instructions. Call this format C.

A few instructions are like the above, and also have a 5-bit immediate constant in the src1 field. Call this format D.

The largest branch offsets are handled by instructions having 6 bits of opcode and 26 bits of (signed) offset. Call this format E.

Floating-point instructions have 010010 in the high 6 bits, 5 bits of src2, 5 bits of dest, 5 bits of src1, 4 magic bits (P - pipeline, D - dual-instruction mode, S - source precision, and R - result precision), and 7 bits of opcode. Call this format F.

Some special operations have 010011 in the 6-bit opcode field, 5 bits of src1 in bits 11..15, and 5 bits of opcode in bits 0..4. The other bits are unused. Call this format G.

> Load

The load instruction (ld) has 3 variants: ld.b, ld.s, and ld.l. dest = mem[src1 + src2]. The opcode is 000L0I. I controls src1: 0 if it's a register (format A), 1 if it's a 16-bit signed offset (format B). L is 0 if it's a byte load, 1 if it's a 16 or 32-bit load. In the latter case, bit 0 of the instruction is stolen from the immediate offset and indicates 16 (0) or 32 (1) bit loads. This bit is considered to be 0 for the addition. Loads are always sign-extended. BTW, the Intel-suggested format is ld.x src1(src2), dest. No more mov dest, src nonsense!

There is 1 cycle of latency on loads, even if they hit the cache. This is interlocked.

> Store

Store is similar (st.b, st.s,
or st.l), but it must use a 16-bit immediate offset, and uses format C. mem[src2 + immediate] = src1. The opcode is 000L11. (Again, st.x src1, imm(src2).)

Note that r0 hardwired to zero comes in handy here. 32-bit absolute addresses must be formed by loading the high 16 bits into a register and taking an offset from there. Intel suggests r31 for this purpose. Because the offsets are signed, you may have to diddle the high 16 bits a bit to make things work properly.

> Move int to fp

ixfr src, fdest moves 32 bits (no format conversion) from an integer to an fp register. The opcode is 000010. There are two cycles of latency (interlocked) until the data appears in the destination. Format A.

000110 is reserved.

> Fp load

fld.y src1(src2), fdest and fld.y src1(src2)++, fdest are floating-point load instructions. They are similar to the integer load instructions, except the data sizes are .l (32 bits), .d (64), or .q (128 bits). The ++ autoincrement mode stores src1+src2 back into src2. Again, the low-order bits of the immediate offset (formats A or B) are used in the instruction. Bit 0 is set for autoincrement addressing, bit 1 is set for 32-bit loads, and bit 2 is set for 128-bit loads. Bits 1 and 2 are zero for 64-bit loads. The opcode is 00100I, I selecting format A or B.

For 64 or 128-bit loads, the destination register must be even or a multiple of 4, respectively.

> Fp store

fst.y fdest, src1(src2) and fst.y fdest, src1(src2)++ are similar. src1 can be a register here. The opcode is 00101I.

> Pipelined fp load

pfld.z fdest, src1(src2)[++], pipelined load, has the same addressing modes, except that it uses the 3-deep load pipeline (which operates independently of scalar loads; the two may be arbitrarily interleaved). The destination register specifies where to put the result of the 3rd previous pfld instruction. 128-bit pipelined loads are not allowed. Pipelined accesses do not place the data in the cache, although they do handle cache hits properly.
The opcode is 01100I, I selecting format A or B.

> Pixel store

pst.d freg, imm(src2)[++] stores the pixels selected by the PM field of the psr from the 64-bit freg into memory. If you have 16-bit pixels and the low bits of the PM field are 0110, only the middle 4 byte strobes will be asserted on write. Bit 0 corresponds to the lowest address. See the fzchks and fzchkl instructions for uses for this. The opcode is 001111, and it uses format B. After the store, the PM field is shifted right by the number of bits used, so the next pst instruction will have access to the next bits.

> Add, Subtract, Compare

The add/subtract instructions also double as compare instructions. There's addu src1, src2, dest, adds, subu, and subs. They do the obvious things, also setting the OF flag on overflow, and setting the CC flag as follows:

addu: CC gets the carry from bit 31.
adds: CC gets (src2 < -src1)
subu: CC gets the carry from bit 31 (src2 <= src1)
subs: CC gets (src2 > src1)

This uses formats A and B. If the 16-bit immediate is used, it is sign-extended. To get the one's complement, use subs -1, src2, dest.

The opcode is 100UAI, where U is 0 for unsigned, 1 for signed; A is 0 for add, 1 for subtract, and I selects format A or B. (0 for A, 1 for B.)

> Shift, Rotate

The shift instructions are shl, shr, shra, and shrd. The first three do the obvious things, shr zero-filling high-order bits and shra sign-extending. shr *only* copies its src1 (register or 16-bit immediate, with the high 11 bits ignored) to the SC field of the PSR. shrd uses this to compute dest = (((src1 << 32) + src2) >> SC) & 0xFFFFFFFF.

The opcodes are:

10100I for shl, I selects format A or B.
10101I for shr, I selects format A or B.
10111I for shra, I selects format A or B.
101100 for shrd, format A only.
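As a quick sanity check on the shr/shrd pairing, here is a minimal Python sketch (not i860 code). The class name ShiftUnit and its methods are invented for illustration; it models only the SC field and the 64-to-32-bit extraction, on the assumption that shr latches the low 5 bits of its count into SC as described above.

```python
MASK32 = 0xFFFFFFFF


class ShiftUnit:
    """Toy model of the psr SC field shared by shr and shrd (hypothetical names)."""

    def __init__(self):
        self.sc = 0  # 5-bit shift count remembered from the last shr

    def shr(self, count, src2):
        # Only the low 5 bits of the count matter; shr also latches them in SC.
        self.sc = count & 0x1F
        return (src2 & MASK32) >> self.sc

    def shrd(self, src1, src2):
        # Treat src1:src2 as one 64-bit value, shift right by SC, keep 32 bits.
        wide = ((src1 & MASK32) << 32) | (src2 & MASK32)
        return (wide >> self.sc) & MASK32


unit = ShiftUnit()
unit.shr(8, 0)  # plays the role of "shr 8, r0, r0": only SC is of interest
rotated = unit.shrd(0x12345678, 0x12345678)
print(hex(rotated))
```

Feeding the same word in as both halves turns the extraction into a rotate right by SC; with two different words it pulls an unaligned 32-bit field out of a 64-bit pair.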
None of the shifts set the condition code bit, so the assembler uses the following macros:

    mov src2, dest == shl r0, src2, dest
    nop            == shl r0, r0, r0
    fnop           == shrd r0, r0, r0

To do a rotate, use shr count, r0, r0 followed by shrd src, src, dest.

> Trap

There is a trap instruction, which uses format B, although the source operands are not interpreted. The destination register is "undefined," so it's a good idea to use register 0. The opcode is 010001. This causes an IT trap (see the psr). The source bits can be used for whatever.

> And, Or, Xor, Andnot

There are 4 logical instructions: and, or, xor, and andnot. They do the obvious things: dest = src1 & src2, dest = src1 | src2, dest = src1 ^ src2, dest = ~src1 & src2.

The opcodes are of the form 11OPHI, where OP specifies one of {and, andnot, or, xor}, I specifies A or B format, and H can be set in B format to indicate that the immediate constant should be shifted up 16 bits before use. Thus, to load the high 16 bits of a register, orh immediate, r0, dest will do the trick. (So will xorh.) H bit set and I bit clear is reserved. 16-bit immediate values are zero-extended, if used. The CC flag is set if the result is zero, otherwise it is cleared. The opcodes with the H bit set are andh, andnoth, orh, and xorh.

> Control register modification

ld.c and st.c are used to read and modify control registers. Format A is used, although only src2 and one of src1 (for st.c) or dest (for ld.c) are interpreted. The opcode is 0011L0, where L is 0 for load (dest = special[src2]) and 1 for store (special[src2] = src1). The src2 field holds 0 through 5 for the fir, psr, dirbase, db, fsr, and epsr registers. These instructions are legal in user mode, although many writes will be ignored.

> Branches

Most of the branch instructions use format E, taking a 26-bit offset. I assume the offset is a word offset (in case I forgot to mention it, instructions must be 32-bit word-aligned), but I can't find it explicitly stated.
br (opcode 011010) is a straight branch, with 1 delay slot.

> Call

call (opcode 011011) is similar, but also puts a return address in register r1.

bc and bnc branch if the CC flag is set or clear, respectively. They come in non-delayed (bc, bnc) and delayed (bc.t, bnc.t) versions. The opcodes are 01110T and 01111T, respectively. T is set in the .t (delayed) forms.

bri is an indirect branch, delayed, using opcode 010000 and format A, I think, although only src1 is used. bri [src1] branches to the address specified in src1. The low two bits of src1 are ignored. If any of the trap bits are set when this instruction is executed (see the psr), this also performs an interrupt return, clearing the trap bits, copying PU to U and PIM to IM, and doing strange things with DS and DIM.

> Loop

There's also bla, a looping-type instruction. It's a bit weird. First of all, if the LCC flag is set, it does a delayed branch with a 16-bit offset; then it computes "adds src1, src2, src2" and sets the LCC flag to what the complement of the CC flag would be for a real adds. This uses format C, with opcode 101101.

Intel gives the example of clearing an array of 16 single-precision numbers to zero, starting at the address in r4:

        adds    -1, r0, r5          // r5 holds loop increment
        or      15, r0, r6          // r6 holds loop count
        bla     r5, r6, CLEAR_LOOP  // clear LCC; it doesn't matter
                                    // if we jump or not
        addu    -4, r4, r4          // compensate for preincrement (delay slot)
CLEAR_LOOP:
        bla     r5, r6, CLEAR_LOOP
        fst.l   f0, 4(r4)++         // delay slot

I've never seen a looping instruction quite like it. Be careful not to trash LCC during the loop (shades of jcxz!).

Other core instructions:

> Compare-and-branch

bte src1, src2, offset and btne src1, src2, offset branch (no delay) if src1 == src2 or src1 != src2, respectively. They have opcode 0101EI, where E = 1 branches on equal, and I selects format C or D. I.e. src1 can be an immediate value, but only in the range 0..31.

> Flush

flush - flush the data cache.
"In user mode, execution of flush is suppressed," whatever that means. What it seems to do is force a fake load, one that fills the data cache with garbage. "When flushing the cache before a task switch, the addresses used by the flush instruction should reference non-user-accessible memory to ensure [Will wonders never cease? A book written in the U.S. actually got ensure/insure straight!] that cached data from the old task is [oh, well... can't win them all] not transferred to the new task. These addresses must be valid and writeable in both the old and the new task's space."

The sample code reserves a 4K hunk of memory, and does this:

// Rw, Rx, Ry, and Rz are registers
// FLUSH_P_H and FLUSH_P_L are two halves of the address of the 4K hunk,
// less 32.

        ld.c    dirbase, Rz         // assuming RB and RC fields clear
        or      0x800, Rz, Rz       // Set RC field to 2 (obey RB for data cache)
        adds    -1, r0, Rx          // Loop increment
        call    D_FLUSH
        st.c    Rz, dirbase         // Store new RC field (in delay slot of call!)
        or      0x900, Rz, Rz       // Set RB field to 1 (was assumed 0)
        call    D_FLUSH
        st.c    Rz, dirbase         // Store new RB field (in delay slot of call!)
        xor     0x900, Rz, Rz       // Clear RB and RC fields
        // Pound on DTB, ATE, or ITI fields here
        st.c    Rz, dirbase         // Store cleared values
        // continue...

D_FLUSH:
        orh     FLUSH_P_H, r0, Rw
        or      FLUSH_P_L, Rw, Rw   // Rw gets address of flush area
        or      127, r0, Ry         // loop counter
        bla     Rx, Ry, D_FLUSH_LOOP // set up LCC
        ld.l    32(Rw), r0          // clear pending bus writes
D_FLUSH_LOOP:
        bla     Rx, Ry, D_FLUSH_LOOP // Loop
        flush   32(Rw)++            // Hit every 32 bytes (cache line size)
        bri     r1                  // Return - branch to (r1)
        ld.l    -512(Rw), r0        // Load from flush area to clear pending
                                    // writes (guaranteed cache hit).

I don't quite understand the bit about clearing pending writes. I guess it puts off address translation until the last possible moment (the write queue uses virtual addresses), and a load to r0 is an idiom which always generates an interlock.

The flush instruction uses opcode 001101; format B.
Bit 0 of the immediate field selects autoincrement mode.

That's everything in formats A through E; now for format G. (High 6 bits opcode = 010011, low 5 bits give secondary opcode; only one 5-bit register field defined.) The defined operations are:

calli: opcode 00010, performs an indirect (delayed) call via the address specified in the register operand. I don't know if it reads the source register before or after storing the return address in r1. Could be a way to play with coroutines.

intovr: opcode 00100, traps if the OF flag in the epsr is set. Cf. trapv.

> Lock

lock: opcode 00001. This is interesting. This begins an interlocked sequence on the next data access that misses the cache, setting the BL bit. Interrupts are disabled and the bus is locked until explicitly unlocked. The sequence must be restartable from the lock instruction in case a trap occurs. If there is more than one store, you must ensure there are no traps after the first non-idempotent store. I.e. keep the code on one page and make sure all the data addresses are valid.

There is a similar unlock instruction (opcode 00111), that unlocks the bus on the first data access that misses the cache after it.

These instructions *are* executable from user mode, but there is a 32-instruction counter that traps if you spend too long with the bus locked. I like those instructions. An RTOS might like to be able to set the timeout, but 32 instructions is a reasonable value.

Now, for the interesting part:

>> Floating-Point <<

These are all in the F format, with a 010010 opcode in the high 6 bits, then 5 bits each of src2, dest, and src1, then 4 magic bits, then 7 bits of fp opcode.

Two of the magic bits control the source and destination precisions. S=0 for single and S=1 for double sources. R=0 for single and R=1 for double results.

> Pipelines

Here comes time to explain the pipeline concepts used by the 80860.

There are 4 pipelines on the i860: multiplier, adder, graphics unit, and floating-point loads.
These are 2/3, 3, 1, and 3 stages deep. The multiplier is 2 stages deep for double-precision sources and 3 stages [sic] for single. The destination format is unimportant. Changing the FZ (flush zero), RM (rounding mode) and/or RR (result register) bits of the fsr while there are results in the adder or multiplier pipelines is a bad idea.

One of the magic bits in each fp instruction is the P, pipeline bit. If this bit is clear, the operation goes straight through the floating-point unit. Any results in the pipeline are lost, but the result is available by the next instruction. This is *not* the next cycle, but it's scoreboarded. (This doesn't apply to the load pipeline, which is not used by scalar load instructions.)

If the pipeline bit is set, though, then the specified dest is for the result at the end of the pipeline and the requested operation goes in the front. The store is completed before the load of the source operands. (At least conceptually.) So initially, you must stick a few operations into the pipeline, throwing away whatever was there (writing it to f0), then you can pump through lots of data, then you have to stick in a few junk computations to get the last few results.

The load pipeline, the pfld instruction, is the most straightforward, and works as described above.

On the multiply pipeline, things get odd if you switch source precisions with the pipeline half-full. If you started out in double (2-stage) mode with B and A in the pipeline (A one stage from completion, B two), and added single-precision computation C, you'd store A and end up with C, B and 0.0 in the pipeline. If you started out with C, B, A, and added double-precision computation D, you'd end up with A stored and D, C in the pipeline. B would get lost.

Both inputs to an operation must be of the same precision.
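The prime/pump/drain pattern is easier to see in a toy model. The following Python sketch is only an illustration under simplifying assumptions: a fixed-depth pipeline, no precision switching, and None standing in for whatever junk was in the stages to begin with. The names Pipeline, issue, and pipelined_add are made up here and are not Intel's.

```python
from collections import deque


class Pipeline:
    """Toy fixed-depth pipeline: each pipelined op retires the oldest result."""

    def __init__(self, depth):
        # Stages start out holding junk; None stands in for "whatever was there".
        self.stages = deque([None] * depth)

    def issue(self, value):
        # The result leaving the last stage is what gets written to dest;
        # the new operation's result enters the first stage.
        retired = self.stages.pop()
        self.stages.appendleft(value)
        return retired


def pipelined_add(pairs, depth=3):
    pipe = Pipeline(depth)
    # Real work: the first `depth` results coming out are junk (think "dest = f0").
    outputs = [pipe.issue(a + b) for a, b in pairs]
    # Junk operations pushed in only to drain the last real results.
    outputs += [pipe.issue(0.0) for _ in range(depth)]
    return outputs[depth:]


sums = pipelined_add([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)])
print(sums)
```

Each call to issue corresponds to one pipelined instruction: the value it returns is what lands in that instruction's dest, which is why the result of an operation shows up `depth` instructions after the one that started it.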
There are odd, not fully explained problems with taking double source operands and returning a single result, so the precision suffixes on floating-point operations should generally be restricted to .ss, .sd, and .dd.

> Fmul, Fadd, Fsub

Anyway, here's a list of the simple floating-point operations:

[p]fmul.p src1, src2, dest (opcode 0100000)
[p]fadd.p src1, src2, dest (opcode 0110000)
[p]fsub.p src1, src2, dest (opcode 0110001) // result = src1 - src2

The fadd or pfadd instruction may have a .ds precision suffix, as long as one of the sources is f0. This is used for format conversion. The [p]fadd instructions are used in the [p]fmov macros.

> Float to integer

[p]ftrunc.p src1, dest (opcode 0111010)

The result of this operation is 64 bits, whose low 32 bits are the integer (truncated) part of the floating-point src1. It uses the adder.

[p]fix.p src1, dest (opcode 0110010)

Same as above, but the integer part is rounded. For both of these, the integer is two's complement, signed.

pfmul3.dd src1, src2, dest (opcode 0100100)

This forces a dp multiply to use the 3-stage pipeline. It's only intended for reloading a pipeline.

> Multiply (integer)

fmlow.dd src1, src2, dest (opcode 0100001)

This multiplies only the low-order bits of its operands. dest gets the low-order 53 bits of the product of the significands of src1 and src2. Bit 53 of dest gets the MSB of the product. This instruction cannot be used in pipelined mode, does not affect the result-status bits in the fsr, and does not cause any traps.

> Divide, Reciprocal

frcp.p src2, dest (opcode 0100010)

dest = 1/src2, approximately. Absolute significand error < 2^-7. src1 must be zero. Use as a starting point for Newton-Raphson. This instruction may not be pipelined. It causes a source-exception trap if src2 is zero. It uses the multiplier.

> Square root

frsqr.p src2, dest (opcode 0100011)

As above, but dest = 1/sqrt(src2), approximately, and it also traps if src2 < 0.
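To show why a seed good to about 7 bits is enough, here is a small Python sketch of Newton-Raphson reciprocal refinement. It is not Intel's sample code: rough_reciprocal is a hypothetical stand-in for frcp that simply quantises the true reciprocal, and the step count of 3 is an assumption based on the error roughly squaring each step (2^-7, 2^-14, 2^-28, 2^-56).

```python
import math


def rough_reciprocal(b):
    """Stand-in for frcp: 1/b with the significand good to about 2**-7.

    Hypothetical helper; it quantises the true reciprocal instead of
    modelling whatever lookup the hardware actually uses.
    """
    m, e = math.frexp(b)              # b = m * 2**e, 0.5 <= |m| < 1
    r = round((1.0 / m) * 128) / 128  # keep about 7 fractional bits
    return math.ldexp(r, -e)


def reciprocal(b, steps=3):
    x = rough_reciprocal(b)
    for _ in range(steps):
        # Newton-Raphson for f(x) = 1/x - b; the error roughly squares each step.
        x = x * (2.0 - b * x)
    return x


def divide(a, b):
    # Division as multiplication by the refined reciprocal.
    return a * reciprocal(b)


print(reciprocal(3.0), divide(355.0, 113.0))
```

Each refinement step costs two multiplies and a subtract, all of which the multiplier and adder can do, which is presumably why the chip gets away without a divide instruction. As with frcp itself, a zero divisor is not handled here.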

> Fcmp

pfgt.p src1, src2, dest (opcode 0110100, R bit clear)
pfle.p src1, src2, dest (opcode 0110100, R bit set)
pfeq.p src1, src2, dest (opcode 0110101)

These instructions perform floating-point comparison using the adder. They begin with "p" because they advance the pipeline one stage (the value they insert is undefined, but not an error), but they place the result of the comparison (src1 > src2, src1 <= src2, src1 = src2) in the CC bit immediately. There is no pipeline delay. (Actually, there is one cycle of latency, but it's scoreboarded.) They do trap on invalid inputs.

> Multiply-accumulate

The following instructions are called dual-operation instructions, since they use both the adder and multiplier. Not to be confused with dual-instruction mode. Combining both of these gives the claimed 150 MOPS.

pfam.p src1, src2, dest (opcode 000xxxx)
pfmam.p src1, src2, dest (opcode 000xxxx)
pfsm.p src1, src2, dest (opcode 001xxxx)
pfmsm.p src1, src2, dest (opcode 001xxxx)

These instructions are really complex families of instructions. They perform variations on multiply-accumulate. The xxxx is the DPC (Data-Path Control) field. The precision specifies the input and output precisions of the multiplier; the adder takes inputs and outputs of the destination precision.

Here is where the KI, KR, and T registers come in. The possible data flows are complex, but:

The value written into dest can be the result of either the adder or multiplier pipeline.

The multiplier's src1 can be the instruction's src1, KI, or KR. If it is one of the K registers, the instruction's src1 can be copied into it prior to use, or you can use its current value.

The multiplier's src2 can be the given src2 or the value written into dest.

The multiplier's result can be written into the T register as well as sent to the destination register.

The adder's src1 can be the instruction's src1 (if the multiplier hasn't usurped it), the value written into dest (again, if nobody else has it), or the value in the T register (which can be whatever it used to be or the value written by the multiplier).

The adder's src2 can be the result of the multiplier, the value written into the dest register, or the given src2 (assuming the multiplier hasn't stolen it).

When you add the fact that the adder can compute src1+src2 or src1-src2, you have a total of 64 possibilities. A bit in the opcode specifies whether the adder adds or subtracts, and the P bit is used to specify which output goes to the dest register (0 = adder, 1 = multiplier (and the adder's result is thrown away)).

After this factoring, there are 16 cases; 8 can be represented by the DPC field values 0XYZ, where:

X controls whether "K" means KR (X=0) or KI (X=1),

Y controls whether the adder's src2 is the result of the multiplier (Y=0) or the result of the multiplier goes into T and the adder's src2 is the result that gets written into the dest register (Y=1), and

Z controls whether the instruction's src1 goes to the adder's src1 (Z=0) or the instruction's src1 goes to K (and thence to the multiplier) and the adder's src1 comes from T (which may have come from the multiplier).

DPC values of the form 1XY0 cover cases where the multiplier's inputs are K and the result written to dest (K is controlled by X, as above) and the adder's inputs are the instruction's src1 and src2. Y controls whether T is loaded with the result of the multiplier (Y=0) or not (Y=1).

DPC values of the form 1XY1 cover cases where the multiplier's inputs are the instruction's src1 and src2. If X is 1, then the adder's src1 is T (which is not loaded from the multiplier's result) and Y controls whether the adder's src2 is the multiplier's result or the value written to the dest register. (Note that these may be the same value.)
If X is 0, then the adder's second input is the result of the multiplier (which is not written into T), and its first input is controlled by Y. If 0, it's the value written into the dest register; if 1, it's the T register.

Are you suitably confused? Pictures do help somewhat. Intel supplies transliteration rules for producing mnemonics from these various connections, but I won't go into them here.

Scoreboard alert: when the multiplier's src1 is the instruction's src1, this must not be the same as rdest. Something screws up.

>> Graphics operations <<

These also use the fp instruction encoding and register set. But they use a separate graphics pipeline which is only one stage deep - i.e. when you start one instruction, you get the result of the previous one out. As with the floating-point instructions, most have pipelined and non-pipelined versions, which behave analogously. (The graphics operations use fp opcodes 1xxxxxx; I've already covered everything of the form 0xxxxxx.)

> Long long

The basic ones are long-integer operations:

[p]fiadd.w src1, src2, dest (opcode 1001001)

.w is .ss or .dd for 32 or 64-bit adds. The CC is not set, and no traps are signalled.

[p]fisub.w src1, src2, dest (opcode 1001101)

dest = src1-src2

There are move macros that use these instructions with f0.

> Z-buffer

[p]fzchks src1, src2, dest (opcode 1011111)
[p]fzchkl src1, src2, dest (opcode 1011011)

These instructions do z-buffer operations. The short form takes the sources as 4 fields of 16 bits each, and does 4 simultaneous compares, with the results written to the PM (Pixel Mask) field of the psr. In fact, what happens is that the PM is shifted right 4 bits and the most significant 4 bits are set with the results of (src2 <= src1), for each of the 4 fields. The value produced by the operation is the result of 4 parallel minimum operations, i.e. the updated z-buffer.
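
The short-form check described above can be modelled roughly as follows in Python. Several details are assumptions made for the sketch and are not stated in the text: z values are compared as unsigned, PM is 8 bits wide, and field i (counting from the least significant end) lands in PM bit 4+i. The names `pack16` and `fzchks` are just labels for the model.

```python
def pack16(fields):
    """Pack four 16-bit values into a 64-bit int, fields[0] least significant."""
    return sum((v & 0xFFFF) << (16 * i) for i, v in enumerate(fields))


def fzchks(src1, src2, pm):
    """Model of fzchks; returns (dest, new_pm).

    src1, src2: 64-bit ints, each holding four unsigned 16-bit z values.
    pm: current pixel mask (assumed 8 bits wide).
    """
    dest = 0
    bits = 0
    for i in range(4):
        a = (src1 >> (16 * i)) & 0xFFFF
        b = (src2 >> (16 * i)) & 0xFFFF
        if b <= a:                       # the (src2 <= src1) test from the text
            bits |= 1 << i
        dest |= min(a, b) << (16 * i)    # parallel minimum: the updated z values
    new_pm = ((pm >> 4) | (bits << 4)) & 0xFF   # old mask shifts down, new bits on top
    return dest, new_pm
```

Two back-to-back calls leave eight fresh compare results in PM, with the earlier four in the low half.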

The long form, [p]fzchkl, does the same, except it uses 2 32-bit wide fields, shifts PM by 2 bits, and updates the high 2 bits. The shift allows you to rapidly compute 8 bits worth of z-buffer values. The size of the z-buffer is independent of the pixel size set in the PS field of the psr.

> Phong shading

[p]faddp src1, src2, rdest (opcode 1010000)

This instruction does pixel interpolation into the MERGE register. I don't quite understand how this instruction is useful, but it does something unusual.

Assume 8-bit pixels specified in the PS field of the psr. faddp takes src1 and src2 as consisting of 4 16-bit words, adds each field together, and writes the high bytes of each word (if you consider the words to be fixed-point 8.8 bit numbers, it writes the integer parts) to the MERGE register. The MERGE register has been shifted down 8 bits at the same time, so two of these instructions will fill it with pixel values.

If the pixels are 16 bits wide, it will do the same, except the fields are considered to be 6.10 bit fixed-point numbers, with the high 6 bits loaded into the MERGE register, which has been shifted down 6 bits. (After two shifts, two bits won't fit and get truncated from one of the fields - thus the 6/6/4 RGB format you see flying around. This is the only place it appears.)

If the pixels are 32 bits wide, the fields are taken to be 32 bits wide, with the high bytes of each of the two copied to the MERGE register, which has been shifted down 8 bits.

There is also a similar [p]faddz instruction (opcode 1010001), which does the same thing with 16.16 bit fields, shifting the MERGE register 16 bits at a time. Intel seems to be really keen on this sort of operation. I wish I knew what it was good for.

You can do the same thing with 32.32 bit fields, by doing two long adds on the corresponding parts of src1 and src2, then using a single-precision move to copy the destination parts into a register pair.
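
A rough Python model of the 8-bit-pixel faddp case may make the MERGE shuffling easier to follow. The exact byte placement is an assumption of this sketch: it supposes each integer part lands in the high byte of its own 16-bit lane after MERGE has been shifted right by 8, which is one reading of the description above. The per-field add follows the text's wording. `pack16` and `faddp8` are hypothetical names.

```python
HIGH_BYTES = 0xFF00FF00FF00FF00   # high byte of each 16-bit lane
LOW_BYTES = 0x00FF00FF00FF00FF    # low byte of each 16-bit lane


def pack16(fields):
    """Pack four 16-bit values into a 64-bit int, fields[0] least significant."""
    return sum((v & 0xFFFF) << (16 * i) for i, v in enumerate(fields))


def faddp8(src1, src2, merge):
    """Model of faddp with 8-bit pixels; returns (dest, new_merge)."""
    dest = 0
    for i in range(4):
        a = (src1 >> (16 * i)) & 0xFFFF
        b = (src2 >> (16 * i)) & 0xFFFF
        dest |= ((a + b) & 0xFFFF) << (16 * i)   # 8.8 fixed-point add per field
    # Shift the old pixels down a byte, then drop the new integer parts
    # into the high byte of each lane (placement assumed, see lead-in).
    new_merge = ((merge >> 8) & LOW_BYTES) | (dest & HIGH_BYTES)
    return dest, new_merge
```

In this model, feeding dest back in as src1 with a constant step in src2 walks an interpolation along a scan line, and after two calls all eight bytes of MERGE hold pixel values.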

[p]form src1, dest (opcode 1011010)

dest = src1 | MERGE
MERGE = 0

This instruction lets you read the MERGE register after you've pounded on it a while, setting any last bits you need to tweak and clearing it for future action.

> Move fp to int

fxfr src1, dest (opcode 1000000)

This moves single-precision floating-point register src1 to integer register dest. The opposite of ixfr. [These mnemonics aren't very mnemonic.]

>> Dual-Instruction Mode <<

One of the magic bits in each fp instruction is the D, dual-instruction bit. Intel suggests using either a d. prefix to the mnemonic or assembler directives .dual and .enddual.

If the processor comes across an instruction (which must be aligned on a 64-bit boundary) with the D bit set, then it executes the next instruction (integer ("core") op or fp op with D bit set) and starts reading instructions 64 bits at a time. The low-order instruction must be an FP op, and the high-order must be an integer ("core") op. Exception: the fnop (lsrd r0, r0, r0) instruction is allowed in the fp slot. Both these instructions are executed simultaneously.

To get out of dual-instruction mode, have an fp op (FLOP) without the D bit set. This pair, and the next, will still be executed in dual-instruction mode, but after that you're back to single.

A degenerate case is a single FLOP in a stream with the D bit set, followed by one with it clear. The next two instructions will be executed as a pair, and then back to single mode.

Executing two instructions at once requires some extra rules:

- If a branch on CC is paired with a floating-point compare, the branch tests CC before the compare sets it.
- If an ixfr, fld, or pfld instruction is paired with a FLOP, the FLOP gets the register value before the other instruction updates it (or marks it as pending in the scoreboard, really).
- If an fst or pst operation stores a register which is written to by the instruction it's paired with, the new value is stored to memory.
- An fxfr instruction that conflicts with a source operand in the core operation paired with it will store after the core op has read the register. "The destination of the core operation will not be updated if it is any if the integer register. Likewise, if the core instruction uses autoincrement addressing, the index register will not be updated." Typo? I think this means the fxfr steals the write bus from the core processor, and the core processor's write goes to the bit bucket.
- If both instructions set the CC, the FLOP will win.
- If the FLOP is scalar and the core operation is fst or pst, it should not store the result of the FLOP. When the core op is pst, the FLOP must not be [p]fzchks or [p]fzchkl. Conflict over the PM field, y'know.
- When the core op is ld.c or st.c (diddles control registers), it must be paired with fnop.
- You cannot use the return-from-interrupt functionality of bri in dual-instruction mode.
- A FLOP which sets CC cannot be paired with a compare-and-branch core instruction. I.e. pfeq and pfgt conflict with b[n]c.t. b[n]c.t also conflicts with a pfeq or pfgt instruction in the next pair.
- "When the FLOP is fxfr, the core operation cannot be ld, ld.c, st, st.c, call, ixfr, or any instruction that updates an integer register (including autoincrement indexing)."
- You can't start to exit from dual-instruction mode on an instruction paired with a control-transfer instruction. I.e. if the FLOP before had D set, so must the FLOP paired with the branch.
- You can't start to switch to or from dual-instruction mode on the instruction following a bri (in its delay slot).

Enough rules? Well, you should have known it was gonna be a bit ugly.

>> Traps, Interrupts, Exceptions, etc. <<

As I mentioned, this is not well done.
When a trap occurs, bits are set in the psr (and maybe fsr, if the FT bit in the psr is set) to indicate contributing factors, and then the U and IM bits are copied to the PU and PIM bits, then cleared (disabling interrupts and switching to supervisor mode), the DIM and DS flags are set as needed, and the fir is set up. (In dual-instruction mode, the fir will point to the FLOP in the low-order half of the pair. If the problem was just a data-access fault, the FLOP (unless it was fxfr) completes, and you should not reexecute it on interrupt return. Instruction and data-access faults are always the fault of the core instruction.)

After this setup, the processor jumps to virtual address 0xFFFFFF00. Then you have to figure out what's going on and fix it.

The state of the processor consists of:

- The register files
- The four pipelines
- The KI, KR, and T registers
- The MERGE register
- The psr, epsr, and fsr
- The fir, and
- The dirbase register (with its dependencies on the data cache)

A simple interrupt return consists of:

- Restoring the register files, pipelines, KI, KR, T, and MERGE registers (not necessary for simple interrupt handlers), except for one register which holds the return address from the fir.
- Undoing the effect of an autoincrement instruction which must be reexecuted (parse the instruction at [fir] to figure this out)
- Seeing if you need to back up the return address by one instruction
- Setting up the psr, possibly setting the KNF bit, and definitely setting at least one trap bit.
- Executing an indirect branch (bri) to do the interrupt return, and in its delay slot,
- Restoring the register that holds the resumption address. The processor is still in supervisor mode here, so you don't need to pollute the user's address space.

> Backing up the return address

If the instruction before the one pointed to by the fir is a delayed branch, you should back up and re-execute it. If it is a bla, you need to undo its add instruction.
There is an exception to this where you bombed out on a floating-point compare instruction you need to emulate and the instruction before is a conditional delayed branch. Here, you need to leave the CC alone so the branch will do the right thing, and set it so the fp compare will seem to have done the right thing. You need to compute where the conditional branch would put you and resume there.

If you are backing up, and in dual-instruction mode, you should set things up (DS set, DIM clear) so the core instruction will be executed in single-instruction mode, then DIM will be re-entered. If DS was originally set, clear it.

Plus, you have to worry about the case that the instruction at fir-4 might not exist. Intel suggests that you begin each code segment with a nop instruction to avoid this problem.

> Setting KNF

KNF should be set if you have emulated a floating-point instruction that trapped, or if you got only a data-access fault in dual-instruction mode and the FLOP was not fxfr. [Is that perfectly clear?]

> Saving the pipeline

Doing this is messy. Basically, you need to read out all the results (and the associated error codes for the adder and multiplier pipelines) to store them, and then push operations with the equivalent answers back on restore.

For the load pipeline, store the values read in memory somewhere and reload it from there afterwards.

For the graphics pipeline, you can just read it with a pfiadd, and restore it the same way (add 0 to the recalled value). The MERGE register also needs to be stored.

For the floating-point pipelines, you need to get all the values out, including error conditions, and the KI, KR, and T registers. To put them back, first stuff the KR, KI, and T registers, then place value+0 and value*1 computations into the various pipelines, along with the proper error bits. There's sample code to do this in the data book, and it's not particularly pretty.

>> Calling Conventions <<

Intel has a suggested calling convention.
Although the border is still fuzzy, the manual suggests r0-r15 and f0-f15 as callee-saves, and the other half as caller-saves. r1 is the return address, r2 is the stack pointer, and r3 is the frame pointer. Parameters are passed in r16 through r27 and f16 through f27, and the others are used for scratch. r31 is reserved for address computations.

They suggest that even single-precision float arguments be passed in a register pair, and anything that won't fit into registers be passed on the stack C-style. The stack pointer should always be 16-byte aligned so the 128-bit loads can be used easily.

> Memory map

They also suggest a memory map. It starts with 4K of unreadable memory (NULL-catcher), then user data, and heap. Then empty space until you hit the stack, then shared-memory frames, and OS data, topping out at 0xF0000000. Then comes a jump table to standard library routines until 0xF0400000, then user code (text), blank space, and then the OS up at the top of memory.

>> Sample code <<

The manual gives a bunch of sample code. I won't reproduce it, but will list what's there:

- Sign-extending a value in a register (shl, shra)
- Loading unsigned integers (ld, and)
- Single-precision FP divide (approximate, two iterations Newton-Raphson unpipelined, 22 cycles, 2 ulp worst-case error)
- DP fp divide (three iterations Newton-Raphson, also 2 ulp, 38 cycles)
- Integer multiply (move to fp, use fmlow, move back; 9 clocks, five of which can be overlapped)
- Signed int to double (7 cycles; 3 can be overlapped)
- Signed integer divide (62 cycles, 59 without remainder)
- Null-terminated string copy (byte-at-a-time, simple)
- Example of pipelined adds
- Example of pipelined multiply-accumulate
- Example of dual-instruction mode
- Cache strategies for matrix dot product (e.g.
  keep both matrices in cache; keep one and use pipelined loads on the other)

>> Pipeline Interlocks <<

Everything's single-cycle, but here's what can interlock:

- i-cache miss: given in terms of pin timing, plus two cycles if d-cache miss in progress simultaneously.
- d-cache miss (on load): again, pin timing, but it seems to be "clocks from ADS asserted to READY# asserted"
- fld miss: d-cache miss plus one clock
- call, calli, ixfr, fxfr, ld, ld.c, st, st.c, pfld, fld or fst with data cache miss in progress - stalls until miss satisfied, plus one cycle
- ld, call, calli, fxfr and ld.c have 1 cycle of latency (next instruction will stall if scoreboard hits)
- fld, pfld and ixfr have 2 cycles of latency.
- addu, adds, subu, subs, pfeq, pfgt, and pfle have 1 cycle of latency to update the CC bit. A branch on that bit will stall.
- The multiplier's src1 must be in the register file; if it is the result of the previous instr, you get a 1-cycle stall.
- Scalar FLOPS fadd, fix, fmlow, fmul.ss, fmul.sd, ftrunc and fsub have 3 cycles of latency. fmul.dd has four. If the input and output precisions differ (e.g. fmul.sd), add one cycle. Plus one if the following FLOP is pipelined and has dest <> f0.
- TLB miss takes 5 cycles plus two reads, plus setting the A bit (if necessary).
- if three pfld's are outstanding and you execute one more, you will stall until the first completes, plus one cycle
- a pfld data-cache hit costs two clocks
- if the store pipe is full (one on bus plus two pending internally), another access will delay until the current access completes, plus one cycle
- a load (or fld) following a store cache hit - one clock
- delayed branch not taken - costs one clock
- nondelayed branch taken - one clock for bc, bnc; two for bte, btne.
- bri - one clock
- st.c - two clocks
- there is no forwarding from the graphics unit to the adder, multiplier, or itself, so there is one cycle of latency there
- a flush has two cycles of latency
- an fst takes one cycle to get the value out of the register, so if the next instruction overwrites the register being stored, it will stall

>> The End <<

And that, boys and girls, is basically the complete contents of the programmer's reference manual. Enjoy!

(52K ug... let's see if we can bomb any mailers!)
--
	-Colin (uunet!microsoft!w-colinp)

"Don't listen to me. I never do." - The Doctor