Path: utzoo!attcan!utgpu!watserv1!watdragon!rose!ccplumb
From: ccplumb@rose.waterloo.edu (Colin Plumb)
Newsgroups: comp.arch
Subject: Re: Intel 860 Architecture
Message-ID: <19151@watdragon.waterloo.edu>
Date: 10 Dec 89 05:42:27 GMT
References: <3818@convex.UUCP>
Sender: daemon@watdragon.waterloo.edu
Reply-To: ccplumb@rose.waterloo.edu (Colin Plumb)
Organization: U. of Waterloo, Ontario
Lines: 69

In article <3818@convex.UUCP> hamrick@convex.COM (Ed Hamrick) writes:
>2) How deep is the pipeline for 64 bit adds / multiplies?  32 bit?

It's 3 stages for most things, and 2 for d.p. multiplies.  However, in
the latter case, each stage takes 2 cycles, so you only get one result
per 2 clocks.

>3) What happens to the pipeline if there are page faults / exceptions
>   during dual operation mode?  Does the pipeline advance one step
>   per clock cycle, or one step per floating instruction?

I don't quite understand.  The pipeline advances one stage per floating
instruction.  The instruction's dest field specifies where to put the
result currently leaving the pipeline, not the result of the operation
you're currently starting.

The i860's exception handling is seriously weird.  It saves just barely
enough information for an exception handling routine to figure out what
went wrong and fix it.  No fast context switches on this puppy!  And
even then, there are code constructs you have to avoid, like branching
to the shadow of a delayed branch.  It only saves one address, so the
exception handler has to look back one instruction to see where it
should resume... ugh.

>4) Is is possible to do pipelined FP loads with non-unit stride?

Certainly.  The pipelined load business just makes the latency visible
to the programmer; you still supply one address per load.  There is no
auto-increment feature.  A pipelined load is just a load that doesn't
get satisfied until after you've issued the next pipelined load; other
than that it's normal.

>5) Is it possible to do pipelined scatter/gather operations?

Again, sure, if you want to write the software to compute the
scatter/gather business.  I believe the load pipeline is 2 deep (I may
have forgotten).  This means that for the first two instructions you
issue, you supply addresses and bogus destination registers.  For the
third pipelined load, you supply the third address and the destination
for the first load (which hopefully has completed by now).  There's
nothing you couldn't do with aggressive scoreboarding and ordinary
loads, except that not having to supply a destination register until
the data is ready gives you another register for those few clocks.

>6) The 860 doesn't seem to have integer multiplication instructions,
>   and also doesn't seem to have any integer to floating conversion
>   instructions.  What are the best ways to do efficient integer
>   multiplication with the 860?  Does this have something to do with
>   the fmlow instruction?

Ug...  I'm forgetting.  I believe the fmlow instruction can do an
integer multiply, and I'm pretty sure there are int<->fp conversion
instructions.

>All in all, it looks like a well thought out chip, with a lot of clever
>architectural trade-offs to get everything on one chip.

To be honest, I wasn't too impressed when I saw it.  Lots of weird
non-orthogonalities, and I still think the interrupt handling is a pig.
But I believe some of the design team reads comp.arch; let them refute.

(Note that I believe an interrupt take/return should take about twice
as long as a function call/return.  The 29000 is still too slow, but
shows how simple an interrupt handling structure can be.  I still
wonder what the chip is doing for all those cycles.  All it has to do
is freeze the status registers, set supervisor mode, clear the
pipeline, and start fetching from a new address: a non-delayed jump
with a little bit of fiddling.)
--
	-Colin
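
The pipelined-load behaviour described in the answers to questions 4
and 5 can be sketched as a small software model.  This is Python, not
i860 code, and only a rough illustration: the function names are made
up, the depth of 2 is just the figure recalled (with doubt) above and is
left as a parameter, and the "bogus destination" of the first loads is
modelled as simply being ignored, which is a simplification.

```python
from collections import deque

def run_pipelined_loads(memory, issues, depth=2):
    """Model a load pipeline `depth` stages deep (depth is an assumption).

    Each issue is (address, dest).  dest receives the data of the load
    issued `depth` instructions earlier.  While the pipeline is still
    filling there is no such load, so dest is ignored here (the "bogus
    destination" case; a simplification of what real hardware does).
    """
    regs = {}
    pipe = deque()              # addresses of loads still in flight
    for addr, dest in issues:
        if len(pipe) == depth:
            regs[dest] = memory[pipe.popleft()]
        pipe.append(addr)
    return regs

def gather(memory, indices, depth=2):
    """Fetch memory[i] for each i in indices through the modelled pipeline.

    One address is supplied per load, so any stride or index pattern
    works; nothing here relies on auto-increment.
    """
    n = len(indices)
    if n == 0:
        return []
    # Real loads, then `depth` dummy loads whose only job is to flush
    # the last results out of the pipeline.
    addrs = list(indices) + [indices[-1]] * depth
    issues = [(a, k - depth if k >= depth else None)
              for k, a in enumerate(addrs)]
    regs = run_pipelined_loads(memory, issues, depth)
    return [regs[k] for k in range(n)]

memory = [10 * i for i in range(16)]
strided = gather(memory, range(0, 16, 4))   # non-unit stride
scattered = gather(memory, [7, 2, 9, 2])    # arbitrary gather indices
```

In this model the destination named by load k is filled with the data
for load k - depth, so a loop has to issue `depth` extra loads at the
end just to drain the pipeline.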