Path: utzoo!attcan!utgpu!watserv1!watdragon!rose!ccplumb
From: ccplumb@rose.waterloo.edu (Colin Plumb)
Newsgroups: comp.arch
Subject: Re: Intel 860 Architecture
Message-ID: <19151@watdragon.waterloo.edu>
Date: 10 Dec 89 05:42:27 GMT
References: <3818@convex.UUCP>
Sender: daemon@watdragon.waterloo.edu
Reply-To: ccplumb@rose.waterloo.edu (Colin Plumb)
Organization: U. of Waterloo, Ontario
Lines: 69

In article <3818@convex.UUCP> hamrick@convex.COM (Ed Hamrick) writes:
>2) How deep is the pipeline for 64 bit adds / multiplies? 32 bit?

It's 3 stages for most things, and 2 for d.p. multiplies.  However,
in the latter case, each stage takes 2 cycles, so you only get one result
per 2 clocks.
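
To put numbers on that, here is a back-of-envelope model (mine, not
anything from the Intel documentation).  It assumes you issue a new
operation every stage time and never stall; the function name and
parameters are made up for illustration.

```python
def cycles_for_results(n, stages, cycles_per_stage=1):
    """Clocks until the n-th result emerges from an ideal pipeline.

    Assumes one operation is issued per stage time and nothing stalls.
    """
    if n <= 0:
        return 0
    fill = stages * cycles_per_stage            # latency of the first result
    return fill + (n - 1) * cycles_per_stage    # then one result per stage time


# 3 stages at 1 clock each: first result after 3 clocks, then 1 per clock.
print(cycles_for_results(100, stages=3))                      # 102
# d.p. multiply: 2 stages at 2 clocks each, so 1 result per 2 clocks.
print(cycles_for_results(100, stages=2, cycles_per_stage=2))  # 202
```

So under these assumptions the d.p. multiply has about the same latency
as the 3-stage case but half the sustained rate.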

>3) What happens to the pipeline if there are page faults / exceptions
>   during dual operation mode?  Does the pipeline advance one step
>   per clock cycle, or one step per floating instruction?

I don't quite understand.  The pipeline advances one stage per floating
instruction.  The instruction's dest specification specifies where to
put the current result, not the result of the operation you're
currently starting.
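
A toy model of that dest behaviour, as I understand it, might look like
the sketch below.  The depth of 3 is the "most things" case above; the
class, its methods, and the use of None as a throwaway destination are
my own inventions for illustration, not i860 syntax.

```python
from collections import deque
import operator


class FloatPipe:
    """Toy pipeline that advances one stage per issued instruction."""

    def __init__(self, depth=3):
        # Stages start empty; index 0 is the newest operation.
        self.stages = deque([None] * depth)
        self.regs = {}

    def issue(self, op, a, b, dest):
        # The result leaving the last stage goes to THIS instruction's
        # dest, not to the dest of the instruction that started it.
        leaving = self.stages.pop()
        if leaving is not None and dest is not None:
            self.regs[dest] = leaving
        # The operation being started enters the first stage.
        self.stages.appendleft(op(a, b))


pipe = FloatPipe(depth=3)
pipe.issue(operator.add, 1.0, 2.0, None)    # start op 1, nothing to store yet
pipe.issue(operator.add, 3.0, 4.0, None)    # start op 2
pipe.issue(operator.add, 5.0, 6.0, None)    # start op 3
pipe.issue(operator.add, 0.0, 0.0, "f8")    # start op 4; f8 gets op 1's result
print(pipe.regs)                            # {'f8': 3.0}
```

Note that nothing moves unless you issue another floating instruction,
which is the "one stage per floating instruction" point.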

The i860's exception handling is seriously weird.  It saves just
barely enough information for an exception handling routine to
figure out what went wrong and fix it.  No fast context switches
on this puppy!  And even then, there are code constructs you have
to avoid, like branching to the shadow of a delayed branch.  It
only saves one address, so the exception handler has to look back
one instruction to see where it should resume... ugh.

>4) Is is possible to do pipelined FP loads with non-unit stride?

Certainly.  The pipelined load business just makes the latency
visible to the programmer; you still supply one address per
load.  There is no auto-increment feature.  A pipelined load is
just a load that doesn't get satisfied until after you've issued
the next pipelined load; other than that it's normal.

>5) Is it possible to do pipelined scatter/gather operations?

Again, sure, if you want to write the software to compute the
scatter/gather addresses.  I believe the load pipeline is 2 deep (I may
have forgotten).  This means that for the first two loads you issue,
you supply addresses and bogus destination registers.  For the third
pipelined load, you supply the third address and the destination for
the first load (which hopefully has completed by now).  There's nothing
you couldn't do with aggressive scoreboarding and ordinary loads, except
that not having to supply a destination register until the data is
ready gives you another register for those few clocks.
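
Here is a sketch of a software gather on top of such a load pipe.  The
depth of 2 is my recollection from above and may be wrong; the function
name is made up, and draining the pipe with extra dummy loads of the
last address is my assumption about how you'd flush it, not something
I've checked against the manual.

```python
from collections import deque


def pipelined_gather(memory, indices, depth=2):
    """Gather memory[i] for each i, using a load pipe `depth` deep.

    Each issued load supplies an address now and receives the data of
    the load issued `depth` loads earlier.
    """
    if not indices:
        return []
    in_flight = deque()
    out = []
    # Assumption: extra dummy loads push the last real results out.
    drain = [indices[-1]] * depth
    for addr in list(indices) + drain:
        in_flight.append(memory[addr])       # supply one address per load
        if len(in_flight) > depth:
            out.append(in_flight.popleft())  # dest receives an older load
    return out


mem = [10, 20, 30, 40, 50, 60]
print(pipelined_gather(mem, [4, 0, 2]))       # [50, 10, 30]
print(pipelined_gather(mem, range(0, 6, 2)))  # [10, 30, 50]
```

The second call is the non-unit-stride case from question 4: the
hardware doesn't care where the addresses come from, since the software
computes every one of them.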

>6) The 860 doesn't seem to have integer multiplication instructions,
>   and also doesn't seem to have any integer to floating conversion
>   instructions.  What are the best ways to do efficient integer
>   multiplication with the 860?  Does this have something to do with
>   the fmlow instruction?

Ug... I'm forgetting.  I believe the fmlow instruction can do an integer
multiply, and I'm pretty sure there are int<->fp conversion instructions.

>All in all, it looks like a well thought out chip, with a lot of clever
>architectural trade-offs to get everything on one chip.

To be honest, I wasn't too impressed when I saw it.  Lots of weird
non-orthogonalities and I still think the interrupt handling is
a pig.  But I believe some of the design team reads comp.arch; let
them refute.

(Note that I believe an interrupt take/return should take about twice
as long as a function call/return.  The 29000 is still too slow, but
shows how simple an interrupt handling structure can be.  I still
wonder what the chip is doing for all those cycles.  Freeze status
registers, set supervisor mode, clear pipeline, and start fetching
from a new address.  A non-delayed jump with a little bit of fiddling.)
-- 
	-Colin