Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!apple!amdcad!mozart.amd.com!nucleus!davec
From: davec@nucleus.amd.com (Dave Christie)
Newsgroups: comp.arch
Subject: Re: Is handling off-alignment important?
Message-ID: <1990Jul25.223437.15301@mozart.amd.com>
Date: 25 Jul 90 22:34:37 GMT
References: <104037@convex.convex.com> <8840016@hpfcso.HP.COM> <2370@crdos1.crd.ge.COM>
Sender: usenet@mozart.amd.com (Usenet News)
Reply-To: davec@nucleus.amd.com (Dave Christie)
Organization: Advanced Micro Devices, Inc., Austin, Texas
Lines: 81

In article <2370@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>
>  Alternatively the hardware can support unaligned fetch. It doesn't
>have to be efficient, because you would have to make an effort to make
>the fetch logic slower than software, it just has to work. This makes
>the program a bit smaller, and assuming that the chip logic is right, it
>prevents everyone from implementing their own try at access code.
   .
   .
>  Note that this is not a RISC issue, in that the bus interface unit
>already may be doing things like cache interface, multiplexing lines,
>controlling status lines, etc. The BIU is not really RISC in that sense,
>it functions like a coprocessor if you draw a logic diagram, whose
>function is to provide data, which can go in the pipeline or into the
>CPU.

Note that there are two degrees of misalignment: 
	1) within a word, and
	2) crossing a word (and possibly page) boundary.

For 1):
If the realignment hardware is not in your main fetch path because it
would impact your cycle time, then it will likely mean an extra stage
of processing for instructions which use it, which can add various bits
of complexity.  Considering that, plus
	1) a 4-way mux isn't a serious time sink, and
	2) how much, or even whether, it influences the cycle time is 
	   technology and implementation dependent
then you are likely just going to stick it in the main fetch path
and do it efficiently, w.r.t. layout, etc.  Now, if the end user
does pay for this, it isn't likely going to be in performance, because
even though it could in principle influence the cycle time, in practice
it won't.  Chips come in "standard" operating frequencies these days
(e.g. 16, 20, 25, 30, 40, 50 MHz), and the difference that a 4-way mux
might make would tend to be taken care of by the process tweaking that's
done to get to the desired frequency.  In this case, the realignment
hardware influences yield rather than cycle time, hence cost rather than
performance.  I can't think of
any processor that doesn't support this degree of realignment (some
better than others).
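
To make the "4-way mux" concrete, here is a rough software model of
within-word realignment. It is only a sketch: the helper name
extract_within_word is hypothetical, and 32-bit words with little-endian
byte lanes are assumptions, not something tied to any particular chip.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pull a 1- or 2-byte field out of an aligned
 * 32-bit word that has already been fetched.  Assumes little-endian
 * lane numbering (byte 0 is the least significant byte). */
static uint32_t extract_within_word(uint32_t word, uint32_t addr,
                                    unsigned size)
{
    unsigned lane = addr & 3u;        /* low two address bits: 4-way select */
    uint32_t mask;

    assert(size == 1 || size == 2);
    assert(lane + size <= 4);         /* field must not cross the word (that is case 2) */

    mask = (size == 1) ? 0xFFu : 0xFFFFu;
    return (word >> (8u * lane)) & mask;
}
```

The whole job is one select on the two low address bits plus a mask,
with a single memory reference, which is why this degree of realignment
is cheap to put in the main fetch path.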

For 2):
This, IMHO, is one of the more significant things that differentiates
"RISC" from "CISC".  The notion of one instruction making multiple
references to memory tends to make RISC designers get red in the face
and jump up and down.  (Yes, I'm well aware of the 29K's load and store
multiple instructions, and while I'm not fond of them, there are some
significant differences between that and handling unaligned accesses.)
The extra control complexity this introduces is a significant increment,
especially considering all the nightmarish end cases that have already
been described in this thread.  The added complexity is dependent on
architecture and implementation, and tends to be worse for stores, but
at any rate it tends to increase design/debug time, and more importantly
can cause much hair pulling and resume writing when one attempts really
high performance implementations.  (I've know people who thrive on such
complexity, for complexity's sake - they should be removed from the gene 
pool (0.5 :-).  With the realestate one has to play with these days,
you can find room for the complexity to keep the performance up, but it
still influences the cost (and number of errata after release).
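
For contrast with case 1, here is roughly what a word-crossing load
amounts to when modelled in software. Again only a sketch: the name
load_u32_unaligned, the array-of-words memory model and the little-endian
layout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read 32 bits starting at an arbitrary byte
 * address from memory modelled as an array of aligned 32-bit words.
 * Assumes little-endian byte order. */
static uint32_t load_u32_unaligned(const uint32_t *mem, size_t nwords,
                                   size_t byte_addr)
{
    size_t   idx   = byte_addr / 4;
    unsigned shift = 8u * (unsigned)(byte_addr % 4);
    uint32_t lo, hi;

    assert(idx < nwords);
    lo = mem[idx];                    /* first memory reference */
    if (shift == 0)
        return lo;                    /* aligned: one reference is enough */

    assert(idx + 1 < nwords);
    hi = mem[idx + 1];                /* second reference: may sit in another
                                         cache line or page */
    return (lo >> shift) | (hi << (32u - shift));
}
```

The shifting and merging is the easy part. The second reference is what
hurts in hardware: it can miss or fault independently of the first, and
for a store the first half may already have been written by then.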

I don't know of any "new" architecture chips with decent performance
that support realignment across words in one instruction. Why do
the common CISC chips support it?
	1) it's not as big an increment in complexity (no smiley)
	2) backwards compatibility (i.e. they have no choice)

In summary, the cost you will tend to see will be $ more than performance,
although at the high end of the performance spectrum you might pay in
performance as well - that's hard to say, since processors which support
word-crossing accesses tend to have a lot of other complexities which
influence cost/performance as well.

What makes sense depends on the intended applications, of course.  It 
may indeed make some network software run significantly faster, for 
instance.  But if that network software consumed 5% of all the cycles 
of all the processors I had sold, and such hardware support would
*double* the network software performance, I still wouldn't risk
screwing up everything else to go for an aggregate 2.5% performance
improvement.
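
The arithmetic behind that 2.5% figure, as a small sketch. The function
names cycles_saved and overall_speedup are hypothetical; the 5% and 2x
inputs are the ones from the example above.

```c
#include <assert.h>

/* fraction: share of all cycles spent in the code being accelerated (0..1)
 * speedup:  factor by which that code gets faster (> 0)
 * Returns the share of total cycles that is saved. */
static double cycles_saved(double fraction, double speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(speedup > 0.0);
    return fraction * (1.0 - 1.0 / speedup);
}

/* Overall speedup of the whole workload (the usual Amdahl's-law form). */
static double overall_speedup(double fraction, double speedup)
{
    return 1.0 / (1.0 - cycles_saved(fraction, speedup));
}
```

With fraction = 0.05 and speedup = 2.0, cycles_saved gives 0.025, i.e.
2.5% of all cycles, and overall_speedup comes out a little under 1.026.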

----------------------------
Dave Christie
My humble opinions only.