Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!wuarchive!uunet!mcsun!ukc!acorn!armltd!abaum
From: abaum (Allen Baum)
Newsgroups: comp.arch
Subject: Re: endian etc
Message-ID: <166@armltd.uucp>
Date: 13 May 91 10:37:17 GMT
References: <3407@spim.mips.COM>
Sender: abaum@armltd.uucp
Distribution: comp
Organization: A.R.M. Ltd, Swaffham Bulbeck, Cambs, UK
Lines: 54

In article <3407@spim.mips.COM> zalman@mips.com (Zalman Stern) writes:

>A real solution would be to add byte lane swapping hardware to the chip.
.....
>So why don't we do this? The word over here in software is that the
>hardware is expensive in terms of space and time. It's very likely to end up
>on the critical path for loads and stores. Any impact on cycle time to
>support bi-endianness is a lose.

I've thought about this a bit, and I'm not sure it's true. All the byte lane
switching hardware already exists for the low order byte; you've got to be
able to select any of the four bytes to be placed in the low byte of the
register for load_bytes (and move the low byte to any of the four for
store_bytes). The next byte has to select from only its own position and the
MSByte for load_halfword (& reverse for store_halfword). The upper halfword
doesn't get mucked with at all, except for getting sign extended or cleared.
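
To make the asymmetry concrete, here is a rough C model of the load side
of those muxes. It is only a sketch: the function names are mine, and I'm
assuming lanes are numbered by significance (lane 0 = LS byte, lane 3 = MS
byte), with the select already decoded from the low address bits.

```c
#include <stdint.h>

/* Hypothetical helper: pick byte lane n (0 = LS, 3 = MS) out of a word. */
static uint8_t lane(uint32_t word, unsigned n)
{
    return (uint8_t)(word >> (8 * (n & 3u)));
}

/* load_byte: register byte 0 needs a full 4:1 mux over all four lanes. */
uint32_t model_load_byte(uint32_t word, unsigned sel)
{
    return lane(word, sel);
}

/* load_halfword: register byte 0 picks lane 0 or 2, register byte 1
 * picks its own lane (1) or the MS lane (3). upper_half is 0 or 1. */
uint32_t model_load_half(uint32_t word, unsigned upper_half)
{
    unsigned base = (upper_half & 1u) ? 2u : 0u;
    return (uint32_t)lane(word, base)
         | ((uint32_t)lane(word, base + 1u) << 8);
}
```

Register byte 0 sees four sources, byte 1 sees two, and bytes 2 and 3 see
only their own lane (plus the sign/zero fill).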

So, the critical path already exists. The muxing isn't terribly symmetrical:
all the work goes into the LS byte, almost none into the MS byte. That means
the layout probably has holes in it, which could be filled with the rest of
the byte lane logic, at (here I'm speculating some) no extra cost in space or
time.

Not quite true, of course. It will take some extra time to generate the
mux control signals, and some extra time to buffer them to cover all four
bytes. All this can be done in parallel with the load, however, and should
cost no extra time. So there. (Can anyone who actually knows what they're 
talking about please refute?)
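
For what it's worth, the control generation for a byte load can be
sketched like this. This is my own illustration, not anybody's actual
design: the names are made up, and it assumes lanes numbered by
significance (0 = LS) and an endian mode bit that is known well before
the data arrives.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mux select for register byte 0 on a byte load. In big-endian mode the
 * byte at address offset 0 sits in the MS lane, so the two low address
 * bits are simply inverted. */
unsigned byte_lane_select(unsigned addr_lo2, bool big_endian)
{
    return (addr_lo2 ^ (big_endian ? 3u : 0u)) & 3u;
}

/* Zero-extending byte load from an aligned word, either endianness. */
uint32_t load_byte_bi(uint32_t word, unsigned addr_lo2, bool big_endian)
{
    return (word >> (8 * byte_lane_select(addr_lo2, big_endian))) & 0xFFu;
}
```

The select depends only on the address and the mode, not on the data,
which is why it can be computed while the cache access is in flight.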

By the way, note that while the buffering, etc. can go on in parallel,
sign extension can't: you can have the muxes set up for sign selection,
but you have to wait until the sign bit gets there, buffer it so it can
drive 24 loads, and then stick it into all those upper bit positions.
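
In C terms the data dependence looks roughly like the sketch below (the
function name is hypothetical). The point is that the fill value is a
function of a bit that comes out of the byte mux, so it is serial with
the data rather than with the address.

```c
#include <stdint.h>

/* Sign-extend an already-selected low byte to 32 bits. */
uint32_t sign_extend_low_byte(uint32_t low_byte)
{
    /* Known only after the byte mux has settled. */
    uint32_t sign = (low_byte >> 7) & 1u;

    /* One bit fanned out to the 24 upper bit positions. */
    uint32_t fill = (0u - sign) & 0xFFFFFF00u;

    return (low_byte & 0xFFu) | fill;
}
```

A zero-extending load just forces the upper 24 bits to zero, which needs
nothing from the data at all.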

As a sweeping generalization, the path from cache to registers/forwarding
path is THE critical path (if it isn't, you've probably done the design
wrong, or have a CISC architecture) (This is just a sweeping generalization,
mind you, not truth). So, this extra bit of gate delay can conceivably have
quite an impact on your cycle time. Which is why HP-PA doesn't have sign
extending loads.

>
>>	Wouldn't it be very easy on a machine with a write back cache
>>to copy words simply by changing the internal cached address ( a little
>>like a form of cache aliasing). A lot of time is spent in most code
>>just copying things around. Would this not improve things ( you gain
>>immediately on cache occupancy).
>

This can work, but only on entire lines, and only for line aligned addresses.
As noted, this is very cache organization sensitive, and you might want to
restrict it to stuff in the kernel. It might work very well for frame buffer
and entire page moves.
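
The restriction amounts to a check like the one below before a copy
routine could even consider the trick. Purely a sketch: the 32-byte line
size and the function name are assumptions of mine, and a real
implementation would also depend on the cache organization.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 32u  /* hypothetical cache line size */

/* True only if the copy covers whole lines and both ends are line aligned. */
bool can_copy_by_retag(uintptr_t src, uintptr_t dst, size_t len)
{
    return len != 0
        && src % LINE_BYTES == 0
        && dst % LINE_BYTES == 0
        && len % LINE_BYTES == 0;
}
```

Anything that fails the check falls back to an ordinary load/store copy.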