Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!wuarchive!uunet!mcsun!ukc!acorn!armltd!abaum
From: abaum (Allen Baum)
Newsgroups: comp.arch
Subject: Re: endian etc
Message-ID: <166@armltd.uucp>
Date: 13 May 91 10:37:17 GMT
References: <3407@spim.mips.COM>
Sender: abaum@armltd.uucp
Distribution: comp
Organization: A.R.M. Ltd, Swaffham Bulbeck, Cambs, UK
Lines: 54

In article <3407@spim.mips.COM> zalman@mips.com (Zalman Stern) writes:
>A real solution would be to add byte lane swapping hardware to the chip.
.....
>So why don't we do this? The word over here in software is that the
>hardware is expensive in terms of space and time. Its very likely to end up
>on the critical path for loads and stores. Any impact on cycle time to
>support bi-endianess is a lose.

I've thought about this a bit, and I'm not sure it's true. All the byte
lane switching hardware already exists for the low order byte; you've
got to be able to select any of the four bytes to be placed in the low
byte of the register for load_bytes (and move the low byte to any of
the four for store_bytes). The next byte has to select from only its
own position and the MSByte for load_halfword (& the reverse for
store_halfword). The upper halfword doesn't get mucked with at all,
except for getting sign extended or cleared. So, the critical path
already exists.

The muxing isn't terribly symmetrical: all the work goes into the LS
byte, almost none into the MS byte. That means the layout probably has
holes in it, which could be filled with the rest of the byte lane
logic, at (here I'm speculating some) no extra cost in space or time.

Not quite true, of course. It will take some extra time to generate the
mux control signals, and some extra time to buffer them to cover all
four bytes. All this can be done in parallel with the load, however,
and should cost no extra time. So there. (Can anyone who actually knows
what they're talking about please refute?)

By the way, note that while the buffering, etc. can go on in parallel,
sign extension, if required, can't. You can have the muxes set up for
sign selection, but you have to wait until the sign bit gets there,
buffer it so it can drive 24 loads, and then stick it into all those
upper bit positions. As a sweeping generalization, the path from cache
to registers/forwarding path is THE critical path (if it isn't, you've
probably done the design wrong, or have a CISC architecture). (This is
just a sweeping generalization, mind you, not truth.) So, this extra
bit of gate delay can conceivably have quite an impact on your cycle
time, which is why HP-PA doesn't have sign-extending loads.

>
>> Wouldn't it be very easy on a machine with a write back cache
>>to copy words simply by changing the internal cached address ( a little
>>like a form of cache aliasing). A lot of time is spent in most code
>>just copying things around. Would this not improve things ( you gain
>>immediately on cache occupancy).
>

This can work, but only on entire lines, and only for line-aligned
addresses. As noted, this is very cache organization sensitive, and you
might want to restrict it to stuff in the kernel. It might work very
well for frame buffer and entire page moves.
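
To make the byte lane argument earlier in this post a bit more concrete, here is a rough C model of the load-side muxing. It is only a software sketch, not a description of any real chip: the names lane, load_align and swap_lanes are made up for illustration, and it assumes that byte offset k within a word sits in lane k (little-endian wiring). The point is to show which result byte needs which mux inputs, and where the sign bit has to fan out.

```c
#include <assert.h>
#include <stdint.h>

/* Byte lane k of a 32-bit bus word; lane 0 is the least significant byte. */
static uint32_t lane(uint32_t w, unsigned k)
{
    return (w >> (8 * k)) & 0xFFu;
}

/*
 * Hypothetical model of the aligner a single-endian machine already has.
 * w        : the 32-bit word coming back from the cache
 * off      : byte offset of the access within the word (0..3)
 * size     : 1 (load_byte), 2 (load_halfword) or 4 (load_word)
 * sign_ext : nonzero to sign extend, zero to clear the upper bits
 */
static uint32_t load_align(uint32_t w, unsigned off, unsigned size, int sign_ext)
{
    uint32_t r;

    assert(size == 1 || size == 2 || size == 4);

    if (size == 4)
        return w;                       /* every lane passes straight through */

    if (size == 1) {
        r = lane(w, off & 3u);          /* result byte 0: 4:1 mux over all lanes */
        if (sign_ext && (r & 0x80u))
            r |= 0xFFFFFF00u;           /* one sign bit drives 24 upper bits */
        return r;
    }

    /* size == 2: the halfword is in lanes 0/1 or lanes 2/3 */
    unsigned base = off & 2u;
    r = lane(w, base)                   /* result byte 0: lane 0 or lane 2 */
      | (lane(w, base + 1u) << 8);      /* result byte 1: own lane or MS lane */
    if (sign_ext && (r & 0x8000u))
        r |= 0xFFFF0000u;               /* sign bit drives 16 upper bits */
    return r;
}

/*
 * The extra paths a bi-endian part would need for word accesses:
 * every result byte must also be able to take the mirrored lane.
 */
static uint32_t swap_lanes(uint32_t w)
{
    return (lane(w, 0) << 24) | (lane(w, 1) << 16)
         | (lane(w, 2) << 8)  |  lane(w, 3);
}
```

For instance, load_align(0x11223344, 1, 1, 0) selects lane 1 and yields 0x33, while swap_lanes(0x11223344) gives 0x44332211. In this model the lane selection depends only on off, size and the endian mode, all known before the data arrives, whereas the sign fill depends on a bit of the loaded data itself, which is the asymmetry the post describes.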