Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!mit-eddie!think!ames!ptsfa!ihnp4!chinet!steinmetz!jesup
From: jesup@steinmetz.UUCP
Newsgroups: comp.arch
Subject: Re: Word vs. Byte Orientation
Message-ID: <1449@steinmetz.steinmetz.UUCP>
Date: Tue, 14-Apr-87 23:29:48 EST
Article-I.D.: steinmet.1449
Posted: Tue Apr 14 23:29:48 1987
Date-Received: Fri, 17-Apr-87 03:24:47 EST
References: <16122@amdcad.AMD.COM>
Reply-To: jesup@kbsvax.steinmetz.UUCP (Randell Jesup)
Distribution: na
Organization: General Electric CRD, Schenectady, NY
Lines: 44

[re: discussion on word addressed memory vs. byte addressed w/ alignment net]

	Having an alignment network on chip does not necessarily cost you
in critical path, depending on your design.  In the one design I am familiar
with, the net doesn't cost us anything, even at considerably more than 30 MHz.
It is done at the end of the cycle that latches the data onto the chip (if I
remember correctly).  In any case, it is not on the critical path.
	In the other direction (stores), the data goes through a network again,
and 4 lines are driven as appropriate (one write line for each byte).
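
To make the store side concrete, here is a minimal sketch in C of what such a network has to compute: which of the 4 byte write lines to drive, and where to put the data. The function names, the little-endian lane numbering, and the restriction to naturally aligned accesses are all assumptions for illustration, not a description of any particular chip.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns a 4-bit mask in which bit i set means
 * "drive the write line for byte lane i".
 * Assumptions: 32-bit words, lane 0 holds the byte at (addr & 3) == 0
 * (little-endian numbering), and only naturally aligned accesses. */
unsigned byte_write_enables(uint32_t addr, unsigned size)
{
    unsigned offset = addr & 3u;          /* byte position within the word */

    assert(size == 1 || size == 2 || size == 4);
    assert(offset % size == 0);           /* natural alignment only */

    return ((1u << size) - 1u) << offset; /* 'size' adjacent lanes */
}

/* Hypothetical helper: shifts the register value onto the byte lanes
 * selected by byte_write_enables(). */
uint32_t align_store_data(uint32_t addr, uint32_t data)
{
    return data << (8u * (addr & 3u));
}
```

For example, a byte store to an address ending in binary 01 drives only lane 1, while a full word store drives all four lanes.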
	According to our figures, loads/stores are about 40-50% of instructions,
with about 2 loads per store.
	Lack of direct byte support can (depending on application) cost you
a fair amount.
	It all comes down to the hardware:  If it costs you more cycles
(on average) to add the alignment net than it would cost to synthesize the
byte/halfword loads/stores from word loads/stores and byte insert/extract
instructions, then don't use the net.  But if it's even, or in favor of the 
net, definitely use the net.  If you don't, you'll need to decode at least
two more instructions.
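
As a rough illustration of the alternative, the sketch below synthesizes byte loads and stores from word accesses plus extract/insert steps. Everything here is hypothetical: the function names, the word-array model of memory, and the little-endian byte numbering are assumptions, and a real machine would do each extract or insert in a single instruction rather than a C function.

```c
#include <stdint.h>

/* Extract step: pick one byte out of an already-loaded word.
 * Assumes little-endian byte numbering within the word. */
uint32_t extract_byte(uint32_t word, uint32_t addr)
{
    return (word >> (8u * (addr & 3u))) & 0xFFu;
}

/* Insert step: replace one byte of a word, leaving the other three. */
uint32_t insert_byte(uint32_t word, uint32_t addr, uint32_t byte)
{
    unsigned shift = 8u * (addr & 3u);
    return (word & ~(0xFFu << shift)) | ((byte & 0xFFu) << shift);
}

/* Byte load on a word-only memory: word load, then extract. */
uint32_t load_byte(const uint32_t *mem, uint32_t addr)
{
    uint32_t word = mem[addr >> 2];       /* word load */
    return extract_byte(word, addr);      /* extra instruction */
}

/* Byte store on a word-only memory: word load, insert, word store. */
void store_byte(uint32_t *mem, uint32_t addr, uint32_t byte)
{
    uint32_t word = mem[addr >> 2];       /* word load */
    word = insert_byte(word, addr, byte); /* extra instruction */
    mem[addr >> 2] = word;                /* word store */
}
```

Note that in this model a byte store turns into a read-modify-write of the containing word, which is part of what the extra instructions pay for.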
	Picking numbers out of the air, if the alignment net costs 1 cycle
on loads and stores, and 65% are loads, and 40% of instructions are loads or
stores, and the extra cycle can't be filled 50% of the time, it will cost you:
	40% * 65% * 1 cycle * 50% = .13 ~= .1 cycles/instruction
(the extra cycle on a store doesn't block later instructions, just takes
longer for it to get to memory.)
	If you must do byte insert/extract, each of which costs one cycle
(I assume there are halfword insert/extract too; otherwise it'll be worse),
and assuming 80% of accesses are word, 20% non-word, it will cost you:
	40% * 100% * 1 cycle * 20% = .08 ~= .1 cycles/instruction
	If there are destination/source interlocks, and 50% of the interlocks
are fillable, add .5 cycles/instruction to that, making it .6 cycles/
instruction.
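
The two base estimates above (before any interlock adjustment) can be reproduced with a small C sketch. The function and parameter names are made up for illustration, and the inputs are the same out-of-the-air figures used in the text, not measured data.

```c
#include <assert.h>

/* Cycles/instruction lost to an alignment net that adds 'penalty_cycles'
 * to every load, when only 'unfilled_frac' of those extra cycles cannot
 * be hidden.  Stores are ignored, as in the text. */
double net_cost(double mem_frac, double load_frac,
                double penalty_cycles, double unfilled_frac)
{
    assert(mem_frac >= 0.0 && mem_frac <= 1.0);
    assert(load_frac >= 0.0 && load_frac <= 1.0);
    assert(unfilled_frac >= 0.0 && unfilled_frac <= 1.0);
    return mem_frac * load_frac * penalty_cycles * unfilled_frac;
}

/* Cycles/instruction lost to synthesizing non-word accesses, where every
 * non-word load or store needs one insert/extract costing 'cycles_per_op'. */
double synth_cost(double mem_frac, double nonword_frac, double cycles_per_op)
{
    assert(mem_frac >= 0.0 && mem_frac <= 1.0);
    assert(nonword_frac >= 0.0 && nonword_frac <= 1.0);
    return mem_frac * nonword_frac * cycles_per_op;
}
```

With the figures from the text, net_cost(0.40, 0.65, 1.0, 0.50) comes to 0.13 and synth_cost(0.40, 0.20, 1.0) comes to 0.08, both of which round to the .1 cycles/instruction quoted above.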
	Now, all these numbers are fiction, but they aren't far from the
actual numbers we see in our data (I'm at home now).  From my point of view,
any work that might reduce the penalty of the alignment net to less than
1 cycle is a big win (and as I said, 0 is definitely possible, even above
30MHz, depending on design.)  Also, you reduce the decode complexity (maybe),
by having fewer instructions.  If you can save .1 cycles/instruction on a
machine that averages around 1 cycle/instruction, you should get about a 10%
performance increase.  Worth a lot of work and silicon, if you ask me.

	Randell Jesup
	jesup@steinmetz.uucp
	jesup@ge-crd.arpa