Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!mit-eddie!think!ames!ptsfa!ihnp4!chinet!steinmetz!jesup
From: jesup@steinmetz.UUCP
Newsgroups: comp.arch
Subject: Re: Word vs. Byte Orientation
Message-ID: <1449@steinmetz.steinmetz.UUCP>
Date: Tue, 14-Apr-87 23:29:48 EST
Article-I.D.: steinmet.1449
Posted: Tue Apr 14 23:29:48 1987
Date-Received: Fri, 17-Apr-87 03:24:47 EST
References: <16122@amdcad.AMD.COM>
Reply-To: jesup@kbsvax.steinmetz.UUCP (Randell Jesup)
Distribution: na
Organization: General Electric CRD, Schenectady, NY
Lines: 44

[re: discussion on word addressed memory vs. byte addressed w/ alignment net]

Having an alignment network on chip does not necessarily cost you in critical path, depending on your design. In the one design I am familiar with, the net doesn't cost us anything, even at considerably more than 30 MHz. It is done at the end of the cycle that latches the data onto the chip (if I remember correctly). In any case, it is not on the critical path. In the other direction (stores), the data goes through a network again, and 4 lines are driven as appropriate (one write line for each byte).

According to our figures, loads/stores are about 40-50% of instructions, with about 2 loads per store. Lack of direct byte support can (depending on application) cost you a fair amount.

It all comes down to the hardware: if it costs you more cycles (on average) to add the alignment net than it would cost to synthesize the byte/halfword loads/stores from word loads/stores and byte insert/extract instructions, then don't use the net. But if it's even, or in favor of the net, definitely use the net. If you don't, you'll need to decode at least two more instructions.
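To make the "synthesize from word loads/stores" alternative concrete, here is a minimal sketch of what the two extra instructions (byte extract and byte insert) would have to do. Everything in it is an assumption for illustration: 32-bit words, byte 0 as the least significant byte, memory modeled as a Python list of words, and the function names themselves. It is not a description of any particular chip.

```python
WORD_BYTES = 4            # assumed 32-bit words
WORD_MASK = 0xFFFFFFFF


def extract_byte(word, lane):
    """Byte-extract step: return byte `lane` of `word` (lane 0 = least significant)."""
    return (word >> (8 * lane)) & 0xFF


def insert_byte(word, lane, value):
    """Byte-insert step: return `word` with byte `lane` replaced by `value`."""
    shift = 8 * lane
    cleared = word & ~(0xFF << shift) & WORD_MASK
    return cleared | ((value & 0xFF) << shift)


def load_byte(mem, addr):
    """Byte load without an alignment net: word load, then extract."""
    word = mem[addr // WORD_BYTES]
    return extract_byte(word, addr % WORD_BYTES)


def store_byte(mem, addr, value):
    """Byte store without per-byte write lines: word load, insert, word store."""
    index = addr // WORD_BYTES
    mem[index] = insert_byte(mem[index], addr % WORD_BYTES, value)
```

In this model a byte load takes two operations and a byte store becomes a read-modify-write of the whole word, which is the extra work the alignment net and per-byte write lines would otherwise absorb.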
Picking numbers out of the air: if the alignment net costs 1 cycle on loads and stores, 40% of instructions are loads or stores, 65% of those are loads, and the extra cycle can't be filled 50% of the time, it will cost you:

    40% * 65% * 1 cycle * 50% ~= .1 cycles/instruction

(The extra cycle on a store doesn't block later instructions; the data just takes longer to get to memory.)

If you must do byte insert/extract, each of which costs one cycle (I assume there are halfword insert/extract instructions as well; otherwise it'll be worse), and assuming 80% of accesses are word and 20% non-word, it will cost you:

    40% * 100% * 1 cycle * 20% ~= .1 cycles/instruction

If there are destination/source interlocks, and 50% of the interlocks are fillable, add .5 cycles/instruction to that, making it .6 cycles/instruction.

Now, all these numbers are fiction, but they aren't far from the actual numbers we see in our data (I'm at home now). From my point of view, any work that might reduce the penalty of the alignment net to less than 1 cycle is a big win (and as I said, 0 is definitely possible, even above 30 MHz, depending on design). Also, you (maybe) reduce the decode complexity by having a smaller number of instructions. If you can save .1 cycles/instruction, you should get about a 10% performance increase. Worth a lot of work and silicon, if you ask me.

    Randell Jesup
    jesup@steinmetz.uucp
    jesup@ge-crd.arpa
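The back-of-the-envelope estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function and parameter names are invented here, and the inputs are the post's admittedly fictional percentages. The interlock term is a guess at the intended calculation (one stall cycle per non-word access whose interlock slot cannot be filled); under that reading it comes out near 0.04 cycles/instruction rather than the .5 quoted, so treat that part with particular caution.

```python
def net_cost(mem_frac, load_frac, penalty_cycles, unfilled_frac):
    """Average cycles/instruction added by an alignment net that delays loads.

    Stores are left out because, per the post, their extra cycle
    does not block later instructions.
    """
    return mem_frac * load_frac * penalty_cycles * unfilled_frac


def extract_insert_cost(mem_frac, nonword_frac, op_cycles):
    """Average cycles/instruction added by explicit byte/halfword insert/extract."""
    return mem_frac * nonword_frac * op_cycles


def interlock_cost(mem_frac, nonword_frac, unfilled_frac, stall_cycles):
    """Assumed model: each non-word access risks one interlock stall."""
    return mem_frac * nonword_frac * unfilled_frac * stall_cycles


# The post's illustrative inputs.
net = net_cost(0.40, 0.65, 1, 0.50)              # about 0.13, rounded to .1 in the post
synth = extract_insert_cost(0.40, 0.20, 1)       # about 0.08, rounded to .1 in the post
stall = interlock_cost(0.40, 0.20, 0.50, 1)      # about 0.04 under the assumed model

print(f"alignment net:   {net:.2f} cycles/instruction")
print(f"insert/extract:  {synth:.2f} cycles/instruction")
print(f"with interlocks: {synth + stall:.2f} cycles/instruction")
```

Changing `penalty_cycles` to 0 models the case where the net is kept off the critical path, which is where the alignment net comes out clearly ahead.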