Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!seismo!gatech!amdcad!bcase
From: bcase@amdcad.UUCP
Newsgroups: comp.arch
Subject: Word vs. Byte Orientation
Message-ID: <16122@amdcad.AMD.COM>
Date: Mon, 13-Apr-87 15:54:25 EST
Article-I.D.: amdcad.16122
Posted: Mon Apr 13 15:54:25 1987
Date-Received: Wed, 15-Apr-87 00:56:41 EST
Organization: Advanced Micro Devices, Inc., Sunnyvale, Ca.
Lines: 326

In article <279@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>In article <16038@amdcad.AMD.COM> bcase@amdcad.AMD.COM (Brian Case) writes:
>Well, I've been staying away from this, but since you asked, OK.
>But Brian, I'm sad to say you may not be happy when you're finished
>reading this.

Well, I guess I am not overjoyed or anything, but if you meant "I think
you will regret designing the Am29000 with the emphasis on
word-addressing," you are wrong. If you meant "You'll be sorry you
asked," never! Let's have some lively discussions! I'll try to make my
points with as few words as possible.

>First, (Geoff Steckel) <466@alliant.UUCP> posted a pretty good overall
>analysis of the issues, so I won't repeat that, except:
>"Re bus width, byte extraction, unaligned operands, and memory speed:
> 1) Byte extraction from words should be free in time; it'll cost a few gates.
>    Basically this requires one or more cross-couplings in the memory path.
>
>Yup, number 1 turns out to be true: MIPS R2000s pay no noticeable cycle-time
>penalty for having load-byte, load-byte unsigned, load-half (signed and
>unsigned), and even load-partial-word (left and right) for dealing with
>unaligned operands. It takes some silicon, but it didn't add to the
>critical path. [Don't ask me how this is done, but I assure you it is
>possible.]

Sigh, byte extraction *does* take only a few gates. Silicon area was
never the issue. Depending, I guess greatly, on implementation details,
it is *not* free in time.
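
[For anyone who wants the mechanics spelled out: below is a minimal
sketch, in Python, of what byte extract and byte insert do when they run
as separate steps against a word-addressed memory. The function names,
the big-endian byte numbering, and the dict standing in for memory are
all assumptions made for illustration; none of this is Am29000 or MIPS
code.]

```python
# Illustrative model only. Assumptions: 32-bit words, big-endian byte
# numbering (byte 0 is the most significant), and a dict keyed by word
# address standing in for a word-addressed memory.
WORD_MASK = 0xFFFFFFFF


def extract_byte(word, byte_index, signed=False):
    """Return byte `byte_index` (0..3) of a 32-bit word, optionally sign-extended."""
    shift = (3 - byte_index) * 8
    value = (word >> shift) & 0xFF
    if signed and value & 0x80:
        value -= 0x100
    return value


def insert_byte(word, byte_index, value):
    """Return `word` with byte `byte_index` replaced by the low 8 bits of `value`."""
    shift = (3 - byte_index) * 8
    cleared = word & ~(0xFF << shift) & WORD_MASK
    return cleared | ((value & 0xFF) << shift)


def load_byte(memory, byte_address, signed=False):
    """Two steps on a word-addressed memory: load-word, then extract."""
    word = memory[byte_address >> 2]
    return extract_byte(word, byte_address & 3, signed)


def store_byte(memory, byte_address, value):
    """Three steps on a word-addressed memory: load-word, insert byte, store-word."""
    word_address = byte_address >> 2
    word = memory[word_address]
    memory[word_address] = insert_byte(word, byte_address & 3, value)
```

[With hardware byte support, load_byte and store_byte would each be one
instruction. The disagreement in this thread is over whether folding
that work into the load/store path costs cycle time.]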
I'll try to get our circuit designer(s) to post a comment on this one.
All I can say is that we designed the Am29000 to *start* at 25 MHz and
go from there. There is a critical path in the chip from the pins to the
execute unit. Yes, I know this circuitry will scale as well as any on
the chip, but the set-up and hold time effects may not.

>Now, for some history: Brian earlier cited the 1982 paper "Hardware/Software
>Tradeoffs for Increased Performance" by Hennessy, et al, which argued
>fairly heavily for word-addressed memory with byte extract/insert.
>Now, there are the following facts:
>    I.e., they have changed their minds, at least somewhat.

I don't know how to reconcile the fact that some of the best, most
important computer scientists think word addressing is wrong with the
fact that it seems so right for us. All I can say is that Titan and MIPS
machines have the advantage of being designed as "closed" systems; i.e.,
nearly all (system) details are controllable.

>The remainder of this will deal with some structural reasons why word
>addressing with extract/insert is painful in certain environments,
>followed by a bunch of statistical analyses that describe the performance
>loss MIPS machines would suffer if we did it that way.
>
>A. Structural reasons: (this is mostly systems, maybe not some controllers)
>1) Memory system design:
>    a) Memory system designs with dual-porting of memory, including
>       an I/O bus, usually have to respond to partial-word operations
>       to keep I/O controllers happy. Having done that, it is perfectly
>       feasible to deal with byte writes, either cheaply, with parity,
>       or somewhat more expensively with ECC.

I/O, especially where older chips (serial ports, etc.) are concerned, is
a grungy issue. But I would most *certainly* not slow the memory system
with respect to the processor just for the sake of very infrequent
interchanges with I/O chips.
Notice: an on-chip alignment network for byte extraction/insertion (the
only alignment network implementation that makes sense in the majority
of instances and the one that we are debating here, I am assuming) does
*not* solve this problem: it cannot do the alignment needed by the I/O
devices when they deal directly with memory. Besides, the bulk of
I/O-to-memory transactions are block transfers from disk and tape
devices. Why can't these be word-oriented? For the cheapest systems, it
might be nice to hang the old serial port chip right on the processor
bus, but I don't think you want to buy an Am29000 (or a MIPS chip) just
so you can slow the thing down with stupid system design. Don't get me
wrong, I am not flaming: just trying to point out that I/O is something
to be dealt with separately from the processor-memory channel design.
Dual-ported memory is *not* the only way: how about a DMA chip to do all
the alignment/bus-isolation?

>    b) Some systems use block-oriented buses, often with write-back
>       caches. If the system is doing write-back for you, doing
>           load-word  [causing WB-cache to fetch the cache line, if needed]
>           insert byte
>           store-word
>       VERSUS JUST:
>           store-byte [causing cache line to be fetched, if necessary]
>       sure looks like there is at least a 2-cycle hit, maybe a 3-cycle
>       hit, if you don't have 1-cycle cache accesses.

I think you have a good point here. Caches are nice in that they often
don't have ECC, so byte writing is much more feasible. However, this is
only one possible memory system design. The Am29000 will be interfaced
to many different kinds of memory systems. At 30 MHz and beyond (where
the Am29000 is intended to be), we think word-addressing is beneficial
in many of these environments.

>    c) Some systems use write-thru caches with write-buffers
>       [VAX-780, I assume 8700s, etc, although not 8600/8650].
>       Sometimes the write-buffers gather contiguous bytes, then send
>       a whole word to memory.
>       Again, having code that does lw/insert/sw
>       just adds cycles.

Another good point. Same comments in general.

>2) I/O system design. This is clearly not true of all systems, but
>you run into it often enough. IT IS OFTEN EXCEEDINGLY INCONVENIENT
>TO BE REQUIRED TO FETCH OR STORE A WORD AT A TIME WHEN DEALING WITH
>DEVICE CONTROLLERS. [other stuff]

Agreed, but again, why not solve the problem (with an interface chip or
design approach) instead of propagating it to the processor-memory
channel? I have sympathy for OS people: I was an OS person for just a
short while. The choice between dumb system design and creating problems
for the OS people when they must deal with older chips/boards is a tough
one (really, because the OS is part of the system design too).

>As I read the 29000 specs, maybe it would be possible to use both modes,
>where main memory uses word+insert/extract, and the I/O path has the
>alignment network, and uses the load/store control fields to yield
>partial-word ops. It will be interesting to see how the C compiler
>compiles a device driver that uses both memory and I/O addresses...
>There's probably some way around it, but I do believe that it's more than
>picking up an off-the-shelf controller and its associated driver,
>making a few tweaks, and running it.

Well, we don't have any plans right now for a compiler which would allow
"mixed-mode" memory orientation. More likely, some (a significant
amount? Just a little? In between? I don't know...) assembly language
programming will have to be done. Perhaps the OS guys will start forcing
the hardware guys to design in only coherently-designed peripheral chips
(if they exist) or forcing them to design hardware to hide the problems
(is this possible? In some cases it is).

>B. Performance reasons.
>Domain: running UNIX and UNIX programs well.
>
>1. Some qualitative observations:
>
>When I was at CT, I spent a bunch of time tuning 68K C compilers.
>In particular, I looked at the prevalence of code like:
>       move byte to register, extend, extend
>       move byte to register, and with 255 to get byte alone OR
>       clear register, move byte to register
>I was able to get noticeable improvements in at least some programs
>by optimizing away some of the unnecessary cases. It was sure clear that
>a lot of cycles were burnt by the extends, or the and/clear, i.e.,
>one really wished for load byte [signed or unsigned].

Sigh, please don't tell me about how a *vastly* different processor with
*vastly* different time/instruction tradeoffs behaved. I believe every
word with respect to the 68K, and it would be naive of me to say that
there is *nothing* valuable to be learned from your experience in that
experiment. But to say that the results of that experiment have binding
implications for a processor like the Am29000 (and I am tempted to say
the MIPS, but I am certainly not qualified to do so) seems just wrong to
me.

>If simulations are based only on user-level programs, you can get
>some horrible surprises when you see what UNIX kernels do. For example,
>are halfword operations really necessary?
>ANS: not if you look at their frequency in most UNIX C programs.
>ANS: if you look at kernel: you bet! many kernel structures are packed
>for efficiency, some are packed for necessity (you should see the pile
>of halfword operations in Ethernet code... and you CANNOT sanely get
>rid of them without rewriting everything).

I am sure that you are right; I really can't speak too well from
experience. The fact that we were simply ill-equipped to do kernel-level
simulations was one of our biggest weaknesses. But again, even in light
of the fact that the kernel does lots of sub-word size stuff, does this
really mean that the Am29000 should assume a byte-oriented/half-word
oriented memory?

>2. Some quantitative observations.
>As most people in this newsgroup
>know, we do a huge amount of simulation on very large programs
>to analyze performance, look at different board designs and future
>chip tradeoffs. We get complete instruction traces, so we get outputs
>that look like:
>Summary:

Wow, our simulation output looks much the same, with some of the numbers
being represented differently. Great minds think alike. :-)

>Thus, we have really precise statistics on what's going on, at least on
>our machines, at the user-level, for anything from typical UNIX programs
>(like nroff), to large simulators [spice, espresso],
>parts of the compiler system [assembler, optimizer, debugger],
>to benchmarks like whetstone, dhrystone, linpack.

Sigh, I wish we could do such simulations.

>I think one can find a gross cost [to us, in our architecture, no
>necessary applicability to others] in user programs, as follows,
>if we had done byte extract/insert, instead of what we did:
>
>For each partial-word load, add 1 cycle. (for the extract)
>For each partial-word store, add 2+N cycles (where you have a load,
>insert added, and where N (might be 0) is the extra actual cycle cost
>to get data from the cache, noting that some of the cost might be
>taken care of by pipeline scheduling).

This seems valid, at first glance, for your situation. But it is not
directly applicable to the Am29000 because there is a *cost* associated
with on-chip byte support. Thus, you gain some, you lose some. We see
about twice as many loads as stores. Plus, the stack cache decreases the
load/store percentage overall with respect to a machine (like the MIPS)
with "only" 32 fixed registers. We seem to have about half as many
loads/stores, but it varies (and my compiler ain't the best, e.g. no
register coloring for memory-resident stuff).
This lower load/store percentage might be another reason that word
orientation is more appropriate for the Am29000. (But note that a given
system need not implement a stack cache in the local register file;
register banking for fast context switching may be a better use of the
registers. In that case, the load/store percentage will go back up and
bets are off. Sigh, what's a computer architect to do?)

>So here are a few examples: I'll give the % of instruction cycles
>for each instruction, and compute the penalty by using N=0.
>I'll ignore numbers too small to matter much.
>
>as1 (assembler 1st pass)
>opcode  %       penalty (dynamic)
>lbu     4.6%    4.6%
>sb      1.5%    3.0%
>lh      0.27%   0.27%
>lhu     0.07%   0.07%
>sh      0.02%   0.02%
>TOT     8% penalty in instruction cycles, assuming N=0 (best case)

This is OK assuming that byte/halfword alignment costs nothing. Again, I
am just drawing attention to this missing side of the argument.

>There is also a static code-size penalty, I'll only do one since I don't
>think this is a major issue, but it is interesting;
>opcode  %       penalty (static)
>lbu     4.7%    4.7%
>sb      3.2%    6.4%
>lh      0.27%   0.27%
>sh      0.14%   0.28%
>TOT     11.6%

Unquestionably there is a code size penalty. This may or may not be an
issue given ROM/RAM constraints in some environments.

>Note the significance of the static numbers: the byte operations are all over
>the place, i.e., the dynamic counts aren't substantial just because they're
>in strcpy or something like that [actually we have tuned routines anyway],
>but because there's partial word code all over the place.

You are so right in pointing out that there is partial word code
sprinkled throughout many existing applications. As an after-the-fact
observation, I guess that many Am29000 applications will be running
"new" code. Now, whether or not the coders will know the right things to
do (use the fast library routines, etc.) is not knowable but nonetheless
critical.
I guess that means that we need to print some sort of "Am29000
Programming Suggestions."

>
>Now, this is an ultra-simplistic analysis, because there are things like:
>write-buffer effects, cache effects, memory system interference,
>pipeline scheduling, etc, etc. Consider this a first approximation.
>
>Now, a few more examples:
>
>Dhrystone:
>lbu     6.9%    6.9%
>lb      4.7%    4.7%
>lwl     1.2%    1.2%    (unaligned word stuff)
>lwr     1.2%    1.2%
>sb      0.43%   0.8%
>swl     0.14%   0.3%
>TOT     14.1%

But, just a few lines later you'll point out how having a word-oriented
processor-memory channel *helps* Dhrystone performance (artificially,
since Dhrystone is artificial). I'm sorry, but you must stick to one
argument. :-)

>(This has nothing to do with word-vs-byte, but I ran across it in
>looking at these numbers).
>QUIZ: how many load/stores use 0-displacements off the base register,
>rather than non-zero ones?
>
>ANS: a few were around 50%.
>     most were in the 10-20% range.
>     some were down in the 5-10% range.
>     Dhrystone: 50%
>I.e., Dhrystone uses zero-offset addressing considerably more than
>most programs, although not more than all programs. [Relevant to 29000
>discussion, if you remember how they did things.]

Just in case you are trying to make a subtle intimation: WE DID NOT
"OPTIMIZE" THE AM29000 ARCHITECTURE FOR ANY PARTICULAR PROGRAM. The
architecture was pretty much fixed before we had significant simulation
results (I know, I know; that was the wrong way to do things, but we had
no choice). We *did* add the now-infamous compare-bytes instruction very
late (after we had simulation results). I wanted the load/store
instructions to have only register-indirect addressing mode from the
beginning, but only for the sake of simplicity and optimization
opportunities. In the end, we realized that we had done a great thing:
as far as normal instruction execution is concerned, there cannot be
contention between jump and load/store instructions for the TLB.
With our pipeline, an addressing mode would have been a minor disaster.

>WHEW. That was a lot of info. Sorry about that, but architectural
>arguments cannot be settled by intuition. Note again that these are
>the numbers we get, and you cannot analyze choices in a vacuum,
>so they may or may not be relevant to other architectures and software.

Yes; this is an important point. Rarely, if ever, does a team implement
in the same technology two versions of a processor with just one
variable (e.g. byte alignment/no byte alignment) changed. That would be
nice.

>In our case, this does say:
>    a) Byte instructions are a substantial win on many real programs.
>    b) Non-zero offsets are frequently-used.

(But less frequently when there is a stack cache.)

>and finally, for everybody:
>    c) Be very, very careful on WHICH benchmarks you use to tune
>       your architecture. DON'T use Dhrystone.

This is good advice. Thanks, John, for taking the time to post.

    bcase
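
P.S. For anyone who wants to play with the quoted cost model (one extra
cycle per partial-word load, 2+N extra cycles per partial-word store),
here is a minimal sketch in Python. The function name, the dict layout,
and the load/store tagging of each opcode are my own assumptions for
illustration; only the percentages come from the as1 dynamic table
quoted above. The posted tables are rounded and not every row follows
the formula exactly, so expect agreement with the quoted totals only to
within a fraction of a percent.

```python
def extract_insert_penalty(mix, extra_store_cycles=0):
    """Estimate added instruction cycles, in percent, for extract/insert.

    `mix` maps opcode -> (percent of instruction cycles, "load" or "store").
    Each partial-word load costs 1 extra cycle (the extract); each
    partial-word store costs 2 + N extra cycles (load-word plus insert,
    plus N = `extra_store_cycles` for the added cache access).
    """
    total = 0.0
    for percent, kind in mix.values():
        if kind == "load":
            total += percent * 1
        elif kind == "store":
            total += percent * (2 + extra_store_cycles)
        else:
            raise ValueError(f"unknown access kind: {kind!r}")
    return total


# Percentages from the quoted as1 (assembler 1st pass) dynamic table.
# The "load"/"store" labels are an assumption based on the opcode names.
as1_dynamic = {
    "lbu": (4.6, "load"),
    "sb": (1.5, "store"),
    "lh": (0.27, "load"),
    "lhu": (0.07, "load"),
    "sh": (0.02, "store"),
}

best_case = extract_insert_penalty(as1_dynamic)                        # N = 0
one_extra = extract_insert_penalty(as1_dynamic, extra_store_cycles=1)  # N = 1
```

Note that this only models one side of the ledger. It says nothing about
any cycle-time cost of putting the alignment network on chip, which is
the side of the argument I keep pointing at.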