Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!seismo!gatech!amdcad!bcase
From: bcase@amdcad.UUCP
Newsgroups: comp.arch
Subject: Word vs. Byte Orientation
Message-ID: <16122@amdcad.AMD.COM>
Date: Mon, 13-Apr-87 15:54:25 EST
Article-I.D.: amdcad.16122
Posted: Mon Apr 13 15:54:25 1987
Date-Received: Wed, 15-Apr-87 00:56:41 EST
Organization: Advanced Micro Devices, Inc., Sunnyvale, Ca.
Lines: 326

In article <279@winchester.mips.UUCP> mash@winchester.UUCP (John Mashey) writes:
>In article <16038@amdcad.AMD.COM> bcase@amdcad.AMD.COM (Brian Case) writes:
>Well, I've been staying away from this, but since you asked, OK.
>But Brian, I'm sad to say you may not be happy when you're finished
>reading this.

Well, I guess I am not overjoyed or anything, but if you meant "I think you
will regret designing the Am29000 with the emphasis on word-addressing"
you are wrong.  If you meant "You'll be sorry you asked", never!  Let's
have some lively discussions!  I'll try to make my points with as few words
as possible.

>First, (Geoff Steckel) <466@alliant.UUCP> posted a pretty good overall
>analysis of the issues, so I won't repeat that, except:
>"Re bus width, byte extraction, unaligned operands, and memory speed:
>  1) Byte extraction from words should be free in time; it'll cost a few gates.
>	Basically this requires one or more cross-couplings in the memory path.
>
>Yup, number 1 turns out to be true: MIPS R2000s pay no noticeable cycle-time
>penalty for having load-byte, load-byte unsigned, load-half (signed and
>unsigned), and even load-partial-word (left and right) for dealing with
>unaligned operands.  It takes some silicon, but it didn't add to the
>critical path. [Don't ask me how this is done, but I assure you it is
>possible.]

Sigh, byte extraction *does* take only a few gates.  Silicon area was
never the issue.  Depending, I guess greatly, on implementation details,
it is *not* free in time.  I'll try to get our circuit designer(s) to post
a comment on this one.  All I can say is that we designed the Am29000 to
*start* at 25 MHz and go from there.  There is a critical path in the chip
from the pins to the execute unit.  Yes, I know this circuitry will scale
as well as any on the chip, but the set-up and hold time effects may not.
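
For readers who want the operation spelled out, here is a minimal C sketch
of what "byte extraction from a word" amounts to when done in software.
The function names and the byte numbering (byte 0 in the low-order bits)
are assumptions made for illustration only; they are not Am29000 or MIPS
instructions, and the sketch says nothing about how the gates are timed.

```c
#include <assert.h>
#include <stdint.h>

/* Return byte `lane` (0..3) of `word`, zero-extended to 32 bits.
   Assumption: lane 0 is the low-order byte. */
uint32_t extract_byte_unsigned(uint32_t word, unsigned lane)
{
    assert(lane < 4);
    return (word >> (lane * 8)) & 0xFFu;
}

/* Same byte, sign-extended to 32 bits.  The xor/subtract form avoids
   relying on implementation-defined narrowing conversions. */
int32_t extract_byte_signed(uint32_t word, unsigned lane)
{
    uint32_t b = extract_byte_unsigned(word, lane);
    return (int32_t)(b ^ 0x80u) - 0x80;
}
```

On a word-oriented machine each of these is an extra operation after the
load; with an on-chip alignment network the same shifting and masking
sits in the load path instead.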

>Now, for some history: Brian earlier cited the 1982 paper "Hardware/Software
>Tradeoffs for Increased Performance" by Hennessy, et al, which argued
>fairly heavily for word-addressed memory with byte extract/insert.
>Now, there are the following facts:
>	I.e., they have changed their minds, at least somewhat.

I don't know how to reconcile the facts that some of the best, most
important computer scientists think word addressing is wrong with the fact
that it seems so right for us.  All I can say is that Titan and MIPS
machines have the advantage of being designed as a "closed" system; i.e.
nearly all (system) details are controllable.

>The remainder of this will deal with some structural reasons why word
>addressing with extract/insert is painful in certain environments,
>followed by a bunch of statistical analyses that describe the performance
>loss MIPS machines would suffer if we did it that way.
>
>A. Structural reasons: (this is mostly systems, maybe not some controllers)
>1) Memory system design:
>	a) Memory system designs with dual-porting of memory, including
>	an I/O bus, usually have to respond to partial-word operations
>	to keep I/O controllers happy.  Having done that, it is perfectly
>	feasible to deal with byte writes, either cheaply, with parity,
>	or somewhat more expensively with ECC.

I/O, especially where older chips (serial ports, etc.) are concerned, is
a grungy issue.  But I would most *certainly* not slow the memory system
with respect to the processor just for the sake of very infrequent
interchanges with I/O chips.  Notice:  an on-chip alignment network for
byte extraction/insertion (the only alignment network implementation that
makes sense in the majority of instances and the one that we are debating
here, I am assuming) does *not* solve this problem (it cannot do the
alignment needed by the I/O devices when they deal directly with memory).
Besides, the bulk of I/O-to-memory transactions are block transfers from
disk and tape devices.  Why can't these be word-oriented?  For the cheapest systems,
it might be nice to hang the old serial port chip right on the processor
bus, but I don't think you want to buy an Am29000 (or a MIPS chip) just
so you can slow the thing down with stupid system design.  Don't get me
wrong, I am not flaming:  just trying to point out that I/O is something
to be dealt with separately from the processor-memory channel design.
Dual-ported memory is *not* the only way:  how about a DMA chip to do
all the alignment/bus-isolation?

>	b) Some systems use block-oriented buses, often with write-back
>	caches.  If the system is doing write-back for you, doing
>	load-word [causing WB-cache to fetch the cache line, if needed]
>	insert byte
>	store-word
>	VERSUS JUST:
>	store-byte [causing cache line to be fetched, if necessary]
>	sure looks like there is at least a 2-cycle hit, maybe a 3-cycle
>	hit, if you don't have 1-cycle cache accesses.

I think you have a good point here.  Caches are nice in that they often
don't have ECC so byte writing is much more feasible.  However, this is
only one possible memory system design.  The Am29000 will be interfaced
to many different kinds of memory systems.  At 30 MHz and beyond (where
the Am29000 is intended to be), word-addressing is thought by us to be
beneficial in many of these environments.
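
To make the comparison concrete, here is a minimal C sketch of the
load-word / insert byte / store-word sequence quoted above.  The array
`mem`, the function names, and the byte numbering (byte 0 in the
low-order bits) are all hypothetical, chosen for illustration; this is
not Am29000 code, just the read-modify-write pattern a word-oriented
memory forces on a byte store.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical word-addressed memory: only whole 32-bit words can be
   read or written. */
#define MEM_WORDS 16
uint32_t mem[MEM_WORDS];

/* Insert: replace byte `lane` (0..3) of `word` with `b`.
   Assumption: lane 0 is the low-order byte. */
uint32_t insert_byte(uint32_t word, unsigned lane, uint8_t b)
{
    assert(lane < 4);
    unsigned shift = lane * 8;
    return (word & ~((uint32_t)0xFF << shift)) | ((uint32_t)b << shift);
}

/* Store one byte at byte address `addr` using only word operations. */
void store_byte_via_word(uint32_t addr, uint8_t b)
{
    assert((addr >> 2) < MEM_WORDS);
    uint32_t w = mem[addr >> 2];        /* load-word   */
    w = insert_byte(w, addr & 3, b);    /* insert byte */
    mem[addr >> 2] = w;                 /* store-word  */
}
```

A machine with a store-byte instruction does the same job in one
memory operation, which is where the quoted 2-cycle (or more) hit
comes from.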

>	c) Some systems use write-thru caches with write-buffers
>	[VAX-780, I assume 8700s, etc, although not 8600/8650].
>	Sometimes the write-buffers gather contiguous bytes, then send
>	a whole word to memory. Again, having code that does lw/insert/sw
>	just adds cycles.

Another good point.  Same comments in general.

>2) I/O system design.  This is clearly not true of all systems, but
>you run into it often enough. IT IS OFTEN EXCEEDINGLY INCONVENIENT
>TO BE REQUIRED TO FETCH OR STORE A WORD AT A TIME WHEN DEALING WITH
>DEVICE CONTROLLERS.  [other stuff]

Agreed, but again, why not solve the problem (with an interface chip
or design approach) instead of propagating it to the processor-memory
channel?  I have sympathy for OS people:  I was an OS person for just
a short while.  The choice between dumb system design and creating
problems for the OS people when they must deal with older chips/boards
is a tough one (really, because:  the OS is part of the system design
too).

>As I read the 29000 specs, maybe it would be possible to use both modes,
>where main memory uses word+insert/extract, and the I/O path has the
>alignment network, and uses the load/store control fields to yield
>partial-word ops.  It will be interesting to see how the C compiler
>compiles a device driver that uses both memory and I/O addresses...
>There's probably some way around it, but I do believe that it's more than
>picking up an off-the-shelf controller and its associated driver,
>making a few tweaks, and running it.

Well, we don't have any plans right now for a compiler which would
allow "mixed-mode" memory orientation.  More likely, some (a significant
amount?  Just a little?  In between?  I don't know...) assembly language
programming will have to be done.  Perhaps the OS guys will start forcing
the hardware guys to design-in only coherently-designed peripheral chips
(if they exist) or forcing them to design hardware to hide (is this
possible? in some cases it is) the problems.

>B. Performance reasons.
>Domain: running UNIX and UNIX programs well.
>
>1. Some qualitative observations:
>
>When I was at CT, I spent a bunch of time tuning 68K C compilers.
>In particular, I looked at the prevalence of code like:
>	move byte to register, extend, extend
>	move byte to register, and with 255 to get byte alone OR
>	clear register, move byte to register
>I was able to get noticeable improvements in at least some programs
>by optimizing away some of the unnecessary cases.  It was sure clear that
>a lot of cycles were burnt by the extends, or the and/clear, i.e.,
>one really wished for load byte [signed or unsigned].

Sigh, please don't tell me about how a *vastly* different processor with
*vastly* different time/instruction tradeoffs behaved.  I believe every
word with respect to the 68K, and it would be naive of me to say that
there is *nothing* valuable to be learned from your experience in that
experiment.  But to say that the results of that experiment have binding
implications for a processor like the Am29000 (and I am tempted to say
the MIPS, but I am certainly not qualified to do so) seems just wrong to
me.

>If simulations are based only on user-level programs, you can get
>some horrible surprises when you see what UNIX kernels do.  For example,
>are halfword operations really necessary?
>ANS: not if you look at their frequency in most UNIX C programs.
>ANS: if you look at kernel: you bet! many kernel structures are packed
>for efficiency, some are packed for necessity (you should see the pile
>of halfword operations in Ethernet code... and you CANNOT sanely get
>rid of them without rewriting everything).

I am sure that you are right; I really can't speak too well from experience.
The fact that we were simply ill-equipped to do kernel-level simulations
was one of our biggest weaknesses.  But again, even in light of the
fact that the kernel does lots of sub-word size stuff, does this really
mean that the Am29000 should assume a byte-oriented/half-word oriented
memory?

>2. Some quantitative observations.  As most people in this newsgroup
>know, we do a huge amount of simulation on very large programs
>to analyze performance, look at different board designs and future
>chip tradeoffs.  We get complete instruction traces, so we get outputs
>that look like:
>Summary:

Wow, our simulation output looks much the same, with some of the numbers
being represented differently.  Great minds think alike.  :-)

>Thus, we have really precise statistics on what's going on, at least on
>our machines, at the user-level, for anything from typical UNIX programs
>(like nroff), to large simulators [spice, espresso],
>parts of the compiler system [assembler, optimizer, debugger],
>to benchmarks like whetstone, dhrystone, linpack.

Sigh, I wish we could do such simulations.

>I think one can find a gross cost [to us, in our architecture, no
>necessary applicability to others] in user programs, as follows,
>if we had done byte extract/insert, instead of what we did:
>
>For each partial-word load, add 1 cycle. (for the extract)
>For each partial-word store, add 2+N cycles (where you have a load,
>insert added, and where N (might be 0) is the extra actual cycle cost
>to get data from the cache, noting that some of the cost might be
>taken care of by pipeline scheduling).

This seems valid, at first glance, for your situation.  But it is not
directly applicable to the Am29000 because there is a *cost* associated
with on-chip byte support.  Thus, you gain some, you lose some.  We
see about twice as many loads as stores.  Plus, the stack cache decreases
the load/store percentage overall with respect to a machine (like the
MIPS) with "only" 32 fixed registers.  We seem to have about half as
many loads/stores, but it varies (and my compiler ain't the best, e.g.
no register coloring for memory-resident stuff).  This lower load/store
percentage might be another reason that word orientation is more appropriate
for the Am29000 (but note that a given system need not implement a stack
cache in the local register file (register banking for fast context
switching may be a better use of the registers); in this case, the load/
store percentage will go back up and bets are off; Sigh, what's a
computer architect to do?).

>So here are a few examples: I'll give the % of instruction cycles
>for each instruction, and compute the penalty by using N=0.
>I'll ignore numbers too small to matter much.
>
>as1 (assembler 1st pass)
>opcode	%	penalty (dynamic)
>lbu	4.6%	4.6%
>sb	1.5%	3.0%
>lh	0.27%	0.27%
>lhu	0.07%	0.07%
>sh	0.02%	0.02%
>TOT		8%  penalty in instruction cycles, assuming N=0 (best case)

This is OK assuming that byte/halfword alignment costs nothing.  Again, I
am just drawing attention to this missing side of the argument.
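
For anyone who wants to redo the arithmetic, here is a small C sketch of
the quoted cost rule (1 extra cycle per partial-word load, 2+N per
partial-word store) applied to the as1 dynamic numbers above.  The struct,
the function name, and the array are hypothetical, made up for this
illustration.  It covers only the side of the ledger given in the table;
any cycle-time cost of on-chip alignment is not modeled.

```c
#include <assert.h>

/* One row of an instruction-mix table. */
struct mix_row {
    const char *opcode;
    double pct;      /* % of instruction cycles */
    int is_store;    /* 1 = partial-word store, 0 = partial-word load */
};

/* as1 dynamic frequencies, copied from the quoted table. */
const struct mix_row as1_dynamic[] = {
    { "lbu", 4.6,  0 },
    { "sb",  1.5,  1 },
    { "lh",  0.27, 0 },
    { "lhu", 0.07, 0 },
    { "sh",  0.02, 1 },
};

/* Penalty in % of instruction cycles: each load costs 1 extra cycle,
   each store costs 2 + n_extra extra cycles. */
double penalty_pct(const struct mix_row *rows, int n_rows, int n_extra)
{
    assert(n_rows >= 0 && n_extra >= 0);
    double total = 0.0;
    for (int i = 0; i < n_rows; i++) {
        int cycles = rows[i].is_store ? 2 + n_extra : 1;
        total += rows[i].pct * cycles;
    }
    return total;
}
```

Calling penalty_pct(as1_dynamic, 5, 0) gives roughly 8%, matching the
quoted total.  Applied strictly, the rule charges sh 0.04% rather than the
0.02% shown in the table, which does not change the rounded result.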

>There is also a static code-size penalty, I'll only do one since I don't
>think this is a major issue, but it is interesting;
>opcode	%	penalty (static)
>lbu	4.7%	4.7%
>sb	3.2%	6.4%
>lh	0.27%	0.27%
>sh	0.14%	0.28%
>TOT		11.6%

Unquestionably there is a code size penalty.  This may or may not be an
issue given ROM/RAM constraints in some environments.

>Note the significance of the static numbers: the byte operations are all over
>the place, i.e., the dynamic counts aren't substantial just because they're
>in strcpy or something like that [actually we have tuned routines anyway],
>but because there's partial word code all over the place.

You are so right in pointing out that there is partial word code sprinkled
throughout many existing applications.  As an after-the-fact observation,
I guess that many Am29000 applications will be running "new" code.  Now,
whether or not the coders will know the right things to do (use the fast
library routines, etc.) is not knowable but nonetheless critical.  I guess
that means that we need to print some sort of "Am29000 Programming
Suggestions."

>
>Now, this is an ultra-simplistic analysis, because there are things like:
>write-buffer effects, cache effects, memory system interference,
>pipeline scheduling, etc, etc.  Consider this a first approximation.
>
>Now, a few more examples:
>
>Dhrystone:
>lbu	6.9%	6.9%
>lb	4.7%	4.7%
>lwl	1.2%	1.2%	(unaligned word stuff)
>lwr	1.2%	1.2%
>sb	0.43%	0.8%
>swl	0.14%	0.3%
>TOT		14.1%

But, just a few lines later you'll point out how having a word-oriented
processor-memory channel *helps* (artificially, since dhrystone is
artificial) dhrystone performance.  I'm sorry, but you must stick
to one argument. :-)

>(This has nothing to do with word-vs-byte, but I ran across it in
>looking at these numbers).
>QUIZ: how many load/stores use 0-displacements off the base register,
>rather than non-zero ones?
>
>ANS: a few were around 50%.
>	most were in the 10-20% range.
>	some were down in the 5-10% range.
>	Dhrystone: 50%
>I.e., Dhrystone uses zero-offset addressing considerably more than
>most programs, although not more than all programs. [Relevant to 29000
>discussion, if you remember how they did things.]

Just in case you are trying to make a subtle intimation:  WE DID NOT
"OPTIMIZE" THE AM29000 ARCHITECTURE FOR ANY PARTICULAR PROGRAM.  The
architecture was pretty much fixed before we had significant simulation
results (I know, I know; that was the wrong way to do things, but we
had no choice).  We *did* add the now-infamous compare-bytes instruction
very late (after we had simulation results).  I wanted the load/store
instructions to have only register-indirect addressing mode from the
beginning, but only for the sake of simplicity and optimization
opportunities.  In the end, we realized that we had done a great thing:
As far as normal instruction execution is concerned, there cannot be
contention between jump and load/store instructions for the TLB.  With
our pipeline, an addressing mode would have been a minor disaster.

>WHEW.  That was a lot of info.  Sorry about that, but architectural
>arguments cannot be settled by intuition.  Note again that these are
>the numbers we get, and you cannot analyze choices in a vacuum,
>so they may or may not be relevant to other architectures and software.

Yes; this is an important point.  Rarely, if ever, does a team implement
in the same technology two versions of a processor with just one variable
(e.g. byte alignment/no byte alignment) changed.  That would be nice.

>In our case, this does say:
>	a) Byte instructions are a substantial win on many real programs.
>	b) Non-zero offsets are frequently-used.

(But less frequently when there is a stack cache.)

>and finally, for everybody:
>	c) Be very, very careful on WHICH benchmarks you use to tune
>	your architecture.  DON'T use Dhrystone.

This is good advice.

Thanks, John, for taking the time to post.

    bcase