Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!agate!helios.ee.lbl.gov!ncis.llnl.gov!lll-winken!uunet!portal!cup.portal.com!bcase
From: bcase@cup.portal.com (Brian bcase Case)
Newsgroups: comp.arch
Subject: Re: How to use silicon (was Re: Intel/MIPS Dhrystone ratio)
Message-ID: <16058@cup.portal.com>
Date: 20 Mar 89 23:48:59 GMT
References: <37196@bbn.COM> <1989Mar16.190043.23227@utzoo.uucp> <24889@amdcad.AMD.COM> <355@bnr-fos.UUCP>
Organization: The Portal System (TM)
Lines: 107

>..., but I would doubt that high performances chips of
>the future (near future) will worry about another adder. Adding more
>ports to register files is a challenge for the silicon groups, but, hay,
>they got to earn their money too!

Right, another adder is not a problem.  Right, more ports on register
files is not too much of a problem, but is adding a port just so that
auto-increment goes fast the right thing?  I say no.  Keep reading.

>Caches (especially Data-cache) top out very easily. For many Unix-box
>type application, we have already reached the point of vastly-diminished
>return. Adding more cache won't bring your performance up.  (Or I could
>talk about our application where the diminishing return starts *before*
>you start adding cache.)

Maybe so, but I think that we have a little farther to go than 4K
instruction and 8K data, virtual no less.  Even if we have 2 million
transistors, I don't think the caches are going to be "too big."

>It is precisely because the CPU is running faster than memory (even 
>cached memory) that one have to maximize the amount of work done in
>each memory cycle.

"Running faster than memory" is a very misleading statement.  Maybe
latencies are a problem, but bandwidth can be had in abundance.  The
problem is what to do with it, not how to get it!

>Adding address modes does not mean a difficult machine to pipeline or
>to design. Don't assume that all architecture with many addressing modes
>will be as messy as the VAX instruction encoding.

It's true that not all architectures with many addressing modes will be as
messy as the VAX.  However, you simply have to answer the question:  are
those addressing modes, beyond register+register and register+offset,
really buying you anything?  (BTW, the lack of even those doesn't seem
to cripple the 29K too much....  If you dislike the 29K because of its
lack of addressing modes, blame me.  :-) )

>With enough gates, the CPU will get more functional units, this means
>small RISC instructions will not be able to keep all the functional
>units busy. The i860 solves this by essentially have a short VLIW mode

I thought RISC instructions were too big!  :-) :-)  Note that the i860's
dual-instruction mode is essentially a VLIW mode.

>(hmmm, Short Very Long?). It is also possible to have bigger instructions
>that keep more units busy longer. Please note that different implementations
>of the architecture can have a super-fast version that does all instructions
>in single clock (at least in dispatching of instruction) and also have a 
>cheap version that is (here it comes:) *micro-coded* or whatever.

I claim a much better use of multiple functional units is to execute many
"small" RISC instructions at the same time, i.e., "superscalar" or
multiple instructions per clock.  It just doesn't make sense to bundle
(bind is a better word) many operations into one instruction.  Doing so
simply thwarts compiler optimization, because the compiler can no longer
schedule or eliminate the bound operations separately.  Addressing modes
are probably the worst form of semantic binding, in my opinion.  So, if
we are going to have "too many transistors," we should use them to
realize a superscalar *implementation*, not a complex *architecture*.

>Even now, there are real money issues in memory alignments. If you have
>a system with a 100 MegaBytes main memory and "correct" alignment makes 
>it 150 MB, you have just made the system 50% more expensive. Or how about
>alignment bumps your memory requirement from 63K to 65K causing extra chips
>and possible board layout problems (not to mention the cost)?

If you can't afford to go to 150 Mbytes (or more likely, paging) or you
can't afford to go to 65K of RAM (try getting such a small amount, I 
challenge you), then performance must not be the most important thing.
By all means, then, you should allow un-naturally aligned data and you
can handle it in hardware or software, as you wish.  If performance is
your first priority, which it pretty much is in everything but the
cheapest embedded systems (which also account for the highest volume!),
then you *don't* want to allow un-natural alignment.
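
To put rough numbers on the memory side of this trade-off, here is a
small C sketch that compares the size of a record when its fields are
packed tightly with its size when each field sits on its natural
boundary.  The function names (align_up, packed_size, aligned_size) are
made up for illustration, and the sketch assumes that every field size
is a power of two and that "natural alignment" means alignment to the
field's own size.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align (a power of two). */
size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Bytes needed when the fields are packed with no padding at all. */
size_t packed_size(const size_t *field_sizes, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += field_sizes[i];
    return total;
}

/* Bytes needed when each field starts at a multiple of its own size
   and the whole record is padded to a multiple of its largest field. */
size_t aligned_size(const size_t *field_sizes, size_t n)
{
    size_t offset = 0, max_align = 1;
    for (size_t i = 0; i < n; i++) {
        offset = align_up(offset, field_sizes[i]) + field_sizes[i];
        if (field_sizes[i] > max_align)
            max_align = field_sizes[i];
    }
    return align_up(offset, max_align);
}
```

For fields of 1, 4, 1, and 8 bytes in that order, the packed record
takes 14 bytes and the naturally aligned one takes 24.  Sorting the
same fields largest first (8, 4, 1, 1) brings the aligned size down to
16, so much of the padding cost depends on how the data is laid out.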

>Having the H/W be tolerant of alignment means a lot of flexibility in the
>design trade-off.

?????

>Also, with more and more gates on a chip, it is conceivable that someone 
>will put together a cache that can handle misalignment in the cache, as
>long as the whole data item is in the same line. I.e., data can cross

The problem is not implementing hardware handling of misalignment; the
problem is the performance implication.  A mis-aligned load/store takes
two accesses; a good compiler or programmer will know this and align the
access whether the hardware can handle it or not.  So what's the point
of having the hardware?  If data must be packed as tightly into memory
as possible, then fine, but you must know that you are giving up
performance.  At this point, performance no longer is the first priority,
so handling it in software is probably acceptable (with simple primitives
like those of the MIPS processor, e.g.).
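
As an illustration of the "two accesses" point, here is a minimal C
sketch of a software misaligned load built from aligned word accesses
only.  It is not the MIPS mechanism itself (MIPS provides the lwl/lwr
instruction pair for this); the function name load32_any, the access
counter, and the little-endian byte numbering within a word are all
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load a 32-bit value starting at any byte address, using only aligned
   32-bit accesses into 'words'.  Byte k of a word is taken to be bits
   8k..8k+7 (little-endian numbering, an assumption of this sketch).
   '*accesses' is incremented once per aligned word actually read.
   For a misaligned address, words[byte_addr / 4 + 1] must exist. */
uint32_t load32_any(const uint32_t *words, size_t byte_addr,
                    unsigned *accesses)
{
    assert(words != NULL && accesses != NULL);

    size_t   idx   = byte_addr / 4;
    unsigned shift = (unsigned)(byte_addr % 4) * 8;

    uint32_t lo = words[idx];          /* first aligned access */
    *accesses += 1;
    if (shift == 0)
        return lo;                     /* aligned: one access is enough */

    uint32_t hi = words[idx + 1];      /* second aligned access */
    *accesses += 1;
    return (lo >> shift) | (hi << (32 - shift));
}
```

With memory words 0x44332211 and 0x88776655, a load at byte address 0
costs one access, while a load at byte address 1 returns 0x55443322 and
costs two.  The merge is the same whether hardware or software does it;
the second access is what costs the time.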

>word boundary with little or no penalty, but crossing line boundary will
>be slow or disallowed.  With the trend to wider buses (i.e., wider line
>size), this may well make the performance penalty of misalignment
>negligible.

I'm not sure how feasible it is to force the compiler/programmer to know
whether or not data is going to cross a cache line.  Things like
dynamically-allocated data structures might be a problem; this would
take some thought.  But, without much thought needed, I do know that the
hardware needed to permit misaligned access within a cache line is likely
to make cache access slower.  Since cache access is probably the limiting
stage in integer pipelines (perhaps not in FP pipelines, though it could
be there too), this is not a good idea.
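
For what it's worth, the test for "does this access cross a cache line"
is trivial once the address is known; the hard part is that the address
often isn't known until run time.  A minimal C sketch follows, with a
hypothetical function name (crosses_line) and the assumption that the
line size is a power of two:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if an access of 'size' bytes starting at 'addr' touches
   more than one cache line of 'line_size' bytes, else 0.
   'line_size' must be a power of two and 'size' must be nonzero. */
int crosses_line(uintptr_t addr, size_t size, size_t line_size)
{
    assert(size != 0);
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);

    uintptr_t mask  = ~((uintptr_t)line_size - 1);
    uintptr_t first = addr & mask;               /* line of first byte */
    uintptr_t last  = (addr + size - 1) & mask;  /* line of last byte  */
    return first != last;
}
```

With 32-byte lines, a 4-byte access at address 28 stays within one
line, while the same access at address 30 spills into the next.  For a
pointer handed back by a dynamic allocator, the compiler cannot evaluate
this at compile time unless the allocator guarantees some alignment.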