Path: utzoo!utgpu!bnr-vpa!bnr-fos!bnr-public!schow
From: schow@bnr-public.uucp (Stanley Chow)
Newsgroups: comp.arch
Subject: How to use silicon (was Re: Intel/MIPS Dhrystone ratio)
Summary: VAX ^= CISC
Keywords: Complexity, speed, risc, cisc
Message-ID: <355@bnr-fos.UUCP>
Date: 19 Mar 89 14:57:57 GMT
References: <37196@bbn.COM> <1989Mar16.190043.23227@utzoo.uucp> <24889@amdcad.AMD.COM>
Sender: news@bnr-fos.UUCP
Reply-To: schow@bnr-public.UUCP (Stanley Chow)
Organization: Bell-Northern Research, Ottawa, Canada
Lines: 126

In a thread discussing what to do with all the transistor sites
becoming available on big chips, various suggestions about what is
good and what is not have been brought up.

It is not often that I disagree with the likes of Tim Olson and
Henry Spencer, so I make a big splash when it happens. (It's
okay, I have my asbestos suit on.)

In article <24889@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <1989Mar16.190043.23227@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>| In article <37196@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>| >I predict that the next hardware features to come back will be
>| >auto-increment addressing and the hardware handling of unaligned data.
>| 
>| Again, why?  Auto-increment addressing is useful only if instructions
>| are expensive, because it sneaks two instructions into one.  However,
>| the trend today is just the opposite:  the CPUs are outrunning the
>| main memory.  Since instructions can be cached fairly effectively,
>| they are getting cheaper and data is getting more expensive.  Doing
>| the increment by hand often costs you almost nothing, because it can
>| be hidden in the delay slot(s) of the memory access.  Autoincrement
>| showed up best in tight loops, exactly where effective caching can be
>| expected to largely eliminate memory accesses for instructions.  Why
>| bother with autoincrement?
>
>Also, auto-incrementing addressing modes imply:
>
>	- Another adder (to increment the address register in parallel)
>
>	- Another writeback port to the register file
>
>Unless you wish to sequence the instruction over multiple cycles :-(
>
>I'm certain that most people can find something better to do with these
>resources than auto-increment.

For many people, auto-increment *is* something better!

The point under discussion is that, with increasing density, the
tendency is to add complexity to the chips. There can be debates on the
trade-offs of different additions, but I doubt that the high-performance
chips of the (near) future will have to worry about another adder. Adding
more ports to register files is a challenge for the silicon groups, but,
hey, they have to earn their money too!

Caches (especially data caches) top out very easily. For many Unix-box
type applications, we have already reached the point of vastly diminished
returns: adding more cache won't bring your performance up.  (Or I could
talk about our application, where the diminishing returns start *before*
you start adding cache.)

It is precisely because the CPU is running faster than memory (even
cached memory) that one has to maximize the amount of work done in
each memory cycle.
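
To make the trade-off concrete, here is a rough sketch in C (my own
illustration, with made-up function names). The first loop is written so
that a compiler for a machine with a post-increment addressing mode could
fold the pointer bump into the load; the second spells out the load and
the increment as separate steps, the way a machine without the mode has
to. Whether any given compiler actually emits an auto-increment
instruction depends on the target, so treat this as a picture of the
idea, not a measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Load and pointer bump written as one expression; a target with a
 * post-increment addressing mode may do both in a single instruction. */
long sum_autoinc(const int *p, size_t n)
{
    long total = 0;
    while (n--)
        total += *p++;
    return total;
}

/* Same work with the load and the increment as separate steps. */
long sum_explicit(const int *p, size_t n)
{
    long total = 0;
    while (n--) {
        int v = *p;     /* memory access */
        p = p + 1;      /* separate address arithmetic */
        total += v;
    }
    return total;
}
```

Both functions compute the same sum; the only difference is how much
work is attached to each memory access.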

Adding addressing modes does not have to mean a machine that is difficult
to pipeline or to design. Don't assume that every architecture with many
addressing modes will be as messy as the VAX instruction encoding.
It is quite possible to have good, clean instruction encoding
that has lots of modes - it just requires lots of gates to be *really*
fast.  Fortunately, lots of gates is exactly where we are headed.

With enough gates, the CPU will get more functional units, which means
that small RISC instructions will not be able to keep all the functional
units busy. The i860 solves this by essentially having a short VLIW mode
(hmmm, Short Very Long?). It is also possible to have bigger instructions
that keep more units busy longer. Please note that different implementations
of the same architecture are possible: a super-fast version that does every
instruction in a single clock (at least for instruction dispatch) and a
cheap version that is (here it comes:) *micro-coded* or whatever.

Just because DEC could/would not do it for the VAX, don't conclude that
the concept is bad.

>
>| As for hardware handling of unaligned data, this is purely a concession
>| to slovenly programmers.  Those of us who have lived with alignment
>| restrictions all our professional lives somehow don't find them a problem.
>| Mips has done this right:  the *compilers* will emit code for unaligned
>| accesses if you ask them to, which takes care of the bad programs, while
>| the *machine* requires alignment.  High performance has always required
>| alignment, even on machines whose hardware hid the alignment rules.
>| Again, why bother doing it in hardware?
>
>The R2000/R3000 can also trap unaligned accesses and fix them up in a
>trap handler.  This is what the Am29000 does, as well.  This is mainly a
>backwards compatibility problem (FORTRAN equivalences, etc.) It is
>infrequent in newer code, mainly appearing in things like packed data
>structures in communication protocols.
>

It would be more accurate to call this "a concession to past constraints".
Remember, many of these old programs were written in the days when memory
was not cheap and performance was expensive. It is not fair to call those
people "slovenly" just because you now have bigger and faster machines.
(If you were talking about programmers who didn't know what they were
doing, then I agree with you.)

Even now, there are real money issues in memory alignment. If you have
a system with 100 megabytes of main memory and "correct" alignment makes
it 150 MB, you have just made the memory 50% more expensive. Or what if
alignment bumps your memory requirement from 63K to 65K, past the 64K
boundary, causing extra chips and possible board layout problems (not to
mention the cost)?

Having the H/W be tolerant of misalignment gives a lot of flexibility in
the design trade-offs.
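
For a feel of where the extra memory goes, here is a small C sketch (the
struct and the function names are hypothetical, picked only for
illustration). It compares the bytes a record needs when its fields are
laid end to end with the bytes the compiler reserves once it inserts
alignment padding. The padded figure depends on the ABI, so the code
only reports it and does not assume a value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: two one-byte fields around a four-byte field. */
struct record {
    uint8_t  tag;
    uint32_t value;
    uint8_t  flag;
};

/* Bytes needed if the fields are stored back to back, with no padding. */
size_t packed_size(void)
{
    return sizeof(uint8_t) + sizeof(uint32_t) + sizeof(uint8_t);
}

/* Bytes the compiler actually reserves, padding included (ABI-dependent). */
size_t aligned_size(void)
{
    return sizeof(struct record);
}
```

On an ABI that aligns `uint32_t` to four bytes, the padded record takes
12 bytes against 6 packed, so a large table of such records doubles in
size purely to satisfy alignment.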

Also, with more and more gates on a chip, it is conceivable that someone
will put together a cache that can handle misalignment in the cache, as
long as the whole data item is in the same line. I.e., data can cross a
word boundary with little or no penalty, but crossing a line boundary will
be slow or disallowed.  With the trend to wider buses (i.e., wider line
sizes), this may well make the performance penalty of misalignment
negligible.
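
The rule such a cache would apply can be sketched in a few lines of C.
The 4-byte word and 16-byte line are assumptions made up for the example,
as are the names; no real cache is being described. The function sorts an
access into the three cases above: entirely within one word, crossing a
word boundary inside one line (cheap), or crossing a line boundary (slow
or disallowed).

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BYTES 4u    /* assumed word size */
#define LINE_BYTES 16u   /* assumed cache line size */

enum access_kind { WITHIN_WORD, CROSSES_WORD, CROSSES_LINE };

/* Classify an access of `size` bytes (size >= 1) starting at `addr`. */
enum access_kind classify(uintptr_t addr, unsigned size)
{
    uintptr_t last = addr + size - 1;   /* address of the final byte */

    if (addr / LINE_BYTES != last / LINE_BYTES)
        return CROSSES_LINE;            /* item straddles two lines */
    if (addr / WORD_BYTES != last / WORD_BYTES)
        return CROSSES_WORD;            /* same line, more than one word */
    return WITHIN_WORD;
}
```

With these sizes, a 4-byte access at address 2 only crosses a word
boundary, while the same access at address 14 spills into the next line.
A wider line makes the second case rarer, which is the point of the
argument.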


Stanley Chow  ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
	      (613)  763-2831


Please don't tell Bell Northern Research about these silly ideas, I have
them convinced that I know everything about processor architecture. They
are even paying me to work on it. [If I don't want to tell them, do you
think I could represent them?]