Path: utzoo!utgpu!bnr-vpa!bnr-fos!bnr-public!schow
From: schow@bnr-public.uucp (Stanley Chow)
Newsgroups: comp.arch
Subject: How to use silicon (was Re: Intel/MIPS Dhrystone ratio)
Summary: VAX ^= CISC
Keywords: Complexity, speed, risc, cisc
Message-ID: <355@bnr-fos.UUCP>
Date: 19 Mar 89 14:57:57 GMT
References: <37196@bbn.COM> <1989Mar16.190043.23227@utzoo.uucp> <24889@amdcad.AMD.COM>
Sender: news@bnr-fos.UUCP
Reply-To: schow@bnr-public.UUCP (Stanley Chow)
Organization: Bell-Northern Research, Ottawa, Canada
Lines: 126

In a thread discussing what to do with all the transistor sites
becoming available on big chips, various suggestions about what is
good and not good have been brought up.

It is not often that I disagree with the likes of Tim Olson and
Henry Spencer, so I make a big splash when it happens. (It's okay,
I have my asbestos suit.)

In article <24889@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <1989Mar16.190043.23227@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>| In article <37196@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>| >I predict that the next hardware features to come back will be
>| >auto-increment addressing and the hardware handling of unaligned data.
>|
>| Again, why?  Auto-increment addressing is useful only if instructions
>| are expensive, because it sneaks two instructions into one.  However,
>| the trend today is just the opposite:  the CPUs are outrunning the
>| main memory.  Since instructions can be cached fairly effectively,
>| they are getting cheaper and data is getting more expensive.  Doing
>| the increment by hand often costs you almost nothing, because it can
>| be hidden in the delay slot(s) of the memory access.  Autoincrement
>| showed up best in tight loops, exactly where effective caching can be
>| expected to largely eliminate memory accesses for instructions.  Why
>| bother with autoincrement?
>
>Also, auto-incrementing addressing modes imply:
>
>    - Another adder (to increment the address register in parallel)
>
>    - Another writeback port to the register file
>
>Unless you wish to sequence the instruction over multiple cycles :-(
>
>I'm certain that most people can find something better to do with these
>resources than auto-increment.

For many people, auto-increment *is* something better!

The point of the discussion is that, with increasing density, the
tendency is to add complexity to the chips. There can be debates on
the trade-offs of different additions, but I doubt that the
high-performance chips of the (near) future will worry about another
adder. Adding more ports to register files is a challenge for the
silicon groups, but, hey, they have to earn their money too!

Caches (especially data caches) top out very easily. For many
Unix-box type applications, we have already reached the point of
vastly diminished returns. Adding more cache won't bring your
performance up. (Or I could talk about our application, where the
diminishing returns start *before* you start adding cache.)

It is precisely because the CPU is running faster than memory (even
cached memory) that one has to maximize the amount of work done in
each memory cycle.

Adding addressing modes does not have to make a machine difficult to
pipeline or to design. Don't assume that every architecture with many
addressing modes will be as messy as the VAX instruction encoding. It
is quite possible to have a good, clean instruction encoding that has
lots of modes - it just requires lots of gates to be *really* fast.
Fortunately, lots of gates is exactly where we are headed.

With enough gates, the CPU will get more functional units. This
means small RISC instructions will not be able to keep all the
functional units busy. The i860 solves this by essentially having a
short VLIW mode (hmmm, Short Very Long?). It is also possible to
have bigger instructions that keep more units busy longer.
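
To make the auto-increment argument above concrete, here is a minimal
sketch in C of the same loop written both ways. The function names are
hypothetical and not from the thread. Whether the second form really
turns into a single load-with-increment depends entirely on the target
and the compiler; the sketch only shows where the extra address
arithmetic sits in the source.

```c
#include <stddef.h>

/* Hypothetical helper: sum n words, bumping the pointer by hand.
 * On a machine without auto-increment, each iteration needs a load
 * plus a separate add to advance the address. */
long sum_explicit(const int *p, size_t n)
{
    long total = 0;
    while (n-- > 0) {
        total += *p;    /* load */
        p = p + 1;      /* separate address increment */
    }
    return total;
}

/* Same loop, written so that a compiler targeting an auto-increment
 * addressing mode could fold the increment into the load. */
long sum_postinc(const int *p, size_t n)
{
    long total = 0;
    while (n-- > 0)
        total += *p++;  /* load and increment in one expression */
    return total;
}
```

Both functions return the same result; the disagreement in the thread
is only over whether folding the increment into the memory access is
worth an extra adder and register-file write port.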

Please note that different implementations of the architecture can
include a super-fast version that does all instructions in a single
clock (at least in dispatching of instructions) and also a cheap
version that is (here it comes:) *micro-coded* or whatever. Just
because DEC could/would not do it for the VAX, don't conclude that
the concept is bad.

>
>| As for hardware handling of unaligned data, this is purely a concession
>| to slovenly programmers.  Those of us who have lived with alignment
>| restrictions all our professional lives somehow don't find them a problem.
>| Mips has done this right:  the *compilers* will emit code for unaligned
>| accesses if you ask them to, which takes care of the bad programs, while
>| the *machine* requires alignment.  High performance has always required
>| alignment, even on machines whose hardware hid the alignment rules.
>| Again, why bother doing it in hardware?
>
>The R2000/R3000 can also trap unaligned accesses and fix them up in a
>trap handler.  This is what the Am29000 does, as well.  This is mainly a
>backwards compatibility problem (FORTRAN equivalences, etc.)  It is
>infrequent in newer code, mainly appearing in things like packed data
>structures in communication protocols.
>

It would be more accurate to call this "a concession to past
constraints". Remember, many of these old programs were written in
the days when memory was not cheap and performance was expensive.
It is not fair to call the people "slovenly" just because you now
have bigger and faster machines. (If you were talking about
programmers who didn't know what they were doing, then I agree with
you.)

Even now, there are real money issues in memory alignment. If you
have a system with 100 megabytes of main memory and "correct"
alignment makes it 150 MB, you have just made the memory 50% more
expensive. Or what if alignment bumps your memory requirement from
63K to 65K, causing extra chips and possible board layout problems
(not to mention the cost)?
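
The size of the alignment penalty is easy to estimate. Here is a rough
sketch in C, assuming natural alignment (each field aligned to its own
size, and the record padded to its largest field so that arrays of
records stay aligned). The function names and the example record are
made up for illustration; real compilers and ABIs may use different
rules.

```c
#include <stddef.h>

/* Size of a record when its fields are packed back to back.
 * field[i] is the size in bytes of the i-th field. */
size_t packed_size(const size_t *field, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += field[i];
    return total;
}

/* Size of the same record under the natural-alignment assumption. */
size_t aligned_size(const size_t *field, size_t n)
{
    size_t offset = 0, max_align = 1;
    for (size_t i = 0; i < n; i++) {
        size_t a = field[i];                /* alignment == field size */
        if (a == 0)
            continue;                       /* ignore empty fields */
        offset = (offset + a - 1) / a * a;  /* round up to alignment */
        offset += field[i];
        if (a > max_align)
            max_align = a;
    }
    /* Pad the tail so the next record in an array is aligned too. */
    return (offset + max_align - 1) / max_align * max_align;
}
```

For a hypothetical record with fields of 1, 4, 1 and 2 bytes,
packed_size gives 8 and aligned_size gives 12, which is the same 50%
growth as the 100 MB versus 150 MB example.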

Having the H/W be tolerant of misalignment gives a lot of flexibility
in the design trade-offs.

Also, with more and more gates on a chip, it is conceivable that
someone will put together a cache that can handle misalignment in
the cache, as long as the whole data item is in the same line. I.e.,
data can cross a word boundary with little or no penalty, but crossing
a line boundary will be slow or disallowed. With the trend to wider
buses (i.e., wider line sizes), this may well make the performance
penalty of misalignment negligible.

Stanley Chow  ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
              (613) 763-2831

Please don't tell Bell Northern Research about these silly ideas, I
have them convinced that I know everything about processor
architecture. They are even paying me to work on it. [If I don't
want to tell them; do you think I could represent them?]