Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!agate!helios.ee.lbl.gov!ncis.llnl.gov!lll-winken!uunet!portal!cup.portal.com!bcase
From: bcase@cup.portal.com (Brian bcase Case)
Newsgroups: comp.arch
Subject: Re: How to use silicon (was Re: Intel/MIPS Dhrystone ratio)
Message-ID: <16058@cup.portal.com>
Date: 20 Mar 89 23:48:59 GMT
References: <37196@bbn.COM> <1989Mar16.190043.23227@utzoo.uucp> <24889@amdcad.AMD.COM> <355@bnr-fos.UUCP>
Organization: The Portal System (TM)
Lines: 107

>..., but I would doubt that high performances chips of
>the future (near future) will worry about another adder. Adding more
>ports to register files is a challenge for the silicon groups, but, hay,
>they got to earn their money too!

Right, another adder is not a problem. Right, more ports on register
files are not too much of a problem, but is adding a port just so that
auto-increment goes fast the right thing? I say no. Keep reading.

>Caches (especially Data-cache) top out very easily. For many Unix-box
>type application, we have already reached the point of vastly-diminished
>return. Adding more cache won't bring your performance up. (Or I could
>talk about our application where the diminishing return starts *before*
>you start adding cache.)

Maybe so, but I think that we have a little farther to go than 4K
instruction and 8K data, virtual no less. Even if we have 2 million
transistors, I don't think the caches are going to be "too big."

>It is precisely because the CPU is running faster than memory (even
>cached memory) that one have to maximize the amount of work done in
>each memory cycle.

"Running faster than memory" is a very misleading statement. Maybe
latencies are a problem, but bandwidth can be had in abundance. The
problem is what to do with it, not how to get it!

>Adding address modes does not mean a difficult machine to pipeline or
>to design.
>Don't assume that all architecture with many addressing modes
>will be as messy as the VAX instruction encoding.

It's true that not all architectures with many addressing modes will be
as messy as the VAX. However, you simply have to answer the question:
are those addressing modes, beyond register+register and
register+offset, really buying you anything? (BTW, the lack of even
those doesn't seem to cripple the 29K too much.... If you dislike the
29K because of its lack of addressing modes, blame me. :-)

>With enough gates, the CPU will get more functional units, this means
>small RISC instructions will not be able to keep all the functional
>units busy. The i860 solves this by essentially have a short VLIW mode

I thought RISC instructions were too big! :-) :-) Note that the i860's
dual-instruction mode is essentially a VLIW mode.

>(hmmm, Short Very Long?). It is also possible to have bigger instructions
>that keep more units busy longer. Please note that different implementations
>of the architecture can have a super-fast version that does all instructions
>in single clock (at least in dispatching of instruction) and also have a
>cheap version that is (here it comes:) *micro-coded* or whatever.

I claim a much better use of multiple functional units is to execute
many "small" RISC instructions at the same time, i.e., "super scalar"
or multiple instructions per clock. It just doesn't make sense to
bundle ("bind" is a better word) many operations into one instruction.
Doing so simply thwarts compiler optimization. Addressing modes are
probably the worst form of semantic binding, in my opinion. So, if we
are going to have "too many transistors," we should use them to realize
a superscalar *implementation*, not a complex *architecture.*

>Even now, there are real money issues in memory alignments. If you have
>a system with a 100 MegaBytes main memory and "correct" alignment makes
>it 150 MB, you have just made the system 50% more expensive.
>Or how about
>alignment bumps your memory requirement from 63K to 65K causing extra chips
>and possible board layout problems (not to mention the cost)?

If you can't afford to go to 150 Mbytes (or more likely, paging) or you
can't afford to go to 65K of RAM (try getting such a small amount, I
challenge you), then performance must not be the most important thing.
By all means, then, you should allow un-naturally aligned data and you
can handle it in hardware or software, as you wish. If performance is
your first priority, which it pretty much is in everything but the
cheapest embedded systems (which also account for the highest volume!),
then you *don't* want to allow un-natural alignment.

>Having the H/W be tolerant of alignment means a lot of flexibility in the
>design trade-off.

?????

>Also, with more and more gates on a chip, it is conceivable that someone
>will put together a cache that can handle misalignment in the cache, as
>long as the whole data item is in the same line. I.e., data can cross

The problem is not implementing hardware handling of misalignment; the
problem is the performance implication. A mis-aligned load/store takes
two accesses; a good compiler or programmer will know this and align
the access whether the hardware can handle it or not. So what's the
point of having the hardware? If data must be packed as tightly into
memory as possible, then fine, but you must know that you are giving up
performance. At this point, performance no longer is the first
priority, so handling it in software is probably acceptable (with
simple primitives like those of the MIPS processor, e.g.).

>word boundary with little or no penalty, but crossing line boundary will
>be slow or disallowed. With the trend to wider
>buses (i.e., wider line size), this may well make the performance penalty
>of misalignment neglectable.

I'm not sure how feasible it is to force the compiler/programmer to
know whether or not data is going to cross a cache line.
Things like dynamically-allocated data structures might be a problem;
this would take some thought. But, without much thought needed, I do
know that the hardware needed to permit misaligned access within a
cache line is likely to make cache access slower. Since cache access is
probably the limiting stage in integer pipelines (maybe not in FP, but
maybe), this is not a good idea.
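
To make the "two accesses" point and the cache-line case concrete, here
is a rough sketch in Python. It is an illustration only, not a model of
any real machine: the 4-byte bus width, the 16-byte line size, and the
function names are all assumptions made up for this example.

```python
def accesses_needed(addr, size, bus_bytes=4):
    """Count the aligned bus-width accesses needed to touch the bytes
    [addr, addr + size). bus_bytes=4 is an assumed bus width."""
    first_word = addr // bus_bytes
    last_word = (addr + size - 1) // bus_bytes
    return last_word - first_word + 1


def crosses_line(addr, size, line_bytes=16):
    """Return True if the bytes [addr, addr + size) span two cache lines.
    line_bytes=16 is an assumed line size."""
    return addr // line_bytes != (addr + size - 1) // line_bytes


if __name__ == "__main__":
    # A 4-byte load at each byte offset within one assumed 16-byte line.
    for addr in range(16):
        print(addr, accesses_needed(addr, 4), crosses_line(addr, 4))
```

Under these assumptions a naturally aligned 4-byte load needs one
access and every other offset needs two. Only the offsets near the end
of the line also cross into the next line, which is the case the quoted
proposal would make slow or disallow.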