Path: utzoo!utgpu!bnr-vpa!bnr-fos!news
From: news@bnr-fos.UUCP (news)
Newsgroups: comp.arch
Subject: Re: How to use silicon (was Re: Intel/MIPS Dhrystone ratio)
Message-ID: <370@bnr-fos.UUCP>
Date: 23 Mar 89 23:55:40 GMT
References: <355@bnr-fos.UUCP> <13@microsoft.UUCP> <16058@cup.portal.com>
Reply-To: schow@bnr-public.UUCP (Stanley Chow)
Organization: Bell-Northern Research, Ottawa, Canada
Lines: 190

In article <13@microsoft.UUCP> w-colinp@microsoft.uucp (Colin Plumb) writes:
>schow@bnr-public.UUCP (Stanley Chow) wrote:
>
>In any structure, if you rearrange the components, you can lose at most
>n-1 bytes to padding, where n is the strictest alignment restriction. For
>most processors, the worst case is a double and a char, 7 bytes out of 16
>wasted. But if this is a major concern, rewrite the code to use two parallel
>arrays. You'll waste at most 7 bytes total (in your 100Meg).
>

I only wish this were true. Many, many applications have "natural" data
structures that are inconvenient to align. Using parallel arrays means
more obscure code and more indexing time. There are of course other
problems in multiprocessing (like keeping different bits of a word from
being written by different processors or processes).

Typically, the worst problems come from many copies of a small data
structure that is not a nice multiple of the word size. A million copies
of a 33-bit structure wastes about 4 megabytes (31 pad bits per copy if
each one is rounded up to 64 bits). It is not possible to pack them
together.

For example, gate-level simulation typically has an array of gate
descriptions with connectivity and state information. The natural (or
logically clear) ordering of the fields is probably not the most compact
ordering.

>In C, this is a bit of a bother, but not too bad. I think requiring alignment
>is one thing that'll never go out of style. On any chip, you want to do it
>because it's more efficient, anyway.
>The only need for unaligned accesses is
>to handle old data formats, which presumably need old programs run on them,
>which will (except in pathological cases) run faster on the new machine
>anyway.

Ah, but this is precisely the point. Many old programs *need* misaligned
accesses. If you don't allow that, the old programs will not run at all!

Incidentally, the historical trend is to be progressively more tolerant
of misalignment, e.g. the IBM /360 and /370 and the Motorola 68K
families. All the "tolerant" machines attach a *penalty* to
misalignment. It is only the very recent crop of so-called RISC chips
that requires alignment again. (Please note that I said historical
*trend*, not *all* CPU families.)

------------------

In article <16058@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
('>>' is Case quoting from my article)
>>Caches (especially Data-cache) top out very easily. For many Unix-box
>>type application, we have already reached the point of vastly-diminished
>>return. Adding more cache won't bring your performance up. (Or I could
>>talk about our application where the diminishing return starts *before*
>>you start adding cache.)
>
>Maybe so, but I think that we have a little farther to go than 4K
>instruction and 8K data, virtual no less. Even if we have 2 million
>transistors, I don't think the caches are going to be "too big."
>

The point is that for small applications, the existing workstations are
already getting hit rates in the high 90's. Some big applications will
thrash any cache, no matter how big. Some applications will only be
happy with megabyte caches.
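[A rough sketch of the arithmetic behind the diminishing-returns point,
not part of the original post. It uses the usual average-access-time
formula (hit cost plus miss rate times miss penalty). The 1-cycle hit,
the 20-cycle miss penalty, and the hit rates are assumed purely for
illustration, and the function names are hypothetical.]

```python
# Illustrative numbers only; none of these come from the post.
HIT_CYCLES = 1       # assumed cost of a cache hit, in CPU cycles
MISS_PENALTY = 20    # assumed extra cycles when the access misses the cache


def avg_access_cycles(hit_rate, hit_cycles=HIT_CYCLES, miss_penalty=MISS_PENALTY):
    """Average cycles per memory access: hit cost + miss rate * miss penalty."""
    return hit_cycles + (1.0 - hit_rate) * miss_penalty


def cycles_saved(old_hit_rate, new_hit_rate):
    """Cycles saved per access when the hit rate improves from old to new."""
    return avg_access_cycles(old_hit_rate) - avg_access_cycles(new_hit_rate)


if __name__ == "__main__":
    # Compare an improvement at a low hit rate with one in the high 90's.
    for old, new in [(0.80, 0.90), (0.97, 0.99)]:
        print(f"hit rate {old:.2f} -> {new:.2f}: "
              f"saves {cycles_saved(old, new):.2f} cycles per access")
```

With these assumed numbers, going from 80% to 90% saves about 2 cycles
per access, while going from 97% to 99% saves only about 0.4. Once the
hit rate is already in the high 90's, there is little left for a bigger
cache to recover.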
I am not saying the caches will be too big. I am saying that there are
different kinds of applications:

  - the small ones that are already running very fast with a 64K cache
  - the medium ones that will run faster with a bigger (say 256K) cache
  - the large ones that will not run any faster until you get to 128M

By putting more cache on chip (and this obviously depends on the
board-level cache and memory system), the small ones will not speed up
and the large ones will not speed up. Only a portion of the medium
applications will run faster, by varying amounts. Depending on which
applications you care about, the bigger cache may or may not be worth
it to you.

>>It is precisely because the CPU is running faster than memory (even
>>cached memory) that one have to maximize the amount of work done in
>>each memory cycle.
>
>"Running faster than memory" is a very misleading statement. Maybe
>latencies are a problem, but bandwidth can be had in abundance. The
>problem is what to do with it, not how to get it!

By "the CPU running faster than memory", I mean that within the current
semiconductor and PCB processes, we can build:

  - very fast ALU functions.
  - very fast on-chip register files.
  - not so fast on-chip cache (I & D).
  - slow off-chip (but on-board) memory.
  - very slow off-board memory.

As a result, the problem is getting instructions into the pipeline, not
executing them. Latency is a very bad problem; bandwidth is merely a bad
problem. I would like to know why you think "bandwidth can be had in
abundance".

>
>>Adding address modes does not mean a difficult machine to pipeline or
>>to design. Don't assume that all architecture with many addressing modes
>>will be as messy as the VAX instruction encoding.
>
>It's true that not all architectures with many addressing modes will be as
>messy as the VAX. However, you simply have to answer the question: are
>those addressing modes, beyond register+register and register+offset,
>really buying you anything?

That is of course the $64K question. As other people have pointed out,
the question is worth at least serious simulation. I suspect everyone
will come up with different answers anyway, but it seems to me premature
to dismiss it.

>
>>With enough gates, the CPU will get more functional units, this means
>>small RISC instructions will not be able to keep all the functional
>>units busy. The i860 solves this by essentially have a short VLIW mode
>
>I thought RISC instructions were too big! :-) :-) Note that the i860's
>dual-instruction mode is essentially a VLIW mode.
>

Do I hear an echo here? :-)

>
>I claim a much better use of multiple functional units is to execute many
>"small" RISC instructions at the same time, i.e., "super scalar" or
>multiple instructions per clock. It just doesn't make sense to bundle,
>bind is a better word, many operations into one instruction. Doing so
>simply thwarts compiler optimization. Addressing modes are probably the
>worst form of semantic binding, in my opinion. So, if we are going to
>have "too many transistors," we should use them to realize a superscalar
>*implementation*, not a complex *architecture.*

Since you believe that there is lots of bandwidth to burn, your
conclusions above are quite logical. Since I believe there is
insufficient bandwidth now, I disagree.

>
>>Even now, there are real money issues in memory alignments. If you have
>>a system with a 100 MegaBytes main memory and "correct" alignment makes
>>it 150 MB, you have just made the system 50% more expensive. Or how about
>>alignment bumps your memory requirement from 63K to 65K causing extra chips
>>and possible board layout problems (not to mention the cost)?
>
>If you can't afford to go to 150 Mbytes (or more likely, paging) or you
>can't afford to go to 65K of RAM (try getting such a small amount, I
>challenge you), then performance must not be the most important thing.
>By all means, then, you should allow un-naturally aligned data and you
>can handle it in hardware or software, as you wish. If performance is
>your first priority, which it pretty much is in everything but the
>cheapest embedded systems (which also accounts for the highest volume!),
>then you *don't* want to allow un-natural alignment.
>

Am I the only one who worries about production cost and
performance/cost ratios? I thought only the US government gets to go
for maximum performance at any cost.

If I can put out a product at maximum performance with 150 Mbytes, or
another product at 95% performance with 100 Mbytes (thereby costing
only 75%), how do you think the decision will go?

Performance may or may not be "the most important" criterion; I have
never worked on a project where performance is the *only* criterion.

From: schow@bnr-public.uucp (Stanley Chow)
Path: bnr-public!schow

Stanley Chow    ..!utgpu!bnr-vpa!bnr-fos!schow%bnr-public
                (613) 763-2831

I do not represent anyone except myself. Even then, I don't often let
me represent myself.