Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!lll-lcc!styx!ames!amdcad!bcase
From: bcase@amdcad.AMD.COM (Brian Case)
Newsgroups: comp.arch
Subject: Re: Japanese 32-bit CPUs ( NEC V70 )
Message-ID: <16561@amdcad.AMD.COM>
Date: Wed, 6-May-87 14:09:13 EDT
Article-I.D.: amdcad.16561
Posted: Wed May 6 14:09:13 1987
Date-Received: Sat, 9-May-87 00:56:43 EDT
References: <3810030@nucsrl.UUCP> <491@necis.UUCP> <3530@spool.WISC.EDU> <4016@necntc.NEC.COM>
Reply-To: bcase@amdcad.UUCP (Brian Case)
Organization: Advanced Micro Devices, Inc., Sunnyvale, Ca.
Lines: 96
Keywords: V60, V70

In article <4016@necntc.NEC.COM> pec@necntc.UUCP (Paul Cohen) writes:
>In article <1157@cottage.WISC.EDU> lm@cottage.WISC.EDU (Larry McVoy) writes:
>No doubt about it, the V70 is a complex chip; it is also fast.  It packs
>a great deal of functionality to provide high performance at a reasonable
>cost in a real system.
>
>>It looks like the 29K may have made some smart moves....
>
>It depends on its objectives.  The 29K requires two separate
>paths to memory, one for code and another for data.  The memory must be
>extremely fast (read expensive) to service the CPU without wait states.
>It also expects some specialized bus monitoring hardware in the memory
>system:
>
> >From: tim@amdcad.AMD.COM (Tim Olson)
> >Subject: Re: AM29000 memory management (was flame)
>
> >The "best" place for the referenced and changed bits, however, are in
> >an external memory array, which "watches" the bus and automatically
> >updates the R & C bits.  This array can also be read from or written
> >to via I/O space to read or clear the bits.
>
>Also, taking advantage of the AMD RISC style architecture places some
>uncomfortable demands on compiler developers.
>
>I'm not knocking the AMD part.  It is an interesting processor and I'll
>be interested in seeing what it does in a real system but I'll also be
>interested in seeing what such a system costs.
>
>If system cost does is no concern to you then disregard my comments but
>if a good cost/performance ratio seems important to you (not to mention
>V30 software compatibility) I suggest that you take another look at the
>V70 and the V60.

First, the memory system for the Am29000 may be as simple as VideoDRAM
(VDRAM).  These VDRAMs are, I believe, only marginally more expensive
than regular DRAMs and allow the Am29000 to deliver a fair fraction of
its maximum performance.  I know of one potential customer who is
simulating the Am29000 with VDRAMs and is quite satisfied with the
results (frankly, I was very surprised at the performance, but this may
be an isolated case).  Let's face it: you can try lots of stuff with
instruction set encoding, pipelining tricks, etc., but in the end, the
performance of the CPU comes down to that of the memory hierarchy.  As
the designers of the Am29000, we recognized this fact and did what we
could to *solve* the problem instead of trying our best to *hide* the
problem in highly-encoded instruction formats.  Getting the best
performance from the Am29000 probably *does* require an expensive memory
system, but we look at it this way: at least the Am29000 gives the
system designer a *chance* to get superior performance.  We feel we have
given the designer the "Max Headroom." :-) :-)

Second, the Am29000 does not "expect" sophisticated bus monitoring
hardware.  Maintaining referenced and modified bits in hardware
associated with the memory arrays is what we consider the *best* way;
since the TLB reload routines for the Am29000 can be tailored to
specific needs, it is, of course, quite possible to maintain this
information in software.  However, there is an associated performance
cost.  Even if the TLB reload is done by "hardware" (really microcode or
some state machine) on the CPU chip, there is a performance cost.
Keeping referenced and modified bits in hardware next to the memory
arrays is probably best for multiprocessor systems too.
But there are *lots* of specific tradeoffs to make for a particular
system; again, we feel that we have given the "Max Headroom," since a
designer may choose to maintain referenced and modified information
wherever he (she?) chooses.  When the TLB reload and other VM tasks are
done by fixed routines/state machines in "hardware", there can be
problems.

Third, I don't know what architectural features of the Am29000 are
considered to place uncomfortable demands on compiler writers.
Overlapped loads and stores are there to be taken advantage of when
possible; one customer has demonstrated them to be worth nearly a
factor of two in performance on one graphics benchmark, though I don't
think this will be the case most of the time.  The Am29000 interlocks to
ensure correct operation when full overlap isn't possible.  Delayed
branches must be dealt with by software constructors (be they human
beings or compilers), but this is not a big deal (in fact it is, I
believe, one of the simplest optimizations to perform).  For the
Am29000, using the local register file as a stack cache can make
register allocation easy.  Three-address register-register instructions
and the load/store architecture make code generation easy.  The kinds of
optimizations that are important for reaping maximum performance from
the Am29000 are the same ones that are important for reaping maximum
performance from any architecture: loop optimizations, common
subexpression elimination, induction variable elimination, strength
reduction, etc.  We believe that the Am29000 makes these optimizations
*easier*, not more difficult.  I believe that most members of the
compiler-writing and architecture community would agree that a simple
architecture with a predictable cost for instructions (in both time and
space) is the best match for automatic code generation.

I wouldn't mind if some of you in the compiler-writing and architecture
community (and OS community too, sorry John) would come to my aid.
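
For readers unfamiliar with the optimizations named above, here is a
minimal sketch of strength reduction combined with induction variable
elimination, written in Python purely for illustration.  It is not
Am29000 code or the output of any real compiler; the function names and
the base/stride parameters are hypothetical.  The first version
multiplies the loop counter on every iteration; the second is roughly
what an optimizer might produce, replacing the multiply with a running
add so the counter is no longer needed to form the address.

```python
def addresses_naive(base, stride, n):
    """Compute n element addresses with one multiply per iteration."""
    out = []
    for i in range(n):
        # The address depends on the induction variable i via a multiply.
        out.append(base + i * stride)
    return out


def addresses_reduced(base, stride, n):
    """Same addresses after strength reduction (hand-applied here)."""
    out = []
    addr = base  # derived induction variable replaces base + i * stride
    for _ in range(n):
        out.append(addr)
        addr += stride  # the multiply has become a cheaper add
    return out
```

Both functions return the same list for the same arguments; only the
work done inside the loop differs.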
I am not trying to say anything bad about the V70.  I just want to set
the record straight about the Am29000.

    bcase