Path: utzoo!attcan!uunet!lll-winken!maddog!brooks
From: brooks@maddog.llnl.gov (Eugene D. Brooks III)
Newsgroups: comp.arch
Subject: Re: Compiler complexity (was: VAX Always Uses Fewer Instructions)
Keywords: RISC CISC
Message-ID: <8752@lll-winken.llnl.gov>
Date: 17 Jun 88 21:53:57 GMT
References: <6921@cit-vax.Caltech.Edu> <28200161@urbsdc> <10595@sol.ARPA> <8717@lll-winken.llnl.gov> <20338@beta.lanl.gov>
Sender: usenet@lll-winken.llnl.gov
Reply-To: brooks@maddog.UUCP (Eugene D. Brooks III)
Organization: Lawrence Livermore National Laboratory
Lines: 68

In article <20338@beta.lanl.gov> jlg@beta.lanl.gov (Jim Giles) writes:
>Any machine with fully pipelined memory access is faster (registers or not).

Methinks the point has been missed here!  It's very simple.  Consider the code:

    a = b + c;
    e = f + g;
    h = i + j;

which is admittedly a contrivance, but is useful for illustration.
With a three register RISC:

    load r0,b
    load r1,c
    add r2,r0,r1
    load r0,f
    load r1,g
    store r2,a
    add r2,r0,r1
    load r0,i
    load r1,j
    store r2,e
    add r2,r0,r1
    store r2,h

Yes, I could do a little better with the code by allowing the add to
destroy an argument, but the basic point is that the low number of
registers introduces "false dependencies" in the register file, which
prevent re-ordering to take advantage of pipelining.  With a lot of
registers the code becomes:

    load r0,b
    load r1,c
    load r3,f
    load r4,g
    load r6,i
    load r7,j
    add r2,r0,r1
    add r5,r3,r4
    add r8,r6,r7
    store r2,a
    store r5,e
    store r8,h

which is just what the Cerberus compiler does with it.  This code rips
along with an issue rate of one instruction per clock and can mask quite
a bit of memory and functional unit latency.

One could say that we will just specify that our CISC instruction set has
fully pipelined memory access, and emit the code

    add a,b,c
    add e,f,g
    add h,i,j

and it would go "just as fast", but just what must be hidden away in the
hardware?
Enough registers to hold all those temporaries in a way that does not
induce resource dependencies, not to mention an execution unit (in
hardware) which looks ahead many instructions to see if it can start some
of the operations in them.  We call this bug-riddled complexity.  One might
as well just expose those registers; they have to be in there.

One might also argue that we ought to take a WM approach and fifo
everything without registers; how then do we handle arbitrary arrival
order for the memory accesses, which frequently occurs in a multiprocessor?
ANSWER: Lots of REGISTERS to implement the fifos, so that stalling is
prevented as much as possible.

Some careful thought about this will reveal why the ETA 10, which uses
CISC-like memory-to-memory instructions for its vector operations, can't
support these operations directly to the SSD (they call it a shared memory).
They can only support (load/store-like) stride 1 vector accesses when
copying in and out between the local memory and the SSD.

Flames?  Sure, I eat fire for lunch.

                    Eugene
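
P.S.  For anyone who wants to poke at the false-dependency point, here is a
toy sketch in Python.  It is not a model of Cerberus or of any real machine.
It assumes an in-order pipeline that issues at most one instruction per clock
and stalls only when a source register is not yet ready, with made-up
latencies (3 clocks for a load, 1 for an add or store).  The names parse,
false_deps and issue_cycles are invented for this illustration.  It runs the
two RISC sequences for the three assignments, with the last store going to h.

```python
# Toy in-order, single-issue pipeline model.  The latencies below are
# assumptions picked for illustration, not figures for any real machine.
LATENCY = {"load": 3, "add": 1, "store": 1}

THREE_REG = """
load r0,b
load r1,c
add r2,r0,r1
load r0,f
load r1,g
store r2,a
add r2,r0,r1
load r0,i
load r1,j
store r2,e
add r2,r0,r1
store r2,h
"""

MANY_REG = """
load r0,b
load r1,c
load r3,f
load r4,g
load r6,i
load r7,j
add r2,r0,r1
add r5,r3,r4
add r8,r6,r7
store r2,a
store r5,e
store r8,h
"""

def parse(text):
    """Turn 'op arg,arg,...' lines into (op, dest_reg, source_regs) tuples."""
    prog = []
    for line in text.strip().splitlines():
        op, args = line.split()
        args = args.split(",")
        if op == "load":        # load reg,mem: writes reg, reads no register
            prog.append((op, args[0], []))
        elif op == "store":     # store reg,mem: reads reg, writes no register
            prog.append((op, None, [args[0]]))
        else:                   # add dst,src1,src2
            prog.append((op, args[0], args[1:]))
    return prog

def false_deps(prog):
    """Count earlier/later pairs where the later instruction overwrites a
    register the earlier one wrote (WAW) or read (WAR)."""
    count = 0
    for j, (_, dst_j, _) in enumerate(prog):
        if dst_j is None:
            continue
        for _, dst_i, srcs_i in prog[:j]:
            if dst_j == dst_i or dst_j in srcs_i:
                count += 1
    return count

def issue_cycles(prog):
    """Clocks needed to issue the whole program, stalling on unready sources."""
    ready = {}      # register -> first clock at which its value can be read
    clock = 0
    for op, dst, srcs in prog:
        issue = max([clock] + [ready.get(r, 0) for r in srcs])
        if dst is not None:
            ready[dst] = issue + LATENCY[op]
        clock = issue + 1
    return clock

if __name__ == "__main__":
    for name, text in (("three registers", THREE_REG), ("many registers", MANY_REG)):
        prog = parse(text)
        print(name, "false deps:", false_deps(prog), "issue clocks:", issue_cycles(prog))
```

Under these assumed latencies the many-register sequence has no register
reuse at all and issues every clock, while the three-register sequence
stalls behind its loads; the register reuse that false_deps counts is what
keeps a compiler from hoisting those loads to hide the latency.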