Path: utzoo!attcan!uunet!lll-winken!maddog!brooks
From: brooks@maddog.llnl.gov (Eugene D. Brooks III)
Newsgroups: comp.arch
Subject: Re: Compiler complexity (was: VAX Always Uses Fewer Instructions)
Keywords: RISC CISC
Message-ID: <8752@lll-winken.llnl.gov>
Date: 17 Jun 88 21:53:57 GMT
References: <6921@cit-vax.Caltech.Edu> <28200161@urbsdc> <10595@sol.ARPA> <8717@lll-winken.llnl.gov> <20338@beta.lanl.gov>
Sender: usenet@lll-winken.llnl.gov
Reply-To: brooks@maddog.UUCP (Eugene D. Brooks III)
Organization: Lawrence Livermore National Laboratory
Lines: 68

In article <20338@beta.lanl.gov> jlg@beta.lanl.gov (Jim Giles) writes:
>Any machine with fully pipelined memory access is faster (registers or not).

Methinks the point has been missed here!

It's very simple.  Consider the code:
	a = b + c;
	e = f + g;
	h = i + j;
which is admittedly a contrivance, but is useful for illustration.
With a three register RISC:
	load r0,b;
	load r1,c;
	add  r2,r0,r1;
	load r0,f;
	load r1,g;
	store r2,a
	add  r2,r0,r1;
	load r0,i;
	load r1,j;
	store r2,e
	add  r2,r0,r1;
	store r2,h
Yes, I could do a little better with the code by allowing the add to destroy
an argument, but the basic point is that the low number of registers introduces
"false dependencies" in the register file, which prevent re-ordering to take
advantage of pipelining.  With a lot of registers the code becomes:
	load r0,b;
	load r1,c;
	load r3,f;
	load r4,g;
	load r6,i;
	load r7,j;
	add  r2,r0,r1;
	add  r5,r3,r4;
	add  r8,r6,r7;
	store r2,a
	store r5,e
	store r8,h
which is just what the Cerberus compiler does with it.  This code rips along
with an issue rate of one instruction per clock and can mask quite a bit of
memory and functional unit latency.  One could say that we will just specify
that our CISC instruction set has fully pipelined memory access; and emit
the code
	add a,b,c
	add e,f,g
	add h,i,j
and it would go "just as fast", but just what must be hidden away in the
hardware?  Enough registers to hold all those temporaries in a way that
does not induce resource dependencies, not to mention an execution unit
(in hardware) which looks ahead many instructions to see if it can start
some of the operations in them.  We call this bug-riddled complexity.
One might as well just expose those registers; they have to be in there.
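
For anyone who wants rough numbers on the two RISC schedules above, here is
a toy model in Python.  Everything in it is an assumption made for
illustration, not a measurement of any real machine: strict in-order issue of
at most one instruction per clock, a load result usable 3 clocks after issue,
an add result usable 1 clock after issue, and an instruction simply waits
until its source registers are ready.  The names parse and total_cycles are
hypothetical.

```python
LOAD_LATENCY = 3  # assumed: clocks from load issue until the register is usable
ADD_LATENCY = 1   # assumed: clocks from add issue until the result is usable

THREE_REGS = """
    load r0,b;
    load r1,c;
    add  r2,r0,r1;
    load r0,f;
    load r1,g;
    store r2,a
    add  r2,r0,r1;
    load r0,i;
    load r1,j;
    store r2,e
    add  r2,r0,r1;
    store r2,h
"""

MANY_REGS = """
    load r0,b;
    load r1,c;
    load r3,f;
    load r4,g;
    load r6,i;
    load r7,j;
    add  r2,r0,r1;
    add  r5,r3,r4;
    add  r8,r6,r7;
    store r2,a
    store r5,e
    store r8,h
"""


def parse(text):
    """Turn 'load r0,b' style lines into (op, dest_reg, source_regs) tuples."""
    program = []
    for line in text.strip().splitlines():
        op, args = line.strip().rstrip(";").split(None, 1)
        fields = [a.strip() for a in args.split(",")]
        if op == "load":
            program.append((op, fields[0], []))          # memory -> register
        elif op == "add":
            program.append((op, fields[0], fields[1:]))  # dest, src1, src2
        elif op == "store":
            program.append((op, None, [fields[0]]))      # register -> memory
        else:
            raise ValueError("unknown op: " + op)
    return program


def total_cycles(program, load_latency=LOAD_LATENCY, add_latency=ADD_LATENCY):
    """Clocks needed to issue every instruction, in order, one per clock."""
    ready = {}   # register -> first clock at which its value can be read
    issue = -1   # clock at which the previous instruction issued
    for op, dest, srcs in program:
        # Issue no earlier than the next clock, and not before operands arrive.
        issue = max([issue + 1] + [ready.get(r, 0) for r in srcs])
        if op == "load":
            ready[dest] = issue + load_latency
        elif op == "add":
            ready[dest] = issue + add_latency
    return issue + 1


if __name__ == "__main__":
    print("three registers:", total_cycles(parse(THREE_REGS)))
    print("many registers: ", total_cycles(parse(MANY_REGS)))
```

Under these made-up latencies the three-register schedule needs 16 clocks for
its 12 instructions, while the many-register schedule needs 12, which is one
issue per clock.  Raise load_latency and the gap widens, since every add in
the three-register version sits waiting on loads that could not be hoisted.
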
One might also argue that we ought to take a WM approach and
FIFO everything without registers; how then do we handle arbitrary arrival
order for the memory accesses, which frequently occurs in a multiprocessor?
ANSWER: Lots of REGISTERS to implement the FIFOs, so that stalling is
prevented as much as possible.
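
One way to read the FIFO point, sketched in Python.  This is my own
illustration, not a description of the WM hardware or of any real machine:
the class name ReorderFifo and its methods are hypothetical.  Slots are
reserved in request order, replies may fill them in any order, and the
consumer only ever sees the oldest slot.  The slots are, of course, registers.

```python
class ReorderFifo:
    """FIFO whose slots are reserved in request order and filled in any order."""

    def __init__(self, depth):
        self.slots = [None] * depth     # the registers backing the FIFO
        self.filled = [False] * depth   # has the reply for this slot arrived?
        self.head = 0                   # oldest reserved slot (next to consume)
        self.tail = 0                   # next slot to reserve
        self.count = 0                  # slots reserved and not yet consumed

    def reserve(self):
        """Issue a memory request; return a slot tag, or None if full (stall)."""
        if self.count == len(self.slots):
            return None
        tag = self.tail
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return tag

    def arrive(self, tag, value):
        """A memory reply comes back, possibly out of order, carrying its tag."""
        self.slots[tag] = value
        self.filled[tag] = True

    def pop(self):
        """Return the oldest value if it has arrived, else None (stall)."""
        if self.count == 0 or not self.filled[self.head]:
            return None
        value = self.slots[self.head]
        self.filled[self.head] = False
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return value


if __name__ == "__main__":
    fifo = ReorderFifo(depth=4)
    tag_b, tag_c, tag_f = fifo.reserve(), fifo.reserve(), fifo.reserve()
    fifo.arrive(tag_f, 30)   # the last request answers first
    print(fifo.pop())        # oldest slot (b) is still empty, so this stalls
    fifo.arrive(tag_c, 20)
    fifo.arrive(tag_b, 10)
    print(fifo.pop(), fifo.pop(), fifo.pop())
```

The shallower the FIFO, the sooner reserve() refuses a new request and the
issue stream stalls; more slots means more outstanding requests in flight.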
	
Some careful thought about this will reveal why the ETA 10, which uses
CISC-like memory-to-memory instructions for its vector operations, can't
support these operations directly to the SSD (they call it a shared memory).
They can only support (load/store-like) stride-1 vector accesses when copying
in and out between the local memory and the SSD.

Flames?  Sure, I eat fire for lunch.


							Eugene