Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!cmcl2!rutgers!ames!amdcad!neptune!david
From: david@neptune.AMD.COM
Newsgroups: comp.arch
Subject: Re: register windows
Message-ID: <479@neptune.AMD.COM>
Date: Tue, 10-Nov-87 10:57:34 EST
Article-I.D.: neptune.479
Posted: Tue Nov 10 10:57:34 1987
Date-Received: Thu, 12-Nov-87 21:35:43 EST
References: <230@usl-pc.UUCP> <6681@apple.UUCP>
Sender: david@neptune.AMD.COM
Reply-To: david@neptune.AMD.COM (David Witt)
Organization: Advanced Micro Devices, Inc., Austin, Texas
Lines: 37

In article <6681@apple.UUCP> bcase@apple.UUCP (Brian Case) writes:
>Ok, so to address future speed advantages, yes there might be some speed
>advantages for those with simple register files.  However, for the Am29000,
>the critical paths were quite balanced (Dave Witt, are you out there?)
>with, I believe, the TLB and/or instruction cache being the limiting
>factor.  Next came the ALU, and then the register file.  Unless you want
>to do things like spread the ALU cost over two pipestages (possible to do),
>I don't think the register file is going to be the limiting factor.

	Well, since my friend bcase requested a response from me: on the 29k
	design the stack-relative add was one of the speed paths
	encountered on the part, but certainly no worse than the
	64-32 funnel shift, worst-case ALU adds, TLB translation,
	or a conditional jump and read from the branch target cache.

	Specifically for that particular path, in one half clock phase,
	the internal pipe was required to discharge the instruction
	bus and statically add the stack pointer to the a,b,c offset
	in three separate 7-bit adders.  In parallel, a zero detect
	and a check on the msb of the a,b,c values determined the
	selection in a 3:1 multiplexor to enable the stack-relative
	local register, the global registers, or the indirect pointers.
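
	As a rough behavioral sketch of that selection (not the actual
	logic), the following models one operand field. It assumes each
	a, b, c field is 8 bits wide (msb plus a 7-bit offset) and that
	the stack pointer is a 7-bit value; the function and parameter
	names are made up for illustration.

```python
OFFSET_MASK = 0x7F  # low 7 bits; also models the wraparound of a 7-bit adder
MSB = 0x80          # assumed position of the msb in an 8-bit a/b/c field


def select_register(field, stack_pointer, indirect_pointer):
    """Model the 3:1 multiplexor choice for one operand field.

    Returns (source, address). All names here are hypothetical.
    """
    if field == 0:
        # Zero detect: the address comes from the indirect pointer.
        return ("indirect", indirect_pointer)
    if field & MSB:
        # msb set: stack-relative local register, 7-bit add with wraparound.
        return ("local", (stack_pointer + (field & OFFSET_MASK)) & OFFSET_MASK)
    # Otherwise the field names a global register directly.
    return ("global", field)


# The hardware does this three times in parallel, once per field.
for name, field in (("a", 0x82), ("b", 0x03), ("c", 0x00)):
    print(name, select_register(field, stack_pointer=0x7F, indirect_pointer=0x10))
```

	In the part the add and the zero/msb checks run side by side,
	so the multiplexor select is ready when the three sums are.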
	
	The output of the multiplexor was the address for the row/column
	decode for the 3-port register file which would be locally
	decoded and accessed in the next half clock phase for a double
	read.  (The write is delayed by the internal pipe and is
	therefore not a speed path.)  The total delay for this path
	(including a small amount of lookahead for the adder) was
	12 gates.  In initial silicon at nominal temperature this
	worst-case path was passing in excess of 35 MHz.
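
	For a feel of the numbers, here is the arithmetic under one
	reading of the above: that the 12 gates must fit in a single half
	clock phase at exactly 35 MHz.  Both are assumptions for
	illustration, since the part passed at somewhat more than that.

```python
CLOCK_MHZ = 35.0     # assumed: the lower bound quoted for initial silicon
GATES_ON_PATH = 12   # gate delays quoted for the path

period_ns = 1000.0 / CLOCK_MHZ        # one full clock period in ns
half_phase_ns = period_ns / 2.0       # time available in one half phase
ns_per_gate = half_phase_ns / GATES_ON_PATH

print(f"period      {period_ns:.2f} ns")
print(f"half phase  {half_phase_ns:.2f} ns")
print(f"per gate    {ns_per_gate:.2f} ns")
```

	That works out to roughly 14.3 ns per half phase, or about
	1.2 ns per gate delay on this reading.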

	In my opinion, given our internal pipe, designing in this
	functionality was not a major concern.