Path: utzoo!attcan!uunet!seismo!sundc!pitstop!sun!amdcad!crackle!tim
From: tim@crackle.amd.com (Tim Olson)
Newsgroups: comp.arch
Subject: Re: Register Windows (was Re: Japanese...)
Message-ID: <23150@amdcad.AMD.COM>
Date: 6 Oct 88 23:21:48 GMT
References: <58@zeno.MN.ORG> <91@zeno.MN.ORG> <ANDREW.88Sep28160417@jung.ha <9725@cup.portal.com> <287@granite.dec.com>
Sender: news@amdcad.AMD.COM
Reply-To: tim@crackle.amd.com (Tim Olson)
Organization: Advanced Micro Devices, Inc. Sunnyvale CA
Lines: 95

In article <287@granite.dec.com> jmd@granite.UUCP (John Danskin) writes:
| By the way:
| 	There is a paper:
| 		"Register Windows Vs. General Registers: A Comparison of
| 		Memory Access Patterns" by Scott Morrison and Nancy Walker
| 		of UC Berkeley.
| 
| 	Which shows that the MIPS R2000 (aside from running faster) achieves
| 	fewer memory references (in almost all cases) than SPARC with all
| 	levels of optimization and as many as 7 register windows.

That is a count of total memory references; what it leaves out is the
ratio of loads and stores to the rest of the instruction mix.  I
wouldn't be surprised if the Sun compiler is simply not doing as good a
job in general, so that the total number of instructions (loads and
stores included) went up relative to the MIPS compiler.  Even so, loads
and stores as a percentage of the instruction mix might still be lower
on SPARC.
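
To make the distinction concrete, here is a minimal sketch in Python.
The two "compilers" and every count in it are hypothetical, invented
only to show the arithmetic; they are not measurements from SPARC, the
R2000, or any other machine.

```python
def mem_ref_share(loads, stores, total_instructions):
    """Return loads + stores as a fraction of all executed instructions."""
    return (loads + stores) / total_instructions

# Hypothetical counts for illustration only (NOT measured data).
compiler_a = {"loads": 1000, "stores": 400, "total": 7000}
compiler_b = {"loads": 1100, "stores": 450, "total": 10000}

# Absolute number of memory references for each hypothetical compiler.
refs_a = compiler_a["loads"] + compiler_a["stores"]
refs_b = compiler_b["loads"] + compiler_b["stores"]

# The same references expressed as a share of the instruction mix.
share_a = mem_ref_share(compiler_a["loads"], compiler_a["stores"], compiler_a["total"])
share_b = mem_ref_share(compiler_b["loads"], compiler_b["stores"], compiler_b["total"])

print(f"A: {refs_a} memory references, {share_a:.1%} of instructions")
print(f"B: {refs_b} memory references, {share_b:.1%} of instructions")
```

With these made-up numbers, B issues more loads and stores than A in
absolute terms, yet they make up a smaller share of its instruction
mix, so the two metrics can point in opposite directions.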

| 	a) Does anyone know if/where (Earl?) this paper was published?
| 	(I got a copy from MIPS people, they love to give it away).
| 
| 	b) Does anybody at SUN have an answer (tell us how they got it all
| 	wrong, register windows really DO save memory references).
| 
| 	c) Anybody at AMD (Tim?) want to say something about how burst
| 	read/write makes the extra references OK?

Well, I don't know what Sun might be doing wrong, but let's try this:
4.3 BSD nroff, built with the 4.3 libraries, running

	nroff /usr/doc/misc/sysperf/2.t	[a 10655 byte file]

results in:

---------- Pipeline ----------
 32.63% idle pipeline:
	 18.39% Instruction Fetch Wait
	 11.44% Data Transaction Wait
	  0.69% Page Boundary Crossing Fetch Wait
	  0.01% Unfilled Cache Fetch Wait
	  0.00% Load/Store Multiple Executing	<-- Hmm, not much time here!
	  2.07% Load/Load Transaction Wait
	  0.03% Pipeline Latency

---------- Bus Utilization ----------
Inst Bus Utilization:	 63.97%
	 8669133 Instruction Fetches

Data Bus Utilization:	  9.75%
	  979830 Loads
	  340998 Stores

---------- Instruction Mix ----------
	  1.86% Calls
	 15.65% Jumps
	 10.73% Loads
	  3.74% Stores
	  4.33% No-ops

---------- Register File Spilling/Filling ----------
	       3 Spills				<-- this is why
	       0 Fills

Spill/Fill sizes:
   1 registers:         0 time(s) (  0.00%)
   2 registers:         1 time(s) ( 33.33%)
   3 registers:         0 time(s) (  0.00%)
   4 registers:         1 time(s) ( 33.33%)
   5 registers:         0 time(s) (  0.00%)
   6 registers:         0 time(s) (  0.00%)
   7 registers:         0 time(s) (  0.00%)
   8 registers:         0 time(s) (  0.00%)
   9 registers:         0 time(s) (  0.00%)
  10 registers:         0 time(s) (  0.00%)
  11 registers:         0 time(s) (  0.00%)
  12 registers:         1 time(s) ( 33.33%)
  13 registers:         0 time(s) (  0.00%)
  14 registers:         0 time(s) (  0.00%)
  15 registers:         0 time(s) (  0.00%)
  16 registers:         0 time(s) (  0.00%)
> 16 registers:         0 time(s) (  0.00%)

So for the entire nroff run, we wrote a total of 18 words (2 + 4 + 12)
out to memory due to stack cache overflow, and we loaded 0 words from
the stack due to underflow (this is because nroff calls exit() while it
is still a few procedures down in the overall call chain).
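
The arithmetic behind that total can be checked with a short Python
sketch. The histogram and the store count are copied from the output
above; the variable names are mine. It assumes one spilled register is
one word, and comparing against the "Stores" figure assumes the
simulator counts spill traffic there, which the output does not say.

```python
# Non-zero rows of the spill/fill histogram above:
# {registers moved per spill: number of spills of that size}
spill_sizes = {2: 1, 4: 1, 12: 1}

# Store count from the "Bus Utilization" section above.
stores = 340998

# Total spill events and total words written out by them
# (assumption: one register == one word).
spills = sum(spill_sizes.values())
words_spilled = sum(size * count for size, count in spill_sizes.items())

# Spilled words relative to all stores (assumption: spill traffic is
# included in the simulator's store count).
spill_share_of_stores = words_spilled / stores

print(f"{spills} spills, {words_spilled} words spilled")
print(f"spilled words / stores = {spill_share_of_stores:.6f}")
```

Under those assumptions the spill traffic is a vanishingly small
fraction of the stores in this run.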

I would be interested in seeing how this compares to a
non-register-windowed processor, in particular the total number of
loads/stores, the loads/stores as a percentage of instruction mix, and
the number of words of scalar data transferred to/from the stack.
	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)