Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!mash
From: mash@mips.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: Segmented Architectures ( formerly Re: 48-bit computers)
Message-ID: <1659@spim.mips.COM>
Date: 31 Mar 91 06:31:31 GMT
References: <23189@as0c.sei.cmu.edu> <1991Mar27.193512.12417@cello.hpl.hp.com>
Sender: news@mips.COM
Organization: MIPS Computer Systems, Inc.
Lines: 116
Nntp-Posting-Host: winchester.mips.com

Note: people interested in this topic should especially consider attending
the ASPLOS session run by Dave Patterson, which includes a panel and
audience discussion of several topics, including segmentation for >32 bits.

In article <1991Mar27.193512.12417@cello.hpl.hp.com> renglish@cello.hpl.hp.com (Bob English) writes:
...
>I would characterize such objects as belonging to three general types.

>The first is a large object accessed in a regular way, a large array or
>matrix, for example.  Segment loading and unloading in such an object
>will be rare, because the compiler will know the segment boundaries and
>be able to optimize them out of the code.
I don't quite understand this, but I could be convinced.  In fact, this
could lead to an interesting discussion.  Let me suggest the simplest
conceivable comparison, which is to take the inner loop of the rolled
DAXPY routine from linpack - code included later, but whose salient feature
is:
      do 30 i = 1,n
        dy(i) = dy(i) + da*dx(i)
   30 continue
where dy,dx,da, and n all arrive to the code as arguments.
Maybe someone would post the likely code for the loop above, for an
architecture with segmentation (HP PA would be interesting, as the scheme
seems generally well-thought-out, and HP's compilers are good), for the
following cases:
	1) Standard HP-UX, i.e., what do you get if you assume flat
	addressing? 
	2) What you would get, if dy and dx can be in separate segments,
	and neither is >4GB?  (easy case: just load up 2 segment regs,
	once).
	3) What you need to do in the general case, which is that either
	dx or dy, or both could be >4GB, or (enough to cause the problem)
	that either or both cross segment boundaries?
	(I think this code either takes the easy way out, and does
	2 segment manipulations per iteration, or else gets compiled into
	something much more complex, but I can be convinced.)
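
To make cases 2) and 3) concrete, here is a minimal sketch in C rather than
HP PA code.  Everything in it is an assumption made for illustration: the
far_array model, the tiny SEG_ELEMS segment size (a real 4GB segment would
hold 2^29 doubles), and the space_loads counter that stands in for
move-to-space-register operations.  daxpy_naive is the "easy way out" with
2 segment manipulations per iteration; daxpy_strip is one possible shape of
the "more complex" version, which runs an inner loop up to the nearest
segment boundary of either array.

```c
#include <assert.h>
#include <stdint.h>

#define SEG_ELEMS 8u   /* doubles per segment; deliberately tiny (hypothetical) */

/* Hypothetical model of an array that may span several segments. */
typedef struct {
    double **seg;      /* seg[s] points at the storage of segment s */
    uint64_t base;     /* global element index of the array's first element */
} far_array;

static unsigned long space_loads;   /* simulated segment-register setups */

/* Address one element, paying one segment setup per access. */
static double *far_elem(const far_array *a, uint64_t i) {
    uint64_t g = a->base + i;
    space_loads++;
    return &a->seg[g / SEG_ELEMS][g % SEG_ELEMS];
}

/* Easy way out: one segment setup for dx and one for dy on every iteration. */
void daxpy_naive(uint64_t n, double da, const far_array *dx, const far_array *dy) {
    for (uint64_t i = 0; i < n; i++) {
        double x = *far_elem(dx, i);
        double *y = far_elem(dy, i);
        *y += da * x;
    }
}

/* Strip-mined: set up both segments once per strip, where a strip ends at
   the nearer segment boundary of dx or dy. */
void daxpy_strip(uint64_t n, double da, const far_array *dx, const far_array *dy) {
    uint64_t i = 0;
    while (i < n) {
        uint64_t gx = dx->base + i, gy = dy->base + i;
        uint64_t rx = SEG_ELEMS - gx % SEG_ELEMS;   /* left in dx's segment */
        uint64_t ry = SEG_ELEMS - gy % SEG_ELEMS;   /* left in dy's segment */
        uint64_t run = n - i;
        if (rx < run) run = rx;
        if (ry < run) run = ry;
        assert(run > 0);
        const double *x = &dx->seg[gx / SEG_ELEMS][gx % SEG_ELEMS];
        double *y = &dy->seg[gy / SEG_ELEMS][gy % SEG_ELEMS];
        space_loads += 2;                           /* one setup per array */
        for (uint64_t k = 0; k < run; k++)          /* flat inner loop */
            y[k] += da * x[k];
        i += run;
    }
}
```

In this model the inner loop of daxpy_strip is the same as the flat-address
loop, and the extra cost is the boundary arithmetic in the outer loop.  It
says nothing about what a real compiler would emit.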
Recall that the likely situation to be faced is that some FORTRAN
programmer is told they can have bigger arrays, and they simply set the
sizes of the arrays up, recompile, and want it to work.  Note also, that
FORTRAN storage allocation has certain implications for what you can and
can't do regarding rearrangement of where data is.
Also, a question: I assume that on HP PA implementations, Move-to-Space Register
instructions are 1-cycle operations, with no additional latency needed
before a load/store?  Hmm.  Another question: since PA has 4 Space Registers
that user code can play with (I think), are there conventions for their
use, like the callee-save/caller-save conventions for the regular
registers?  Or are they all caller-save?  I ask because the code for
      do 30 i = 1,n
        dy(i) = dy(i) + da*dx(i)
   30 continue
AND
      do 30 i = 1,n
        dy(i) = dy(i) + da*dx(i)
        call function(da)
   30 continue
could look rather different in their ability to just set the Space registers
and be done with it.

>The second is a large object accessed unpredictably with no locality. 
>While the compiler will not be able to predict the segmentation register
>in such cases, neither will the cache be able to hold the working set,
>so that miss penalties dominate the additional segment register loads.
Agreed.  If there is no locality, cache and TLB misses eat the machine.

>The third is a large object accessed unpredictably, but with a high
>degree of locality.  In such cases, loads and stores take up to one
>additional instruction.  Only in this case do segments make any
>difference in the performance of the machine, and even in this case the
>difference is small.  I don't claim to be an expert in such matters, but
>I suspect the number of applications fitting this last category is small.
DBMS, and other things that follow pointer chains around.

Conventional wisdom says that loads+stores are 30% of the code,
and so some subset of these incur at least 1 extra cycle.
However, I suspect that in the general case, you have to keep track
of the segment numbers, and pass them around, just like you do
on X86 with far pointers, and hence there are more instructions,
and in addition, you need to keep the space numbers around in integer
registers for speed in some cases.  (Note that every pointer reference
is conceptually 64 bits, and hence every pointer argument needs 2
32-bit quantities, and probably close to 2X more instructions to set up.)
Also, consider the code on a 32-bit machine for:
	*p = *q;
	where both p and q are pointers to pointers, and both start in memory.
	This would typically look like (on a typical 32-bit RISC):
	load r1,q
	load r2,p
	load r3,0(r1)
	store r3,0(r2)
I think this turns into, on something like HP PA (but correct me if I'm wrong),
and assuming that C pointers turn into 64-bit things:

	load r1,q
	load r4,q+4	get SPACE ID
	movetospaceregister  r4,somewhere1
	load r2,p
	load r5,p+4	get SPACE ID
	movetospaceregister  r5,somewhere2
	load r3,0(r1)		and do whatever you have to to get somewhere1
	load r6,4(r1)	get SPACE ID
	store r3,0(r2)	save the pointer; do what you must to get somewhere2
	store r6,4(r2)	save the SPACE ID

In this case, 4 instructions have turned into 10.  I wouldn't pretend to
know whether this example is typical, and I'd expect compilers would do
better, but it is illustrative of what could happen.
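
The same count can be reproduced with a small C model of the 10-operation
sequence above.  This is a sketch under stated assumptions, not HP PA
semantics: far_ptr, mem, ld, st, and mtsp are hypothetical names, the
pointer layout is offset first with the SPACE ID at +4 (as in the listing),
and each simulated space holds only 16 words.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-bit "far" pointer: 32-bit byte offset, then 32-bit space ID. */
typedef struct far_ptr {
    uint32_t offset;
    uint32_t space;
} far_ptr;

#define NSPACES     4
#define SPACE_WORDS 16
static uint32_t mem[NSPACES][SPACE_WORDS];   /* toy memory, one row per space */

static unsigned long word_ops;     /* counts 32-bit loads and stores */
static unsigned long space_moves;  /* counts simulated move-to-space-register */

static uint32_t ld(const uint32_t *w)     { word_ops++; return *w; }
static void     st(uint32_t *w, uint32_t v) { word_ops++; *w = v; }

/* Select a space; stands in for a move-to-space-register instruction. */
static uint32_t *mtsp(uint32_t space) {
    assert(space < NSPACES);
    space_moves++;
    return mem[space];
}

/* *p = *q, where p and q are far pointers to far pointers and both start
   in memory (here: in the variables p_var and q_var). */
void far_copy(const far_ptr *q_var, const far_ptr *p_var) {
    uint32_t q_off    = ld(&q_var->offset);       /* load r1,q        */
    uint32_t *q_space = mtsp(ld(&q_var->space));  /* load r4,q+4; mtsp */
    uint32_t p_off    = ld(&p_var->offset);       /* load r2,p        */
    uint32_t *p_space = mtsp(ld(&p_var->space));  /* load r5,p+4; mtsp */
    assert(q_off % 4 == 0 && q_off / 4 + 1 < SPACE_WORDS);
    assert(p_off % 4 == 0 && p_off / 4 + 1 < SPACE_WORDS);
    uint32_t v_off   = ld(&q_space[q_off / 4]);      /* load r3,0(r1)  */
    uint32_t v_space = ld(&q_space[q_off / 4 + 1]);  /* load r6,4(r1)  */
    st(&p_space[p_off / 4], v_off);                  /* store r3,0(r2) */
    st(&p_space[p_off / 4 + 1], v_space);            /* store r6,4(r2) */
}
```

One call to far_copy performs 8 word loads/stores plus 2 space moves in
this model, matching the 10 operations in the listing; a flat 32-bit
pointer copy would need only the 4 loads/stores.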

Anyway, to get some serious analysis of this, I think one has to
look at code sequences under various assumptions, and see
	a) What speed is obtainable by perfect hand-code?
	b) How likely are compilers to get there?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086