Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!sdd.hp.com!hplabs!hpcc05!aspen!huck
From: huck@aspen.IAG.HP.COM (Jerry Huck)
Newsgroups: comp.arch
Subject: Re: Segmented Architectures ( formerly Re: 48-bit computers)
Message-ID: <1360009@aspen.IAG.HP.COM>
Date: 2 Apr 91 01:48:40 GMT
References: <efeustel.669650766@tiger1>
Organization: HP Information Architecture Group - Cupertino, CA
Lines: 206

Let me try to explain some of the ways PA-RISC is used by HP-UX and its
relationship to segmentation.  But first a couple of notes on PA-RISC
segmentation.

PA-RISC uses segmentation to extend the addressability of the
normal general register file.  It is not a partition of these
registers into pieces.  Segments are 2^32 bytes in size and give
capability in several areas.  At the point when register sizes
increase (such as the R4000 path) one expects the segmentation size
to increase.  The crucial tradeoffs are in silicon area for register
files, datapaths, and ALUs, that is, the pieces of the CPU that must be
widened to accommodate larger flat addressing.

So for HP, segmentation was not a trade-off against flat addressing,
but rather: is it useful to extend beyond the maximum flat addressing
you can support in your general register file?  At the time,
1982-1983, 32-bit general registers gave at least a ten year horizon.
Wider registers would have resulted in non-competitive machines in the
existing technology.

I think most of the arguments against segmentation assume you give up
some flat addressing to get it.  That's not necessary.

The inclusion of segmentation offered an efficient scheme to extend
addressability with little hardware cost.  All the hardware support
for this extended addressing is well partitioned in the TLB control
with no worse cycle-time cost than process ID extensions found in per
process TLBs (we assumed flushing the TLB on context switch is to be
avoided).
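
To illustrate the point about TLB tagging, here is a minimal sketch (not a model of any real PA-RISC TLB) in which each entry is keyed by the space ID plus the virtual page number.  The class name, the 4 KB page size, and the example space IDs are all assumptions made up for the illustration.  Two processes can use the same 32-bit offset without a flush on context switch because the space ID separates their entries.

```python
PAGE_BITS = 12  # assumed 4 KB pages; hypothetical value for this sketch


class SpaceTaggedTLB:
    """Toy TLB whose entries are tagged by (space ID, virtual page number)."""

    def __init__(self):
        self.entries = {}  # (space_id, vpn) -> physical page number

    def insert(self, space_id, offset, ppn):
        """Install a translation for the page containing `offset`."""
        self.entries[(space_id, offset >> PAGE_BITS)] = ppn

    def translate(self, space_id, offset):
        """Return the physical address, or None on a TLB miss."""
        ppn = self.entries.get((space_id, offset >> PAGE_BITS))
        if ppn is None:
            return None
        return (ppn << PAGE_BITS) | (offset & ((1 << PAGE_BITS) - 1))


tlb = SpaceTaggedTLB()
tlb.insert(space_id=7, offset=0x4000, ppn=0x123)  # one process's page
tlb.insert(space_id=9, offset=0x4000, ppn=0x456)  # another process, same offset
```

Both entries coexist, so switching between the two processes needs no flush; a lookup with a space ID that was never installed simply misses.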

The primary beneficiaries are the OS and database subsystems.  The
presence of segmentation (what we call long addresses) is not exposed
to the programs (not to mention that languages have no way to talk
about segmentation).  We find many situations where objects remain
< 2^32 bytes in size yet the aggregate space greatly exceeds 2^32.  Larger
objects can be managed if some additional structure exists.  For
example, a large database can span multiple segments when all database
accesses deal with page size buckets (not uncommon).  There are many
ways to solve all these problems; we found segmentation in PA-RISC to
be very effective in dealing with these applications.
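
As a rough sketch of the database case, a bucket number can be turned into a (space, offset) long address with one divide and one multiply.  The 4096-byte bucket size, the function name, and the idea that the database owns consecutive space IDs are assumptions for illustration only, not a description of any actual product.

```python
SPACE_BYTES = 1 << 32   # one space (segment) covers 2^32 bytes
BUCKET_BYTES = 4096     # assumed page-size bucket; hypothetical value
BUCKETS_PER_SPACE = SPACE_BYTES // BUCKET_BYTES


def bucket_to_long_address(first_space, bucket):
    """Map a bucket number to a (space ID, 32-bit offset) pair.

    Assumes the database owns consecutive space IDs starting at
    first_space (an assumption of this sketch).
    """
    space = first_space + bucket // BUCKETS_PER_SPACE
    offset = (bucket % BUCKETS_PER_SPACE) * BUCKET_BYTES
    return space, offset
```

Because the bucket size divides the space size evenly, no bucket ever straddles a space boundary, so code working inside one bucket needs only 32-bit offsets.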


>In comp.arch, mash@mips.com (John Mashey) writes:
>  In article <1991Mar27.193512.12417@cello.hpl.hp.com> renglish@cello.hpl.hp.com (Bob English) writes:
>  ...
>  >I would characterize such objects as belonging to three general types.

>  >The first is a large object accessed in a regular way, a large array or
>  >matrix, for example.  Segment loading and unloading in such an object
>  >will be rare, because the compiler will know the segment boundaries and
>  >be able to optimize them out of the code.
>  I don't quite understand this, but I could be convinced.  In fact, this
>  could lead to an interesting discussion.  Let me suggest the simplest
>  conceivable comparison, which is to take the inner loop of the rolled
>  DAXPY routine from linpack - code included later, but whose salient feature
>  is:
>        do 30 i = 1,n
>          dy(i) = dy(i) + da*dx(i)
>     30 continue
>  where dy,dx,da, and n all arrive to the code as arguments.
>  Maybe someone would post the likely code, for the loop above, for an
>  architecture with
>  segmentation (HP PA would be interesting, as the scheme seems generally
>  well-thought-out, and HP's compilers are good), for the following cases:

In general, you would not attempt to let objects (especially fortran arrays)
span segment (what we call space) boundaries and generate run-time checks for
crossing.  As suggested above, we generally confine normal objects to a single
flat space of 32 bits.

>  	1) Standard HP-UX, i.e., what do you get if you assume flat
>  	addressing? 

Nothing unusual, just the loads and stores one normally expects.  HP-UX
only presents the short (roughly flat) addressing mode to the user.  There's
a little complication with short addressing that might create short-pointer
to long-pointer conversions (2 instructions) when the compiler is not sure
whether zero-based array addressing would wrap into another short-pointer
quadrant.
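
A small sketch of the quadrant issue, assuming the usual short-addressing rule that the top two bits of the 32-bit base register pick one of sr4..sr7.  The function names are made up for the illustration.  The check is the one a compiler would have to prove at compile time; when it cannot, it falls back to an explicit long pointer.

```python
def short_pointer_space_register(base):
    """Space register implied by a 32-bit short pointer: sr4..sr7,
    chosen by the top two bits of the base register."""
    return 4 + ((base & 0xFFFFFFFF) >> 30)


def index_stays_in_quadrant(base, byte_index):
    """True if base + byte_index lies in the same 1 GB quadrant as base."""
    target = (base + byte_index) & 0xFFFFFFFF
    return short_pointer_space_register(target) == short_pointer_space_register(base)
```

For example, a base of 0x3FFFFFF8 with an index of 8 lands in the next quadrant, which is the case where the implied space register would be the wrong one.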

>  	2) What you would get, if dy and dx can be in separate segments,
>  	and neither is >4GB?  (easy case: just load up 2 segment regs,
>  	once).

On HP-UX this is speculation, since we don't support it.  But if we did,
then the sequence would be something like:
           <load up the long pointers>
           <move the segment number of dy into one of four segment registers>
           <move the segment number of dx into one of four segment registers>
           <any other loop setup stuff: trip counts, indexes...>
       loop:
           fldds,ma  8(segmentdxreg,dxbasereg),dxreg  ;get dx(i) and skip to next
           fldds     (segmentdyreg,dybasereg),dyreg   ;get dy(i)
           fmul,dbl  dareg,dxreg,mulreg               ;da*dx(i)
           fadd,dbl  mulreg,dyreg,dyreg               ;dy(i) + da*dx(i)
           addib,<   1,tripcount,loop                 ;count up and loop
           fstds,ma  dyreg,8(segmentdyreg,dybasereg)  ;store dy(i) in the delay slot, skip to next
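
For readers who don't read PA-RISC assembly, the same loop can be sketched in a few lines of Python.  This is only a model: `mem` is a hypothetical dictionary keyed by (space, offset), and the function name is made up.  The point it shows is that the two space IDs are fixed before the loop and only the 32-bit offsets change inside it.

```python
def daxpy_long(mem, n, da, dx, dy):
    """dy[i] += da * dx[i], with dx and dy given as (space, offset) pairs.

    mem is a hypothetical dict keyed by (space, offset) holding doubles.
    """
    dx_space, dx_off = dx   # space IDs are set up once, outside the loop
    dy_space, dy_off = dy
    for _ in range(n):
        x = mem[(dx_space, dx_off)]
        dx_off += 8         # post-modify by 8 bytes, like the ,ma load
        y = mem[(dy_space, dy_off)]
        mem[(dy_space, dy_off)] = y + da * x
        dy_off += 8         # post-modify by 8 bytes on the store
    return mem
```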


>  	3) What you need to do in the general case, which is that either
>  	dx or dy, or both could be >4GB, or (enough to cause the problem)
>  	that either or both cross segment boundaries?
>  	(I think this code either takes the easy way out, and does
>  	2 segment manipulations per iteration, or else gets compiled into
>  	something much more complex, but I can be convinced.)

As suggested earlier, this is not what we use segmentation for.  If
you need > 32 bit indexes you probably need > 32 bit registers.  If
common objects are bigger than 2^32 bytes, then you would want > 32
bit flat addressing.  Simulating this on PA-RISC would at least be
faster than on any other shipping RISC microprocessor :-) (well, at
least SPARC, MIPS, 88K, and RS6000).  Of course that doesn't matter:
if it's important, you'll want flat addressing, which does it more
simply.

>  Recall that the likely situation to be faced is that some FORTRAN
>  programmer is told they can have bigger arrays, and they simply set the
>  sizes of the arrays up, recompile, and want it to work.  Note also, that
>  FORTRAN storage allocation has certain implications for what you can and
>  can't do regarding rearrangement of where data is.  (Also,
>  a question: I assume on HP PA implementations that Move-to-Space Register
>  instructions are 1-cycle operations, with no additional latency needed
>  before a load/store?  Hmm. 

I'm not sure about that.  I would not spend much silicon making that
superfast, given the typical use.

>                              Another question, since PA has 4 Space Registers
>  that user code can play with (I think), are there conventions for their
>  use, i.e., like callee-save - caller-save conventions for the regular
>  registers?  or are they all caller-save?  (I ask because the code for

sr0, sr1, sr2 are caller-saves,
sr3, sr4 are callee-saves, and
sr5, sr6, sr7 are managed by the OS and not writable by the user.

>  >The second is a large object accessed unpredictably with no locality. 
>  >While the compiler will not be able to predict the segmentation register
>  >in such cases, neither will the cache be able to hold the working set,
>  >so that miss penalties dominate the additional segment register loads.
>  Agreed.  If there is no locality, cache and TLB missing eats the machines.

>  >The third is a large object accessed unpredictably, but with a high
>  >degree of locality.  In such cases, loads and stores take up to one
>  >additional instruction.  Only in this case do segments make any
>  >difference in the performance of the machine, and even in this case the
>  >difference is small.  I don't claim to be an expert in such matters, but
>  >I suspect the number of applications fitting this last category is small.
>  DBMS, and other things that follow pointer chains around.

>  Conventional wisdom says that loads+stores are 30% of the code,
>  and so some subset of these incur at least 1 extra cycle.
>  However, I suspect that in the general case, you have to keep track
>  of the segment numbers, and pass them around, just like you do
>  on X86 with far pointers, and hence there are more instructions,
>  and in addition, need to keep the space numbers around in integer
>  registers for speed in some cases.  (Note that every pointer reference
>  is conceptually 64 bits, and hence, every pointer argument needs 2
>  32-bit quantities, and probably close to 2X more instructions to set up.)
>  Also, consider the code on a 32-bit machine for:
>  	*p = *q;
>  	where both p and q are pointers to pointers, and both start in memory:
>  	this would typically look like (on typical 32-bit RISC):
>  	load r1,q
>  	load r2,p
>  	load r3,0(r1)
>  	store r3,0(r2)
>  I think this turns into, on something like HP PA (but correct me if I'm wrong),
>  and assuming that c pointers turn into 64-bit things:

>  	load r1,q
>  	load r4,q+4	get SPACE ID
>  	movetospaceregister  r4,somewhere1
>  	load r2,p
>  	load r5,p+4	get SPACE ID
>  	movetospaceregister  r5,somewhere2
>  	load r3,0(r1)		and do whatever you have to to get somewher1
>  	load r6,4(r1)	get SPACE ID
>  	store r3,0(r2)	save the pointer; do what you must to get somewhere2
>  	store r6,4(r2)	save the SPACE ID

>  In this case, 4 instructions have turned into 10.  I wouldn't pretend this
>  example is typical or not, and I'd expect compilers would do better,
>  but it is illustrative of what could happen.

Alternatively, any reuse of the pointer avoids the movetospace
operations when dealing with 32-bit objects.  Any looping or
database-like access to records would also avoid the overhead.
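
A back-of-the-envelope sketch of that reuse argument: count how many move-to-space-register operations a sequence of accesses would need if the space register is reloaded only when the space ID changes.  The single-register model and the function name are simplifying assumptions for illustration.

```python
def count_space_register_loads(accesses):
    """Count move-to-space-register operations for a sequence of
    (space ID, offset) accesses, assuming one space register that is
    reloaded only when the space ID changes (a simplifying assumption)."""
    loads = 0
    current = None
    for space, _offset in accesses:
        if space != current:
            loads += 1
            current = space
    return loads
```

Walking several records in one space costs a single load, while ping-ponging between two spaces with one register costs a load per access; with more than one space register available the second case improves as well.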

>  Anyway, to get some serious analysis of this, I think one has to
>  look at code sequences under various assumptions, and see
>  	a) What speed is obtainable by perfect hand-code?
>  	b) How likely are compilers to get there?

I'm not sure what "this" is, but one would certainly not propose
segmentation as the mechanism to address common array objects that
exceed the flat addressability of the machine.  Nor would you use
32-bit load instructions when the primary pointer size was > 32 bits
(not that John was).  It would be similar to an architecture that
only allowed loading 32-bit floating-point variables :-).  HP-UX and the
proprietary MPE/XL operating systems make use of long pointers, as do
some of our database vendors.  It is very convenient to be able to
directly access > 2^32 bytes without operating system involvement.
Just don't get carried away with it.

Jerry Huck
Hewlett Packard