Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!hplabs!cello!renglish
From: renglish@cello.hpl.hp.com (Bob English)
Newsgroups: comp.arch
Subject: Re: Segmented Architectures ( formerly Re: 48-bit computers)
Message-ID: <1991Apr03.193215.16441@cello.hpl.hp.com>
Date: 3 Apr 91 19:32:15 GMT
References: <1659@spim.mips.COM>
Organization: Hewlett Packard Labs
Lines: 124

I want to make something clear up front.  I am not trying to convince
the world at large that segmentation is a better way of providing a
large address space to a single program than a linear address space with
register size equalling the address size.  Neither am I trying to take a
position on the best use of current silicon space or the minimum usable
address space.  What I take issue with is the opinion, expressed many
times in comp.arch, that segmentation is inherently wrong, violates all
principles of good design, and implies severe brain damage on the part
of the designers.

The point that I'm trying to make is that segmentation at the hardware
level, or the lack thereof, is not an issue of architectural principle,
but a design choice with a set of costs and benefits.  Elevating it to a
principle implies that the only acceptable address space is infinite,
because no programmer should ever have to worry about addressability.
At any point, it's a choice between the costs of extending the address
space (register size, etc.) and the benefits derived from doing so, as
well as a choice of system level to provide the service.

mash@mips.com (John Mashey) writes:
> ...take the inner loop of the rolled DAXPY routine from linpack...:
> 	3) What you need to do in the general case, which is that either
> 	dx or dy, or both could be >4GB, or (enough to cause the problem)
> 	that either or both cross segment boundaries?

Well, this is a bit longer than the code Jerry sent out for the current
case, but it isn't too complicated.  It's 30 instructions, two or three
times the length of the unsegmented code posted earlier (after
initialization is added to the earlier code), but the inner loop is
unchanged.  In a
machine where the compilers dealt effectively with segments, this would
be a normal form for striding through arrays, and would be highly
optimized (at least as good as this).

Evaluating the performance impact is a bit trickier.  The inner loop is
unchanged, but the set up costs are higher.  For long loops, this is
inconsequential.  For short loops, it adds about 14 cycles to the loop,
or about 12% for a vector length of 20 (there are probably ways to
reduce those costs for short vectors without appreciably increasing the
overhead for long vectors, but that's not important).

How important is this increased overhead?  It seems counterintuitive
that programs demanding objects too large for 32 bits of address would
have their performance dominated by small vectors, but it could be true.
With one DAXPY to a 2^^32 array, there would have to be 200 million
DAXPYs to twenty element arrays before the 12% difference in short loop
performance became a 6% difference in overall run time.  If those
accesses were themselves in a loop, and global optimizations were
performed, the overhead would drop way down.
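
The estimate above can be checked with a few lines of arithmetic.  This
is only a back-of-envelope sketch: it assumes one cycle per instruction,
a six-instruction inner loop, 14 extra set-up cycles per call, and that
"2^^32 array" means 2^^32 elements.  The function names are made up for
illustration.

```python
# Back-of-envelope model of the set-up overhead discussed above.
# Assumptions (not from the post): 1 cycle per instruction and
# "2^^32 array" read as 2**32 elements.
CYCLES_PER_ELEMENT = 6   # instructions in the inner loop
SETUP_OVERHEAD = 14      # extra cycles per DAXPY call for segment checks


def short_loop_overhead(n):
    """Fractional slowdown of a single n-element DAXPY."""
    return SETUP_OVERHEAD / (CYCLES_PER_ELEMENT * n)


def overall_overhead(short_calls, short_len, long_len):
    """Fractional slowdown for one long DAXPY plus many short ones.

    The long call is charged a single set-up cost; its extra outer-loop
    passes are ignored because they are negligible at this scale.
    """
    base = CYCLES_PER_ELEMENT * (long_len + short_calls * short_len)
    extra = SETUP_OVERHEAD * (short_calls + 1)
    return extra / base


if __name__ == "__main__":
    print(f"20-element loop: {short_loop_overhead(20):.1%}")
    print(f"mixed workload:  {overall_overhead(200_000_000, 20, 2**32):.1%}")
```

Under these assumptions the short-loop penalty comes out near 12% and
the mixed workload near 6%, which matches the figures quoted above.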

The code:

	mtsr	dysegshadow,segmentdyreg
	mtsr	dxsegshadow,segmentdxreg
	; This section eliminates long (> 2^^30) internal runs to simplify
	; the later tests.  "ocnt" gets the projected run size for the
	; inner loop.
	zdepi   3,1,2,maxrun			  ; set up max run
oloop0:	
	combt,<<       gcnt,maxrun,lessmax	  ; nullifies on gcnt << maxrun
	copy,tr	       maxrun,ocnt		  ; always nullifies
lessmax:
	copy	gcnt,ocnt			  ; nullified if dropped in

	; This section checks for segmentation wraps, so that the inner loop
	; won't have to. "icnt" gets the maximum base register, and then
	; the actual inner loop count.
oloop1:
	comclr,<<=	dxbasereg,dybasereg,r0	  ; which base is higher?
	or,tr	dxbasereg,r0,icnt		  ; or,tr always nullifies
	or	dybasereg,r0,icnt		  ; this instruction
	sh3add	ocnt,icnt,tmp1			  ; will the higher base wrap?
	combf,<<,n	tmp1,icnt,iloopstart	  ;
	subi	7,icnt,icnt			  ; reduce the inner loop cnt
	extrs,tr	icnt,1C,1D,icnt		  ; to the wrap point
iloopstart:
	or	ocnt,r0,icnt
	subi	0,icnt,tripcount

	; This is the inner loop, same as without segments.
iloop:
	fldws,ma  8(segmentdxreg,dxbasereg),dxreg  ;get value and skip to next
	fldws     (segmentdyreg,dybasereg),dyreg   ;get value
	fmul,dbl  dareg,dxreg,mulreg   
	fadd,dbl  mulreg,dyreg,dyreg
	addib,<   1,tripcount,iloop
	fstws,ma  dyreg,8(segmentdyreg,dybasereg)

	; Check for completion, and bump segment registers if appropriate.
	sub	gcnt,icnt,gcnt			; decrement global count
	combt,<= gcnt,r0,done			; check for completion
	comclr,<>	dxbasereg,r0,r0		; increment space register that
	addi	1,dxsegshadow,dxsegshadow	; wrapped
	mtsr	dxsegshadow,segmentdxreg
	comclr,<>	dybasereg,r0,r0		; increment space register that
	addi	1,dysegshadow,dysegshadow	; wrapped
	b	oloop0
	mtsr	dysegshadow,segmentdyreg
done:
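
For readers who don't want to trace the nullification by hand, the
control flow of the listing can be modelled in a few lines of Python.
This is a rough sketch of the idea, not a translation of the assembly:
it assumes 2^^32-byte segments, 8-byte elements, 8-byte-aligned offsets,
and a run cap of 2^^30 elements.  The names split_runs, SEG_BYTES, ELEM
and MAX_RUN are hypothetical.

```python
# Model of the outer loop: break a long DAXPY into runs that never
# cross a segment boundary, so the inner loop needs no segment checks.
# Assumptions: 2**32-byte segments, 8-byte aligned double elements.
SEG_BYTES = 2**32
ELEM = 8
MAX_RUN = 2**30  # cap on a single run, as in the listing's comment


def split_runs(gcnt, dx_off, dy_off):
    """Return a list of (dx_seg, dx_off, dy_seg, dy_off, count) runs."""
    dx_seg = dy_seg = 0
    out = []
    while gcnt > 0:
        ocnt = min(gcnt, MAX_RUN)             # projected run size
        hi = max(dx_off, dy_off)              # higher offset wraps first
        to_wrap = (SEG_BYTES - hi) // ELEM    # elements left before a wrap
        icnt = min(ocnt, to_wrap)             # actual inner loop count
        out.append((dx_seg, dx_off, dy_seg, dy_off, icnt))
        dx_off = (dx_off + icnt * ELEM) % SEG_BYTES
        dy_off = (dy_off + icnt * ELEM) % SEG_BYTES
        if dx_off == 0:                       # dx wrapped: next segment
            dx_seg += 1
        if dy_off == 0:                       # dy wrapped: next segment
            dy_seg += 1
        gcnt -= icnt
    return out


if __name__ == "__main__":
    # dx starts three elements short of a segment boundary.
    for run in split_runs(10, SEG_BYTES - 3 * ELEM, 0):
        print(run)
```

The point of the structure is that all the segment bookkeeping happens
once per run, outside the loop that touches the data.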

> DBMS, and other things that follow pointer chains around.

> Conventional wisdom says that loads+stores are 30% of the code,
> and so some subset of these incur at least 1 extra cycle.

If every one of these loads and stores required twice as many cycles
(the case you mentioned is pretty much a worst case for a segmented
architecture), then execution time would grow by about 30% in code
that made heavy use of large objects.  What little intuition I
have in the matter suggests, however, that the actual overhead will be
significantly less than 30%, as this overhead would not be incurred on
every load or store.  Access to the stack, for example, would not incur
this overhead, nor would access to a small object after that object has
been located (in most cases objects less than the segment size can be
constrained to lie completely within a segment).
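
To put rough numbers on that intuition, here is a small sketch of the
overhead model.  It assumes a base cost of one cycle per instruction and
one extra cycle per affected memory operation.  The 25% figure in the
example is purely hypothetical, chosen only to show the shape of the
calculation, and is not a measurement.

```python
# Simple model: only a fraction of loads and stores pay the
# segmentation penalty.  All parameter values are illustrative.
def slowdown(mem_fraction, far_fraction, extra_cycles=1.0, base_cpi=1.0):
    """Fractional increase in run time.

    mem_fraction: share of instructions that are loads or stores
    far_fraction: share of those that need segment manipulation
    extra_cycles: added cycles for each such access
    base_cpi:     cycles per instruction without the penalty
    """
    return mem_fraction * far_fraction * extra_cycles / base_cpi


if __name__ == "__main__":
    # Worst case: every load and store pays one extra cycle.
    print(f"worst case:   {slowdown(0.30, 1.0):.1%}")
    # Hypothetical case: stack and small-object accesses pay nothing,
    # leaving a quarter of the memory operations affected.
    print(f"hypothetical: {slowdown(0.30, 0.25):.1%}")
```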

As a data point, HP's proprietary OS uses spaces (the term HP uses for
large segments) to support databases and file systems.  The segmentation
overhead they've incurred has not been large enough to warrant making
space register ops 1 cycle.

--bob--
renglish@hplabs
If I were speaking, I'd be speaking for myself.  Since I'm typing, I'm
typing for myself.