Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!sdd.hp.com!hplabs!hpcc05!aspen!huck
From: huck@aspen.IAG.HP.COM (Jerry Huck)
Newsgroups: comp.arch
Subject: Re: Segmented Architectures ( formerly Re: 48-bit computers)
Message-ID: <1360009@aspen.IAG.HP.COM>
Date: 2 Apr 91 01:48:40 GMT
References:
Organization: HP Information Architecture Group - Cupertino, CA
Lines: 206

Let me try to explain some of the ways PA-RISC is used by HP-UX and its
relationship to segmentation. But first a couple of notes on PA-RISC
segmentation.

PA-RISC uses segmentation to extend the addressability of the normal
general register file. It is not a partition of these registers into
pieces. Segments are 2^32 bytes in size and give capability in several
areas. At the point when register sizes increase (such as the R4000
path) one expects the segment size to increase with them. The crucial
tradeoffs are in silicon area for register files, datapaths, and ALUs,
that is, the pieces of the CPU that must be widened to accommodate
larger flat addressing.

So for HP, segmentation was not a trade-off against flat addressing,
but rather: is it useful to extend beyond the maximum flat addressing
you can support in your general register file? At the time, 1982-1983,
32-bit general registers gave at least a ten year horizon. Wider
registers would have resulted in non-competitive machines in the
existing technology.

I think most of the arguments against segmentation assume you give up
some flat addressing to get it. That's not necessary. The inclusion
of segmentation offered an efficient scheme to extend addressability
with little hardware cost. All the hardware support for this extended
addressing is well partitioned in the TLB control, with no worse
cycle-time cost than the process ID extensions found in per-process
TLBs (we assumed flushing the TLB on context switch is to be avoided).
The primary beneficiaries are the OS and database subsystems.
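
As a rough illustration of the long-address idea above (a sketch only,
not HP's implementation): a long address can be modeled as a space
identifier paired with a 32-bit offset. The names long_address,
split_long_address and advance are made up for this example, and the
width of the space identifier is deliberately left open since it is not
stated here.

```python
OFFSET_BITS = 32                      # segments (spaces) are 2^32 bytes
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def long_address(space_id, offset):
    """Model a long address as (space ID, 32-bit offset) packed in one int.

    Illustrative only: the space-ID width is an assumption left unbounded.
    """
    if space_id < 0:
        raise ValueError("space ID must be non-negative")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset must fit in 32 bits")
    return (space_id << OFFSET_BITS) | offset


def split_long_address(addr):
    """Recover (space ID, offset) from the packed form."""
    return addr >> OFFSET_BITS, addr & OFFSET_MASK


def advance(space_id, offset, delta):
    """Pointer arithmetic happens in a 32-bit general register, so only the
    offset changes and it wraps modulo 2^32; the space ID is untouched."""
    return space_id, (offset + delta) & OFFSET_MASK
```

The last function is the point of the model: ordinary register
arithmetic never carries into the space identifier, which is why the
general registers are extended rather than partitioned.
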
The presence of segmentation (what we call long addresses) is not
exposed to the programs (not to mention that languages have no way to
talk about segmentation). We find many situations where objects remain
<2^32 bytes in size yet the aggregate space greatly exceeds 2^32 bytes.
Larger objects can be managed if some additional structure exists. For
example, a large database can span multiple segments when all database
accesses deal with page-size buckets (not uncommon). There are many
ways to solve all these problems; we found segmentation in PA-RISC to
be very effective in dealing with these applications.

>In comp.arch, mash@mips.com (John Mashey) writes:
> In article <1991Mar27.193512.12417@cello.hpl.hp.com> renglish@cello.hpl.hp.com (Bob English) writes:
> ...
> >I would characterize such objects as belonging to three general types.
> >The first is a large object accessed in a regular way, a large array or
> >matrix, for example. Segment loading and unloading in such an object
> >will be rare, because the compiler will know the segment boundaries and
> >be able to optimize them out of the code.
> I don't quite understand this, but I could be convinced. In fact, this
> could lead to an interesting discussion. Let me suggest the simplest
> conceivable comparison, which is to take the inner loop of the rolled
> DAXPY routine from linpack - code included later, but whose salient feature
> is:
>       do 30 i = 1,n
>          dy(i) = dy(i) + da*dx(i)
>    30 continue
> where dy,dx,da, and n all arrive to the code as arguments.
> Maybe someone would post the likely code, for the loop above, for an
> architecture with
> segmentation (HP PA would be interesting, as the scheme seems generally
> well-thought-out, and HP's compilers are good), for the following cases:

In general, you would not attempt to let objects (especially Fortran
arrays) span segment (what we call space) boundaries and generate
run-time checks for crossing.
As suggested above, we generally confine normal objects to a single
flat space of 32 bits.

> 1) Standard HP-UX, i.e., what do you get if you assume flat
> addressing?

Nothing unusual: the loads and stores one normally expects. HP-UX only
presents the short (roughly flat) addressing mode to the user. There's
a little complication with short addressing that might create short
pointer to long pointer conversions (2 instructions) when the compiler
is not sure whether zero-based array addressing would wrap into another
short pointer quadrant.

> 2) What you would get, if dy and dx can be in separate segments,
> and neither is >4GB? (easy case: just load up 2 segment regs,
> once).

On HP-UX this is speculation, since we don't support it. But if we
did, then the sequence would be something like:

loop:   fldws,ma  8(segmentdxreg,dxbasereg),dxreg   ;get value and skip to next
        fldws     (segmentdyreg,dybasereg),dyreg    ;get value
        fmul,dbl  dareg,dxreg,mulreg
        fadd,dbl  mulreg,dyreg,dyreg
        addib,<   1,tripcount,loop                  ;count and branch
        fstws,ma  dyreg,8(segmentdyreg,dybasereg)   ;delay slot: store, skip to next

> 3) What you need to do in the general case, which is that either
> dx or dy, or both could be >4GB, or (enough to cause the problem)
> that either or both cross segment boundaries?
> (I think this code either takes the easy way out, and does
> 2 segment manipulations per iteration, or else gets compiled into
> something much more complex, but I can be convinced.)

As suggested earlier, this is not what we use segmentation for. If you
need > 32 bit indexes you probably need > 32 bit registers. If common
objects are bigger than 2^32 bytes, then you would want > 32 bit flat
addressing. At least simulating this on PA-RISC would be faster than
on any other shipping RISC microprocessor :-). (Well, at least SPARC,
MIPS, 88K, and RS6000.) Of course that doesn't matter; if it's
important, you'll want flat addressing that does it more simply.
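
The short-pointer complication mentioned under case 1 can be modeled in
a few lines. This is only a sketch, and it assumes the usual PA-RISC
1.x convention, which the text above does not spell out: the top two
bits of a short pointer choose one of sr4..sr7, and the choice is made
from the base register rather than from the final effective address.
The function names are invented for illustration.

```python
MASK32 = 0xFFFFFFFF


def quadrant(addr32):
    """Top two bits of a 32-bit address: which 1 GB quadrant it lies in."""
    return (addr32 & MASK32) >> 30


def space_register_for_short_pointer(base):
    """Assumed convention: quadrant 0..3 of the base selects sr4..sr7."""
    return 4 + quadrant(base)


def needs_long_pointer(base, displacement):
    """True when base and base+displacement fall in different quadrants.

    In that case the space register picked from the base would be the
    wrong one for the data actually touched, so a short pointer is not
    safe and the compiler would have to convert to a long pointer.
    """
    effective = (base + displacement) & MASK32
    return quadrant(base) != quadrant(effective)
```

For instance, an array whose first element sits at 0x40000000 but is
addressed from a zero-based origin 8 bytes lower has its base register
in quadrant 0 while the data is in quadrant 1, so
needs_long_pointer(0x3FFFFFF8, 8) is true.
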
> Recall that the likely situation to be faced is that some FORTRAN
> programmer is told they can have bigger arrays, and they simply set the
> sizes of the arrays up, recompile, and want it to work. Note also, that
> FORTRAN storage allocation has certain implications for what you can and
> can't do regarding rearrangement of where data is. (Also,
> a question: I assume on HP PA implementations that Move-to-Space Register
> instructions are 1-cycle operations, with no additional latency needed
> before a load/store?

Hmm. I'm not sure on that. I would not spend much silicon making that
superfast given the typical use.

> Another question, since PA has 4 Space Registers
> that user code can play with (I think), are there conventions for their
> use, i.e., like callee-save - caller-save conventions for the regular
> registers? or are they all caller-save? (I ask because the code for

sr0,sr1,sr2 are caller saves, sr3,sr4 are callee saves, and sr5, sr6,
sr7 are managed by the OS and not writable by the user.

> >The second is a large object accessed unpredictably with no locality.
> >While the compiler will not be able to predict the segmentation register
> >in such cases, neither will the cache be able to hold the working set,
> >so that miss penalties dominate the additional segment register loads.
> Agreed. If there is no locality, cache and TLB missing eats the machines.
> >The third is a large object accessed unpredictably, but with a high
> >degree of locality. In such cases, loads and stores take up to one
> >additional instruction. Only in this case do segments make any
> >difference in the performance of the machine, and even in this case the
> >difference is small. I don't claim to be an expert in such matters, but
> >I suspect the number of applications fitting this last category is small.
> DBMS, and other things that follow pointer chains around.
> Conventional wisdom says that loads+stores are 30% of the code,
> and so some subset of these incur at least 1 extra cycle.
> However, I suspect that in the general case, you have to keep track
> of the segment numbers, and pass them around, just like you do
> on X86 with far pointers, and hence there are more instructions,
> and in addition, need to keep the space numbers around in integer
> registers for speed in some cases. (Note that every pointer reference
> is conceptually 64-bits, and hence, every pointer argument needs 2
> 32-bit quantities, and probably close to 2X more instructions to set up.
> Also, consider the code on a 32-bit machine for:
>       *p = *q;
> where both p and q are pointer to pointers. and both start in memory:
> this would typically look like (on typical 32-bit RISC):
>       load    r1,q
>       load    r2,p
>       load    r3,0(r1)
>       store   r3,0(r2)
> I think this turns into, on something like HP PA (but correct me if I'm wrong),
> and assuming that c pointers turn into 64-bit things:
>       load    r1,q
>       load    r4,q+4          get SPACE ID
>       movetospaceregister r4,somewhere1
>       load    r2,p
>       load    r5,p+4          get SPACE ID
>       movetospaceregister r5,somewhere2
>       load    r3,0(r1)        and do whatever you have to to get somewhere1
>       load    r6,4(r1)        get SPACE ID
>       store   r3,0(r2)        save the pointer; do what you must to get somewhere2
>       store   r6,4(r2)        save the SPACE ID
> In this case, 4 instructions have turned into 10. I wouldn't pretend this
> example is typical or not, and I'd expect compilers would do better,
> but it is illustrative of what could happen.

Alternatively, any reuse of the pointer avoids the movetospace
operations when dealing with 32-bit objects. Any looping or
database-like access to records would also avoid the overhead.

> Anyway, to get some serious analysis of this, I think one has to
> look at code sequences under various assumptions, and see
>       a) What speed is obtainable by perfect hand-code?
>       b) How likely are compilers to get there?

I'm not sure what "this" is, but one would certainly not propose
segmentation as the mechanism to address common array objects that
exceed the flat addressability of the machine. Nor would you use
32-bit load instructions when the primary pointer size was > 32 bits
(not that John was). It would be similar to an architecture that only
allowed loading 32-bit floating-point variables :-).

HP-UX and the proprietary MPE/XL operating systems make use of long
pointers, as do some of our database vendors. It is very convenient to
be able to directly access > 2^32 bytes without operating system
involvement. Just don't get carried away with it.

Jerry Huck
Hewlett Packard