Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!mash
From: mash@mips.com (John Mashey)
Newsgroups: comp.arch
Subject: Re: Segmented Architectures ( formerly Re: 48-bit computers)
Message-ID: <1659@spim.mips.COM>
Date: 31 Mar 91 06:31:31 GMT
References: <23189@as0c.sei.cmu.edu> <1991Mar27.193512.12417@cello.hpl.hp.com>
Sender: news@mips.COM
Organization: MIPS Computer Systems, Inc.
Lines: 116
Nntp-Posting-Host: winchester.mips.com

Note: people interested in this topic should especially consider attending
the ASPLOS panel run by Dave Patterson, which includes a panel and audience
discussion of several topics, including segmentation for >32-bits.

In article <1991Mar27.193512.12417@cello.hpl.hp.com> renglish@cello.hpl.hp.com (Bob English) writes:
...
>I would characterize such objects as belonging to three general types.

>The first is a large object accessed in a regular way, a large array or
>matrix, for example.  Segment loading and unloading in such an object
>will be rare, because the compiler will know the segment boundaries and
>be able to optimize them out of the code.

I don't quite understand this, but I could be convinced.
In fact, this could lead to an interesting discussion.
Let me suggest the simplest conceivable comparison, which is to take
the inner loop of the rolled DAXPY routine from linpack - code included
later, but whose salient feature is:

        do 30 i = 1,n
           dy(i) = dy(i) + da*dx(i)
   30   continue

where dy,dx,da, and n all arrive to the code as arguments.

Maybe someone would post the likely code, for the loop above,
for an architecture with segmentation (HP PA would be interesting,
as the scheme seems generally well-thought-out, and HP's compilers are good),
for the following cases:
1)      Standard HP-UX, i.e., what do you get if you assume flat addressing?
2)      What you would get, if dy and dx can be in separate segments,
        and neither is >4GB?  (easy case: just load up 2 segment regs, once).
3)      What you need to do in the general case, which is that either dx
        or dy, or both could be >4GB, or (enough to cause the problem)
        that either or both cross segment boundaries?
        (I think this code either takes the easy way out, and does
        2 segment manipulations per iteration, or else gets compiled into
        something much more complex, but I can be convinced.)

Recall that the likely situation to be faced is that some FORTRAN
programmer is told they can have bigger arrays, and they simply set
the sizes of the arrays up, recompile, and want it to work.
Note also, that FORTRAN storage allocation has certain implications
for what you can and can't do regarding rearrangement of where data is.

(Also, a question: I assume on HP PA implementations that
Move-to-Space Register instructions are 1-cycle operations, with no
additional latency needed before a load/store?)

Hmm.  Another question, since PA has 4 Space Registers that user code
can play with (I think), are there conventions for their use, i.e.,
like callee-save - caller-save conventions for the regular registers?
or are they all caller-save?
(I ask because the code for

        do 30 i = 1,n
           dy(i) = dy(i) + da*dx(i)
   30   continue

AND

        do 30 i = 1,n
           dy(i) = dy(i) + da*dx(i)
           call function(da)
   30   continue

could look rather different in their ability to just set the Space
registers and be done with it.)

>The second is a large object accessed unpredictably with no locality.
>While the compiler will not be able to predict the segmentation register
>in such cases, neither will the cache be able to hold the working set,
>so that miss penalties dominate the additional segment register loads.

Agreed.  If there is no locality, cache and TLB misses eat the machine.

>The third is a large object accessed unpredictably, but with a high
>degree of locality.  In such cases, loads and stores take up to one
>additional instruction.
>Only in this case do segments make any
>difference in the performance of the machine, and even in this case the
>difference is small.  I don't claim to be an expert in such matters, but
>I suspect the number of applications fitting this last category is small.

DBMS, and other things that follow pointer chains around.
Conventional wisdom says that loads+stores are 30% of the code,
and so some subset of these incur at least 1 extra cycle.
However, I suspect that in the general case, you have to keep track
of the segment numbers, and pass them around, just like you do on X86
with far pointers, and hence there are more instructions, and in addition,
you need to keep the space numbers around in integer registers for speed
in some cases.  (Note that every pointer reference is conceptually 64 bits,
and hence, every pointer argument needs 2 32-bit quantities, and
probably close to 2X more instructions to set up.)

Also, consider the code on a 32-bit machine for:

        *p = *q;

where both p and q are pointers to pointers, and both start in memory.
This would typically look like (on typical 32-bit RISC):

        load    r1,q
        load    r2,p
        load    r3,0(r1)
        store   r3,0(r2)

I think this turns into, on something like HP PA (but correct me if I'm
wrong), and assuming that C pointers turn into 64-bit things:

        load    r1,q
        load    r4,q+4          get SPACE ID
        movetospaceregister r4,somewhere1
        load    r2,p
        load    r5,p+4          get SPACE ID
        movetospaceregister r5,somewhere2
        load    r3,0(r1)        and do whatever you have to to get somewhere1
        load    r6,4(r1)        get SPACE ID
        store   r3,0(r2)        save the pointer; do what you must to get somewhere2
        store   r6,4(r2)        save the SPACE ID

In this case, 4 instructions have turned into 10.  I wouldn't pretend
to know whether this example is typical or not, and I'd expect compilers
would do better, but it is illustrative of what could happen.

Anyway, to get some serious analysis of this, I think one has to look
at code sequences under various assumptions, and see
a)      What speed is obtainable by perfect hand-code?
b)      How likely are compilers to get there?
--
-john mashey    DISCLAIMER:
UUCP:   mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:    408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:   MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086
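
Case 3 of the DAXPY question above (dx and/or dy crossing segment
boundaries) can be sketched in C as a strip-mined loop: split the
iteration space at every point where either array crosses a boundary,
so segment setup happens once per strip rather than once per iteration.
This is only an illustration of the "much more complex" alternative,
not HP PA code and not a claim about what any real compiler emits.
The function name, the simulated 64-bit byte addresses dx_addr and
dy_addr, and the seg_bytes parameter are all assumptions made for the
sketch; the arrays themselves live in ordinary flat C memory, and
sizeof(double) is assumed to be 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of doubles that fit between byte_addr and the next segment
 * boundary.  Assumes byte_addr is double-aligned and seg_bytes is a
 * multiple of sizeof(double), so the result is always at least 1. */
static uint64_t elems_to_boundary(uint64_t byte_addr, uint64_t seg_bytes)
{
    return (seg_bytes - byte_addr % seg_bytes) / sizeof(double);
}

/* dy(i) = dy(i) + da*dx(i), processed in strips that never cross a
 * segment boundary in either array.  dx_addr and dy_addr are the
 * (hypothetical) 64-bit virtual byte addresses of dx[0] and dy[0].
 * Returns the number of strips, i.e. how many times a segmented
 * machine would have to set up its segment registers. */
size_t daxpy_strip_mined(size_t n, double da,
                         const double *dx, uint64_t dx_addr,
                         double *dy, uint64_t dy_addr,
                         uint64_t seg_bytes)
{
    size_t strips = 0;
    size_t i = 0;

    assert(seg_bytes >= sizeof(double) && seg_bytes % sizeof(double) == 0);
    assert(dx_addr % sizeof(double) == 0 && dy_addr % sizeof(double) == 0);

    while (i < n) {
        uint64_t off = (uint64_t)i * sizeof(double);
        uint64_t rx = elems_to_boundary(dx_addr + off, seg_bytes);
        uint64_t ry = elems_to_boundary(dy_addr + off, seg_bytes);
        uint64_t chunk = n - i;

        if (rx < chunk) chunk = rx;
        if (ry < chunk) chunk = ry;

        /* A segmented machine would load (at most) two segment
         * registers here, once for the whole strip. */
        strips++;

        /* Inner loop: identical to the flat-address DAXPY loop. */
        for (uint64_t k = 0; k < chunk; k++)
            dy[i + k] += da * dx[i + k];

        i += (size_t)chunk;
    }
    return strips;
}
```

With seg_bytes set to 4GB and arrays that fit inside one segment each,
the loop degenerates to a single strip, which is the "easy case" 2.
When the two arrays are misaligned relative to segment boundaries,
the strips get shorter, and a call inside the loop body (as in the
second FORTRAN variant) would force the setup to be redone per
iteration unless the space registers are preserved across calls.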