Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!hplabs!cello!renglish
From: renglish@cello.hpl.hp.com (Bob English)
Newsgroups: comp.arch
Subject: Re: Segmented Architectures ( formerly Re: 48-bit computers)
Message-ID: <1991Apr03.193215.16441@cello.hpl.hp.com>
Date: 3 Apr 91 19:32:15 GMT
References: <1659@spim.mips.COM>
Organization: Hewlett Packard Labs
Lines: 124

I want to make something clear up front.  I am not trying to convince
the world at large that segmentation is a better way of providing a
large address space to a single program than a linear address space
with register size equalling the address size.  Neither am I trying to
take a position on the best use of current silicon space or the minimum
usable address space.  What I take issue with is the opinion, expressed
many times in comp.arch, that segmentation is inherently wrong, violates
all principles of good design, and implies severe brain damage on the
part of the designers.

The point that I'm trying to make is that segmentation at the hardware
level, or the lack thereof, is not an issue of architectural principle,
but a design choice with a set of costs and benefits.  Elevating it to a
principle implies that the only acceptable address space is infinite,
because no programmer should ever have to worry about addressability.
At any point, it's a choice between the costs of extending the address
space (register size, etc.) and the benefits derived from doing so, as
well as a choice of system level to provide the service.

mash@mips.com (John Mashey) writes:
> ...take the inner loop of the rolled DAXPY routine from linpack...:
> 3) What you need to do in the general case, which is that either
> dx or dy, or both could be >4GB, or (enough to cause the problem)
> that either or both cross segment boundaries?

Well, this is a bit longer than the code Jerry sent out for the current
case, but it isn't too complicated.
It's 30 instructions, two or three times that of the unsegmented code
posted earlier (after initialization is added to the earlier code), but
the inner loop is unchanged.  In a machine where the compilers dealt
effectively with segments, this would be a normal form for striding
through arrays, and would be highly optimized (at least as good as
this).

Evaluating the performance impact is a bit trickier.  The inner loop is
unchanged, but the setup costs are higher.  For long loops, this is
inconsequential.  For short loops, it adds about 14 cycles to the loop,
or about 12% for a vector length of 20 (there are probably ways to
reduce those costs for short vectors without appreciably increasing the
overhead for long vectors, but that's not important).

How important is this increased overhead?  It seems counterintuitive
that programs demanding objects greater than 32 bits would have their
performance dominated by small vectors, but it could be true.  With one
DAXPY to a 2^^32 array, there would have to be 200 million DAXPYs to
twenty element arrays before the 12% difference in short loop
performance became a 6% difference in overall performance.  If those
accesses were themselves in a loop, and global optimizations were
performed, the overhead would drop way down.

The code:

        mtsr    dysegshadow,segmentdyreg
        mtsr    dxsegshadow,segmentdxreg

; This section eliminates long (> 2^^30) internal runs to simplify
; the later tests.  "ocnt" gets the projected run size for the
; inner loop.

        zdepi   3,1,2,maxrun                    ; set up max run
oloop0: combt,<< gcnt,maxrun,lessmax            ; nullifies on gcnt << maxrun
        copy,tr maxrun                          ; always nullifies
lessmax: copy   gcnt,ocnt                       ; nullified if dropped in

; This section checks for segmentation wraps, so that the inner loop
; won't have to.  "icnt" gets the maximum base register, and then
; the actual inner loop count.

oloop1: comclr,<<= dxbasereg,dybasereg,r0       ; which base is higher?
        or,tr   dxbasereg,r0,icnt               ; or,tr always nullifies
        or      dybasereg,r0,icnt               ; this instruction
        sh3add  ocnt,icnt,tmp1                  ; will the higher base wrap?
        combf,<<,n tmp1,icnt,iloopstart         ;
        subi    7,icnt,icnt                     ; reduce the inner loop cnt
        extrs,tr icnt,1C,1D,icnt                ; to the wrap point
iloopstart: or  ocnt,r0,icnt
        subi    0,icnt,tripcount

; This is the inner loop, same as without segments.

iloop:  fldws,ma 8(segmentdxreg,dxbasereg),dxreg ;get value and skip to next
        fldws   (segmentdyreg,dybasereg),dyreg  ;get value
        fmul,dbl dareg,dxreg,mulreg
        fadd,dbl mulreg,dyreg,dyreg
        addib,< 1,tripcount,iloop
        fstws,ma dyreg,8(segmentdyreg,dybasereg)

; Check for completion, and bump segment registers if appropriate.

        sub     gcnt,icnt,gcnt                  ; decrement global count
        combt,<= gcnt,r0,done                   ; check for completion
        comclr,= dxbasereg,r0,r0                ; increment space register that
        addi    1,dxsegshadow,dxsegshadow       ; wrapped
        mtsr    dxsegshadow,segmentdxreg
        comclr,= dybasereg,r0,r0                ; increment space register that
        addi    1,dysegshadow,dysegshadow       ; wrapped
        b       oloop0
        mtsr    dysegshadow,segmentdyreg
done:

> DBMS, and other things that follow pointer chains around.
> Conventional wisdom says that loads+stores are 30% of the code,
> and so some subset of these incur at least 1 extra cycle.

If every one of these loads and stores required twice as many cycles
(the case you mentioned is pretty much a worst case for a segmented
architecture), then the machine's performance would be reduced by 30% in
code that made heavy use of large objects.  What little intuition I have
in the matter suggests, however, that the actual overhead will be
significantly less than 30%, as this overhead would not be incurred on
every load or store.  Access to the stack, for example, would not incur
this overhead, nor would access to a small object after that object has
been located (in most cases objects less than the segment size can be
constrained to lie completely within a segment).
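
The two back-of-the-envelope estimates in this post (the short-vector
setup cost and the load/store penalty) can be checked with a few lines
of arithmetic.  What follows is only a sketch under stated assumptions:
one cycle per instruction, the 6-instruction inner loop shown in the
code, and the 14-cycle setup figure quoted earlier.  The function names
are made up for illustration and nothing here is a measurement.

```python
# Rough cycle-count model.  Assumes one cycle per instruction (an
# assumption, not a measurement) and the 6-instruction inner loop.
INNER_CYCLES_PER_ELEMENT = 6
SETUP_OVERHEAD_CYCLES = 14  # extra setup per DAXPY call, as quoted in the post


def short_vector_overhead(n):
    """Fractional extra cycles for one n-element DAXPY due to segment setup."""
    return SETUP_OVERHEAD_CYCLES / (INNER_CYCLES_PER_ELEMENT * n)


def short_calls_for_slowdown(target, big_n=2**32, short_n=20):
    """Short DAXPY calls needed next to one big_n-element DAXPY before
    total cycles grow by the fraction `target` (e.g. 0.06)."""
    big = INNER_CYCLES_PER_ELEMENT * big_n
    short = INNER_CYCLES_PER_ELEMENT * short_n
    # Solve SETUP * k / (big + short * k) = target for k.
    return target * big / (SETUP_OVERHEAD_CYCLES - target * short)


def load_store_slowdown(ls_fraction, far_fraction, extra_cycles=1):
    """Fractional extra cycles when far_fraction of all loads and stores
    each pay extra_cycles for segment handling."""
    return ls_fraction * far_fraction * extra_cycles


if __name__ == "__main__":
    print(short_vector_overhead(20))       # setup cost for a 20-element vector
    print(short_calls_for_slowdown(0.06))  # calls needed for a 6% slowdown
    print(load_store_slowdown(0.30, 1.0))  # worst case: every load/store pays
    print(load_store_slowdown(0.30, 0.25)) # hypothetical: a quarter of them pay
```

Under these assumptions the 20-element case comes out near 12%, and the
break-even count lands a little over 200 million calls, in line with the
figures above.  The 0.25 in the last line is an arbitrary illustrative
value, not a claim about any real workload.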

As a data point, HP's proprietary OS uses spaces (the term HP uses for
large segments) to support databases and file systems.  The segmentation
overhead they've incurred has not been large enough to warrant making
space register ops 1 cycle.

--bob--
renglish@hplabs

If I were speaking, I'd be speaking for myself.
Since I'm typing, I'm typing for myself.