Path: utzoo!attcan!uunet!cs.utexas.edu!sun-barr!apple!amdcad!rpw3
From: rpw3@amdcad.AMD.COM (Rob Warnock)
Newsgroups: comp.arch
Subject: Re: Register usage [was Re: 80486 vs. 68040 code size]
Message-ID: <25602@amdcad.AMD.COM>
Date: 12 May 89 05:40:43 GMT
References: <921@aber-cs.UUCP> <1989May11.210653.2125@utzoo.uucp>
Reply-To: rpw3@amdcad.UUCP (Rob Warnock)
Organization: [Consultant] San Mateo, CA
Lines: 38

One thing Tim Olson did leave out was the *dynamic* use of the Am29000
local register file (a.k.a. stack cache). Since with the Am29000,
"spilling" a few regs to memory to make room for a new subroutine
context usually does not cause a matching "fill" when the routine
returns [those 128 local regs give you a *lot* of hysteresis], the
interesting "knee of the curve" is in the tradeoff between cache
spill/fill traffic and register cache size. That is, how much
incremental spill/fill traffic do you save (thus incremental bus
bandwidth saved and thus performance gained) for each additional
register you put in the stack cache?

I forget the exact numbers, but as I recall, at 128 local registers
(which at ~7.5 local regs/routine would mean ~17 subroutine contexts
"live" in the cache) the slope was still about 0.1% of the CPU gained
for each additional reg you added to the cache. [Correct me if I'm way
off, Tim.] That's not enough to be worth doubling the local register
file to 256, but is certainly enough to not want to cut the size of the
cache back to 64 (especially since as you started cutting back the
cost/reg would go up).

This is also why -- given the very large register stack cache -- the
compilers bother to reuse locals: If you can keep the average subroutine
context small, you can keep a lot of live (parent) contexts in the cache
without spilling/filling. This has to be traded off against performance
lost by *not* using another reg when it could save CPU time.
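
[A toy model of the hysteresis described above, in case it helps. This
is only a sketch: every routine is assumed to use the same number of
local regs, the names live_contexts and simulate are made up for the
illustration, and it does not model the real Am29000 spill/fill
handlers or reproduce any measured numbers.]

```python
def live_contexts(cache_regs, regs_per_routine):
    """Whole subroutine contexts that fit in the stack cache at once."""
    return int(cache_regs / regs_per_routine)


def simulate(trace, cache_regs, frame_regs):
    """Count registers spilled and filled over a call/return trace.

    trace is a sequence of +1 (call) and -1 (return) events.
    Assumption: every frame uses exactly frame_regs registers.
    """
    if frame_regs > cache_regs:
        raise ValueError("a single frame must fit in the cache")
    top = 0    # first stack register above the current frame
    low = 0    # lowest stack register still resident in the cache
    depth = 0  # number of active frames
    spilled = filled = 0
    for event in trace:
        if event > 0:
            # Call: allocate a frame, spilling the oldest registers
            # only if the resident window no longer fits.
            depth += 1
            top += frame_regs
            excess = (top - low) - cache_regs
            if excess > 0:
                low += excess
                spilled += excess
        else:
            # Return: free the frame, filling only if the caller's
            # frame was spilled earlier.
            if depth == 0:
                raise ValueError("return with no active frame")
            depth -= 1
            top -= frame_regs
            if depth > 0:
                caller_base = top - frame_regs
                if caller_base < low:
                    filled += low - caller_base
                    low = caller_base
    return spilled, filled


print(live_contexts(128, 7.5))                          # 17
# Bouncing between depth 2 and 3 in a 16-reg cache with 8-reg frames:
# one spill on the first deep call, then no further traffic.
print(simulate([1, 1, 1, -1, 1, -1, 1, -1], 16, 8))     # (8, 0)
# Unwinding all the way does force the matching fill.
print(simulate([1, 1, 1, -1, -1, -1], 16, 8))           # (8, 8)
```

[The second trace is the point: once the oldest context has been
spilled, calls and returns near the top of the stack cost nothing, so
a spill need not be paired with a fill.]
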
The "cost" of 0.1% (due to spill/fill traffic) for another reg is a
parameter to the compiler's register allocation algorithm. And sometimes
the C compiler *does* use more than 32 locals (for really messy routines
with lots of nested loops); it's just that the overall number of
registers spilled/filled per subroutine call is *very* low. [In fact,
much less than 1 for any of Dhrystone-1/-2/diff/grep/as/nroff].


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA 94403