Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!rice!uw-beaver!uw-june!pardo From: pardo@cs.washington.edu (David Keppel) Newsgroups: comp.sys.m88k Subject: More noise about register-save-by-mask Summary: long Message-ID: <10069@june.cs.washington.edu> Date: 5 Dec 89 23:54:16 GMT Reply-To: pardo@june.cs.washington.edu (David Keppel) Organization: University of Washington, Computer Science, Seattle Lines: 145 [This has nothing to do with the m88k, as such. It is about an implementation of save-only-the-needed-registers that could be done on the 88k. If you're not interested, hit `n' now.] There's been an ongoing discussion about register saves. Somebody asked me to justify my assertion ``save-by-mask probably isn't worth it on a RISC''. I posted an analysis a while. Paul Wilson has suggested an alternative implementation that might be faster, but still probably doesn't make it worth it. The basic idea is to use the register save mask as an index in to a table. The table contains the addresses of routines that save (restore) registers. There is one routine for each of the 2**n possible mask values. This method looks like its average case is marginally better than the worst case for split caller saves/callee saves. Here's my writeup. >[Use the mask as an index in to a jump table.] # Call function `foo'. movi $mask, r16 call foo # Foo: foo: andl $mask, r16, r16 load jump_table(r16) movi return, r17 jump (r16) return: ... load return_table(r16) jump (r16) # NOTREACHED # Transfer tables jump_table: .address save_none .address save_0 .address save_1 .address save_01 .address save_2 ... # # A bunch of save routines, one for each of the 2**n possible # mask combinations. For 2**16, that might take about 1/2Mb. # Assumes that the mask is storead in r16 and the function # address is stored in r17. # The registers that can be saved-by-mask are in r0-r15. # save_none: jbr (r17) save_0: # save register 0 subl $4, sp, sp store r0, 4(sp) jbr (r17) save_1: # save register 1 subl $4, sp, sp store r0, 4(sp) jbr (r17) ... save_e: # save register 15 subl $4, sp, sp store r0, 4(sp) jbr (r17) save_01: # save registers 0 and 1 subl $8, sp, sp store r0, 4(sp) store r1, 8(sp) jbr (r17) save_02: # save registers 0 and 2 subl $8, sp, sp store r0, 4(sp) store r2, 8(sp) jbr (r17) ... # # Note: some compaction is possible at some expense: # # ... # save_012: # adjust sp # save r2 # save_01: # adjust sp # save r1 # save_0: # adjust sp # save r0 # save_none: # jbr ... # # A bunch of restore routines, one for each 2**n possible # mask combinations. (~1/2Mb.) # Assumes that the mask is stored in r16 and the return # address is in r17. # restore_none: ret restore_0: load 4(sp), r0 addl $4, sp, sp ret So I count 1 mask instruction at caller site. 4 mask/indirect instructions at callee site. 1-18 (average 4?) in jump table 1 indirect instruction at callee site (plus possible reload of r16) 1-18 (average same as jump table) at return table. Adds up to 14 instructions. If you belive Robert Firth's claim of average 2 registers saved with caller saves, you're *way* at a loss. The worst case for caller/callee saves (according to my previous analysis) is 8 saves, 8 restores. If each takes 1 cycle, then you've got 16 cycles. If you believe my claim (see also last posting) that you might normally need to save only, say, r1 and r3, then you're kicking 14 cycles with a bunch of possible branch penalties vs. 16 cycles of straight-line code. There, wasn't that satisfying? By the time this has spread across the USENET backbone, I will have saved about 1 billion processor cycles :-) ;-D on ( Counting rings on a forest ) Pardo -- pardo@cs.washington.edu {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo