Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!rice!uw-beaver!uw-june!pardo
From: pardo@cs.washington.edu (David Keppel)
Newsgroups: comp.sys.m88k
Subject: More noise about register-save-by-mask
Summary: long
Message-ID: <10069@june.cs.washington.edu>
Date: 5 Dec 89 23:54:16 GMT
Reply-To: pardo@june.cs.washington.edu (David Keppel)
Organization: University of Washington, Computer Science, Seattle
Lines: 145

[This has nothing to do with the m88k, as such.  It is about an
implementation of save-only-the-needed-registers that could be done on
the 88k.  If you're not interested, hit `n' now.]


There's been an ongoing discussion about register saves.  Somebody
asked me to justify my assertion ``save-by-mask probably isn't worth
it on a RISC''.  I posted an analysis a while.  Paul Wilson has
suggested an alternative implementation that might be faster, but
still probably doesn't make it worth it.

The basic idea is to use the register save mask as an index in to a
table.  The table contains the addresses of routines that save
(restore) registers.  There is one routine for each of the 2**n
possible mask values.

This method looks like its average case is marginally better than the
worst case for split caller saves/callee saves.  Here's my writeup.

>[Use the mask as an index in to a jump table.]

	# Call function `foo'.

		movi $mask, r16
		call foo

	# Foo:
	foo:
		andl $mask, r16, r16
		load jump_table(r16)
		movi return, r17
		jump (r16)
	return:
		...
		load return_table(r16)
		jump (r16)
		# NOTREACHED

	# Transfer tables
	jump_table:
		.address save_none
		.address save_0
		.address save_1
		.address save_01
		.address save_2
	    ...

	#
	# A bunch of save routines, one for each of the 2**n possible
	# mask combinations.  For 2**16, that might take about 1/2Mb.
	# Assumes that the mask is storead in r16 and the function
	# address is stored in r17.
	# The registers that can be saved-by-mask are in r0-r15.
	#

	save_none:
		jbr (r17)
	save_0:	# save register 0
		subl $4, sp, sp
		store r0, 4(sp)
		jbr (r17)
	save_1:	# save register 1
		subl $4, sp, sp
		store r0, 4(sp)
		jbr (r17)
	    ...
	save_e:	# save register 15
		subl $4, sp, sp
		store r0, 4(sp)
		jbr (r17)
	save_01:
		# save registers 0 and 1
		subl $8, sp, sp
		store r0, 4(sp)
		store r1, 8(sp)
		jbr (r17)
	save_02:
		# save registers 0 and 2
		subl $8, sp, sp
		store r0, 4(sp)
		store r2, 8(sp)
		jbr (r17)
	    ...

	#
	# Note: some compaction is possible at some expense:
	#
	#	...
	#	save_012:
	#		adjust sp
	#		save r2
	#	save_01:
	#		adjust sp
	#		save r1
	#	save_0:
	#		adjust sp
	#		save r0
	#	save_none:
	#		jbr ...

	#
	# A bunch of restore routines, one for each 2**n possible
	# mask combinations.  (~1/2Mb.)
	# Assumes that the mask is stored in r16 and the return
	# address is in r17.
	#
	restore_none:
		ret
	restore_0:
		load 4(sp), r0
		addl $4, sp, sp
		ret


So I count
	1 mask instruction at caller site.
	4 mask/indirect instructions at callee site.
	1-18 (average 4?) in jump table
	1 indirect instruction at callee site (plus possible reload
		of r16)
	1-18 (average same as jump table) at return table.

	Adds up to 14 instructions.

	If you belive Robert Firth's claim of average 2 registers
	saved with caller saves, you're *way* at a loss.

	The worst case for caller/callee saves (according to my
	previous analysis) is 8 saves, 8 restores.  If each takes 1
	cycle, then you've got 16 cycles.  If you believe my claim
	(see also last posting) that you might normally need to save
	only, say, r1 and r3, then you're kicking 14 cycles with a
	bunch of possible branch penalties vs. 16 cycles of
	straight-line code.

There, wasn't that satisfying?

By the time this has spread across the USENET backbone, I will have
saved about 1 billion processor cycles :-)

	;-D on  ( Counting rings on a forest )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo