Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!think.com!mintaka!spdcc!iecc!compilers-sender
From: pardo@cs.washington.edu (David Keppel)
Newsgroups: comp.compilers
Subject: Re: C Compilers which use full 486 capabilities
Keywords: 486, optimize
Message-ID: <15501@june.cs.washington.edu>
Date: 18 Mar 91 17:46:49 GMT
References: <9103130754.AA19293@gara.une.oz.au> <wbsAAQr0BwwMR4Tu8E@transarc.com> <1991Mar15.173551.13702@rice.edu>
Sender: compilers-sender@iecc.cambridge.ma.us
Reply-To: pardo@cs.washington.edu (David Keppel)
Organization: University of Washington, Computer Science, Seattle
Lines: 91
Approved: compilers@iecc.cambridge.ma.us

>mason+@transarc.com (Tony Mason) writes:
>>[quoting from Dr. Dobbs]
>>[Old optimization rules fail on the 486, where, a move to or from memory
>> takes one cycle, but exchanging two registers takes three cycles.]

preston@ariel.rice.edu (Preston Briggs) writes:
>Does anyone's compiler generate an exchange instruction?
>I can visualize uses, but I'm not sure how I'd recognize it.
>How would it integrate with register allocation?

John Levine (compilers-request) writes:
>[Most uses I have seen are for atomic operations.]

It is certainly the case that exchange is used for atomic operations.
But that is because there are no `ordinary' instructions that will do
the trick.  The other half of the question is ``what non-atomic
operations can make use of the exchange instruction?''  I have an
answer in two parts: First, example code that uses an abstract
exchange operation.  Second, details of the the iX86 `xchg' operation.


* Example Code

Here is code that the compiler could recognize as using exchange.  I
saw this code yesterday in a data compression algorithm:

	t := prev_val;
	prev_val := a[i];
	a[i] := t;

A reasonable implementation could make use of exchange, and a simple
peephole optimizer could replace instances of

	mov rx, ry
	ld rx, addr
	st ry addr

with

	xchg rx, addr

(as long as `ry' is not used subsequently).  I can imagine a register
allocator that instead of doing spills and restores with

	mov r0, -46[fp]
	mov -50[fp], r0

figures out the lifetimes of the variables, figures out that they're
disjoint (after all, the stack slots got generated in the first place
becase there weren't enough empty registers...), creates one merged
stack slot, and replaces the above with

	xchg  r0, -44[fp]

This has advantages of (a) smaller code size and (b) better stack
locality (decreasing the stack size and increasing the utilization of
particular stack slots).  Note that no analysis of user code is needed
for this latter operation, the spills and restores are
compiler-generated.


* iX86 `XCHG'

The iX86 has a `lock' prefix that can be applied to several
instructions.  When those instructions are `lock'-prefixed, they are
guaranteed to behave atomically wrt memory.  On a multiprocessor, if
two processors simultaneously execute

	lock incl 4[r0]

then the value at 4[r0] is guaranteed to be incremented by 2 (one for
each processor).  The `xchg' instruction can have the `lock' prefix
applied to it but -- for reasons I know not what -- omitting the
`lock' prefix has exactly the same effect, namely the exchange
operation is atomic whether or not you use the `lock' prefix.

Thus, on the i486 reads and writes that hit in the cache never go
off-chip.  Except that all `xchg' instructions must assert a line off
chip saying ``don't touch (this) memory right now''.  The 3-cycle
estimate is optmistic -- on a multiprocessor with a heavily-loaded
bus, atomic bus operations take substantially longer.

To summarize: it isn't hard to think of cases when exchange could be
useful, but you don't want to do it on the iX86 family because the
exchange instruction is defined as having atomic semantics that must
be enforced at any cost.

		;-D on  ( Look, ma, no peephole! )  Pardo
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.