Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!think.com!mintaka!spdcc!iecc!compilers-sender From: pardo@cs.washington.edu (David Keppel) Newsgroups: comp.compilers Subject: Re: C Compilers which use full 486 capabilities Keywords: 486, optimize Message-ID: <15501@june.cs.washington.edu> Date: 18 Mar 91 17:46:49 GMT References: <9103130754.AA19293@gara.une.oz.au> <1991Mar15.173551.13702@rice.edu> Sender: compilers-sender@iecc.cambridge.ma.us Reply-To: pardo@cs.washington.edu (David Keppel) Organization: University of Washington, Computer Science, Seattle Lines: 91 Approved: compilers@iecc.cambridge.ma.us >mason+@transarc.com (Tony Mason) writes: >>[quoting from Dr. Dobbs] >>[Old optimization rules fail on the 486, where, a move to or from memory >> takes one cycle, but exchanging two registers takes three cycles.] preston@ariel.rice.edu (Preston Briggs) writes: >Does anyone's compiler generate an exchange instruction? >I can visualize uses, but I'm not sure how I'd recognize it. >How would it integrate with register allocation? John Levine (compilers-request) writes: >[Most uses I have seen are for atomic operations.] It is certainly the case that exchange is used for atomic operations. But that is because there are no `ordinary' instructions that will do the trick. The other half of the question is ``what non-atomic operations can make use of the exchange instruction?'' I have an answer in two parts: First, example code that uses an abstract exchange operation. Second, details of the the iX86 `xchg' operation. * Example Code Here is code that the compiler could recognize as using exchange. I saw this code yesterday in a data compression algorithm: t := prev_val; prev_val := a[i]; a[i] := t; A reasonable implementation could make use of exchange, and a simple peephole optimizer could replace instances of mov rx, ry ld rx, addr st ry addr with xchg rx, addr (as long as `ry' is not used subsequently). I can imagine a register allocator that instead of doing spills and restores with mov r0, -46[fp] mov -50[fp], r0 figures out the lifetimes of the variables, figures out that they're disjoint (after all, the stack slots got generated in the first place becase there weren't enough empty registers...), creates one merged stack slot, and replaces the above with xchg r0, -44[fp] This has advantages of (a) smaller code size and (b) better stack locality (decreasing the stack size and increasing the utilization of particular stack slots). Note that no analysis of user code is needed for this latter operation, the spills and restores are compiler-generated. * iX86 `XCHG' The iX86 has a `lock' prefix that can be applied to several instructions. When those instructions are `lock'-prefixed, they are guaranteed to behave atomically wrt memory. On a multiprocessor, if two processors simultaneously execute lock incl 4[r0] then the value at 4[r0] is guaranteed to be incremented by 2 (one for each processor). The `xchg' instruction can have the `lock' prefix applied to it but -- for reasons I know not what -- omitting the `lock' prefix has exactly the same effect, namely the exchange operation is atomic whether or not you use the `lock' prefix. Thus, on the i486 reads and writes that hit in the cache never go off-chip. Except that all `xchg' instructions must assert a line off chip saying ``don't touch (this) memory right now''. The 3-cycle estimate is optmistic -- on a multiprocessor with a heavily-loaded bus, atomic bus operations take substantially longer. To summarize: it isn't hard to think of cases when exchange could be useful, but you don't want to do it on the iX86 family because the exchange instruction is defined as having atomic semantics that must be enforced at any cost. ;-D on ( Look, ma, no peephole! ) Pardo -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.