Path: utzoo!utgpu!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!spool.mu.edu!samsung!usc!zaphod.mps.ohio-state.edu!rpi!sarah!bingnews!kym
From: kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell)
Newsgroups: comp.arch
Subject: Re: standard extensions
Message-ID: <1991Feb25.201406.18643@bingvaxu.cc.binghamton.edu>
Date: 25 Feb 91 20:14:06 GMT
References: <3381.27c548c3@iccgcc.decnet.ab.com> <1991Feb25.135057.23667@linus.mitre.org>
Organization: State University of New York at Binghamton
Lines: 97

In article <1991Feb25.135057.23667@linus.mitre.org> bs@faron.mitre.org (Robert D. Silverman) writes:
>Yes. There always is assembler. However, the cost of calling an assembler
>routine to do a division returning both quotient and remainder is very
>expensive. In-lining mechanisms for assembler code are woefully inadequate
>in current languages. One can, of course, look at the intermediate assembler
>code generated by the compiler of a HLL and then adjust your assembler code
>so that register usage is correct, but this is very clumsy and must be redone
>every time you make a change to the HLL code.

I find it hard to criticize the inline-assembler facilities provided by
modern compilers -- e.g. gcc, as outlined in earlier posts to this group
and elsewhere (not that I would ever write inline assembler myself --
well, hardly ever).  At least some modern compilers include such code in
the (sometimes extensive) dataflow analysis they already perform on the
HLL code, even to the extent of allocating available registers for your
use inside the inline assembler (a small sketch follows below).  Low-level
tweaking of the kind described above would therefore seem to be at least
on the way out, if not already `passed over'.

But, on the other hand, why put assembler into a program at all?
Although I understand there _are_ a (very few) application areas where
even a 10% reduction in running time represents big bucks, I find it
hard to accept that the ability to return, say, the quotient and
remainder at the same time will make even _that_ much difference to
code that almost certainly includes more time-intensive instructions
in the mix.

For example, compare two tight loops: loop I with separate divide and
modulus/remainder instructions (whichever you need), loop II with some
combined instruction.  Suppose there are n other instructions in each
loop.  Discard pipe-flushing considerations and assume every instruction
takes a single cycle.  Loop I then has n+3 instructions (allowing one
extra move to fetch the quotient or remainder from whatever inconvenient
place it lands in), and loop II has n+1.  Iterate both loops N times:
loop I takes N(n+3) cycles, loop II takes N(n+1), so the ratio is
(n+3)/(n+1).  This exceeds 1.1 -- i.e. n+3 > 1.1(n+1), or 1.9 > 0.1n --
exactly when n < 19.

So, provided there are no more than about 20 instructions in the loop,
in this obviously _idealized_ setting the loop body runs 10% or more
faster given the fancy combined quot/rem facility.  And since our loop
is only _part_ of a larger program, the impact of the facility on the
total running time of the complete code will be less marked still.

But perhaps typical loops, especially those containing divide and
remainder instructions, are small?  Well, after looking through my own
source code I don't happen to find _any_ loops with both divide &
remainder in them -- I guess I tend to avoid it (blush).
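To make the gcc point above concrete, here is a minimal sketch.  It
assumes GNU C's extended-asm constraint syntax and a 386-style divide
(rather than the 68020 in the Sun3 example below), and the helper name
`udivmod' is mine, purely for illustration -- it is not anything from
this thread.  A single divl yields both quotient and remainder, and
the constraint strings let the compiler's own register allocator place
the operands:

------
static unsigned udivmod(unsigned dividend, unsigned divisor,
                        unsigned *rem)
{
        unsigned q, r;

        /* divide edx:eax (here 0:dividend) by divisor;
           the quotient lands in eax, the remainder in edx */
        __asm__("divl %4"
                : "=a" (q), "=d" (r)
                : "0" (dividend), "1" (0u), "rm" (divisor)
                : "cc");
        *rem = r;
        return q;
}
------

Change the surrounding HLL code and the compiler simply redoes the
register assignment -- exactly the step that must be "redone every
time" by hand in the scheme complained about above.  Back, though, to
the question of typical loop sizes.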
Below is an _example_ that I will not say is _typical_, but that is at
least indicative of the kind of code I'm looking for (it's an inner
loop from an FFT routine).  With -O, a Sun3/60's cc produces the code
that follows.  Strangely enough, there are about 20 instructions in
the loop.

------
for(xp=x,yp=y,zp=z; yp
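The listing breaks off above, so, purely as a stand-in, here is the
kind of inner loop that would fit: the pointer names xp, yp and zp
come from the fragment, but the types, the loop bound and the body are
my assumptions, not the original routine.  A complex multiply loop of
this shape compiles to very roughly 20 instructions (loads, multiplies,
adds, stores, pointer bumps and the compare/branch) on a 68020-class
machine:

------
typedef struct { double re, im; } cpx;

#define N 1024                  /* size assumed for illustration */
static cpx x[N], y[N], z[N];

void inner(void)
{
        cpx *xp, *yp, *zp;

        /* one complex multiply per iteration: 4 multiplies and
           2 add/subtracts, plus the memory traffic and the loop
           overhead around them */
        for (xp = x, yp = y, zp = z; yp < y + N; xp++, yp++, zp++) {
                zp->re = xp->re * yp->re - xp->im * yp->im;
                zp->im = xp->re * yp->im + xp->im * yp->re;
        }
}
------

Note that there is no divide or remainder in such a loop at all --
which rather supports the point that loops containing _both_ are thin
on the ground.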