Path: utzoo!news-server.csri.toronto.edu!rutgers!usc!elroy.jpl.nasa.gov!decwrl!world!iecc!compilers-sender
From: rcg@lpi.liant.com (Rick Gorton)
Newsgroups: comp.compilers
Subject: Re: Instruction reordering (scheduling) for SPARC
Keywords: optmize, design
Message-ID: <9103122023.AA14689@lpi.liant.com>
Date: 12 Mar 91 20:23:09 GMT
Sender: compilers-sender@iecc.cambridge.ma.us
Reply-To: rcg@lpi.liant.com (Rick Gorton)
Organization: Compilers Central
Lines: 72
Approved: compilers@iecc.cambridge.ma.us

Fair warning, this is a fairly lengthy response.

Peter Van Roy writes:
>
> I am in the process of retargeting a compiler for the SPARC.  I am building
> an instruction reordering stage.  To achieve the best performance, I need
> information about the memory system and the pipeline structure of several
> implementations of the SPARC.

There is good news and bad news.  Bad news first.

The bad news is that the pipelining and instruction timing characteristics
depend upon which silicon manufacturer built the chip, and in particular,
which chipset was used.  If you can GUARANTEE that all SPARCstation 1+
machines use chipset X and all SPARCstation 2's use chipset Y, and you don't
care about possibly suboptimal performance on other chipsets, then getting
the information is merely a matter of talking to the chip manufacturer for
the SPARCstation 1+ for the 1+ info, and to the chip manufacturer for the
SPARCstation 2 for the 2 info.  It MAY actually be that different firms are
manufacturing the CPUs.

The following is from a post by Michael Slater of Microprocessor Report.
He posted this to comp.arch on Dec. 31, 1990:

] LSI Logic's "Lightning" SPARC processor.  Five-chip superscalar
] implementation, dispatches up to four instructions per clock.  Uses
] out-of-order instruction execution, speculative execution, and register
] relabeling.
]
] Texas Instruments' "Viking" SPARC processor.  Superscalar and
] superpipelined, dispatches up to three instructions per clock.  On-chip
] caches approximately 16 Kbytes each for instructions and data.
]
] Cypress/ROSS Technology's "Pinnacle" SPARC processor.  Superscalar,
] dispatches up to two instructions per clock cycle.  On chip cache
] approximately 16 Kbytes, external MMU and controller for second-level
] cache.
]
] SPARC processors combining existing integer and floating-point units from
] Fujitsu and LSI Logic.

The good news is that there is SOME information in the SPARC Architecture
manual (Version 7) about instruction scheduling.  I can't seem to find the
specific section number right now, but the gist of it (as I recall it) was
that the IU and FPU can execute instructions simultaneously, which means
that you can get a win by interleaving IU instructions with FPU
instructions.

Now for specifics (where I have info):

> How many cycles are needed to do a load and a store?
> Is there any advantage (apart from needing only a single instruction
> fetch) to the double-word loads and stores?

                        Cycle Times
CHIP                    LD    LDD   ST    STD
LSI L64811:             2     3     3     4
Cypress CY7C601:        2     3     3     4
Fujitsu MB86901:        2     3     3     4

The better news is that, yes, these 3 chipsets all happen to have the same
cycle times.  But you cannot guarantee this to be true in the future.  It
will be messy to write an instruction scheduler for a compiler which can
generate differently scheduled code for different chipsets by merely using
a different compile-time switch.

I think you will find that your biggest performance gains will come from
scheduling to fill the stalls created by the slower floating point
instructions: FDIV, FMUL, and FSQRT.

Hope this helps.

Richard Gorton               rcg@lpi.liant.com  (508) 626-0006
Language Processors, Inc.    Framingham, MA 01760
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.
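
As a rough illustration of the cycle-time table in this post, here is a
minimal sketch of how a scheduler might keep those numbers in a per-chipset
table and use them to answer the double-word question.  Only the cycle
counts come from the post.  The names CYCLES, memory_cycles, and
doubleword_saving are hypothetical, and the cost model assumes cycle counts
simply add with no overlap between instructions, which a real pipeline may
not honor.

```python
# Cycle counts copied from the table in the post. Everything else here
# (names, the purely additive cost model) is a hypothetical illustration.
CYCLES = {
    "LSI L64811":      {"LD": 2, "LDD": 3, "ST": 3, "STD": 4},
    "Cypress CY7C601": {"LD": 2, "LDD": 3, "ST": 3, "STD": 4},
    "Fujitsu MB86901": {"LD": 2, "LDD": 3, "ST": 3, "STD": 4},
}


def memory_cycles(chip, ops):
    """Sum the table's cycle counts for a sequence of memory ops.

    Assumes no overlap between instructions (a simplification).
    """
    table = CYCLES[chip]
    return sum(table[op] for op in ops)


def doubleword_saving(chip):
    """Cycles saved by one double-word op versus two single-word ops."""
    t = CYCLES[chip]
    return {
        "load": 2 * t["LD"] - t["LDD"],
        "store": 2 * t["ST"] - t["STD"],
    }


if __name__ == "__main__":
    for chip in CYCLES:
        print(chip, doubleword_saving(chip))
```

Under that additive assumption, one LDD (3 cycles) beats two LDs (4 cycles)
by one cycle, and one STD (4 cycles) beats two STs (6 cycles) by two cycles
on all three chipsets listed.  Keeping the counts in a table keyed by
chipset is one way to confine the per-chipset differences to data rather
than scheduler logic, should future chips diverge.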