Path: utzoo!news-server.csri.toronto.edu!rutgers!usc!elroy.jpl.nasa.gov!decwrl!world!iecc!compilers-sender
From: rcg@lpi.liant.com (Rick Gorton)
Newsgroups: comp.compilers
Subject: Re: Instruction reordering (scheduling) for SPARC
Keywords: optimize, design
Message-ID: <9103122023.AA14689@lpi.liant.com>
Date: 12 Mar 91 20:23:09 GMT
Sender: compilers-sender@iecc.cambridge.ma.us
Reply-To: rcg@lpi.liant.com (Rick Gorton)
Organization: Compilers Central
Lines: 72
Approved: compilers@iecc.cambridge.ma.us

Fair warning, this is a fairly lengthy response.
Peter Van Roy writes:
> 
> I am in the process of retargeting a compiler for the SPARC.  I am building
> an instruction reordering stage.  To achieve the best performance, I need
> information about the memory system and the pipeline structure of several
> implementations of the SPARC.

There is good news and bad news.  Bad news first.

The bad news is that the pipelining and instruction timing characteristics
depend on which silicon manufacturer built the chip, and in particular on
which chipset was used.  If you can GUARANTEE that all SPARCstation 1+
machines use chipset X and all SPARCstation 2's use chipset Y, and you don't
care about less-than-optimal performance on other chipsets, then getting the
information is merely a matter of asking the chip manufacturer for the
SPARCstation 1+ about the 1+, and the chip manufacturer for the 2 about
the 2.  It MAY actually be that different firms are manufacturing the CPUs.

The following is from a post by Michael Slater of Microprocessor Report.
He posted this to comp.arch on Dec. 31, 1990:

] LSI Logic's "Lightning" SPARC processor. Five-chip superscalar
] implementation, dispatches up to four instructions per clock. Uses out-of-
] order instruction execution, speculative execution, and register relabeling.
] 
] Texas Instruments' "Viking" SPARC processor. Superscalar and superpipelined,
] dispatches up to three instructions per clock. On-chip caches approximately
] 16 Kbytes each for instructions and data.
] 
] Cypress/ROSS Technology's "Pinnacle" SPARC processor. Superscalar, dispatches
] up to two instructions per clock cycle. On chip cache approximately 16
] Kbytes, external MMU and controller for second-level cache.
] 
] SPARC processors combining existing integer and floating-point units from
] Fujitsu and LSI Logic.

The good news is that there is SOME information in the SPARC Architecture
manual (Version 7) about instruction scheduling.  I can't seem to find the
specific section number right now, but the gist of it (as I recall it) was
that the IU and FPU can execute instructions simultaneously, which means
that you can get a win by scheduling IU instructions alternately with FPU
instructions.
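
To make the IU/FPU overlap concrete, here is a minimal sketch of a toy
timing model.  It is an illustration only, not taken from the manual or
from any chip's data sheet: it assumes one instruction issues per cycle,
that an FPU op keeps the FPU busy for a fixed number of cycles, that a
second FPU op must wait for the FPU to be free, and that no instruction
depends on another.  The function name cycles() and the 4-cycle "fmuld"
latency are made up for the example.

```python
def cycles(seq, fpu_latency):
    """Count cycles for seq under a toy in-order model (assumptions above).

    seq is a list of (unit, name) pairs, where unit is "iu" or "fpu".
    fpu_latency maps an FPU op name to the cycles it occupies the FPU.
    """
    clock = 0     # cycle at which the next instruction may issue
    fpu_free = 0  # cycle at which the FPU becomes free again
    for unit, name in seq:
        if unit == "fpu":
            clock = max(clock, fpu_free)       # stall until the FPU is free
            fpu_free = clock + fpu_latency[name]
        clock += 1                             # one issue slot per instruction
    return max(clock, fpu_free)                # wait for the last FPU op too


LATENCY = {"fmuld": 4}  # hypothetical latency, for illustration only

# Same eight instructions in two orders.
GROUPED = [("fpu", "fmuld")] * 2 + [("iu", "add")] * 6
ALTERNATING = ([("fpu", "fmuld")] + [("iu", "add")] * 3) * 2

print(cycles(GROUPED, LATENCY), cycles(ALTERNATING, LATENCY))
```

Under these assumptions the grouped order takes 11 cycles and the
alternating order takes 8, because the IU instructions run in the shadow
of the FPU op instead of waiting behind a stalled one.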

Now for specifics (where I have info):

>        How many cycles are needed to do a load and a store?
>        Is there any advantage (apart from needing only a single instruction
>        fetch) to the double-word loads and stores?

	CHIP                    Cycles per instruction
	                        LD      LDD     ST      STD
	LSI L64811:             2       3       3       4
	Cypress CY7C601:        2       3       3       4
	Fujitsu MB86901:        2       3       3       4
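
On the double-word question, the arithmetic from the table above can be
sketched as follows.  The numbers are the ones in the table; the names
CYCLES and double_word_saving() are invented for the example, and the
comparison assumes that two single-word accesses simply cost twice one
access, with no other overlap.

```python
# Cycle counts copied from the table in this post.
CYCLES = {
    "LSI L64811":      {"LD": 2, "LDD": 3, "ST": 3, "STD": 4},
    "Cypress CY7C601": {"LD": 2, "LDD": 3, "ST": 3, "STD": 4},
    "Fujitsu MB86901": {"LD": 2, "LDD": 3, "ST": 3, "STD": 4},
}


def double_word_saving(chip):
    """Return (load_saving, store_saving) in cycles for one double-word
    access versus two single-word accesses on the given chip."""
    c = CYCLES[chip]
    return (2 * c["LD"] - c["LDD"], 2 * c["ST"] - c["STD"])


for chip in CYCLES:
    print(chip, double_word_saving(chip))
```

By these figures an LDD saves 1 cycle over two LDs and an STD saves 2
cycles over two STs, on all three chips.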

The better news is that, yes, these 3 chipsets all happen to have the same
cycle times.  But you cannot guarantee this to be true in the future.  It
will be messy to write an instruction scheduler for a compiler which can
generate differently scheduled code for different chipsets by merely using a
different compile-time switch.  I think you will find that your biggest
performance gains will be in scheduling to fill stalls created by the slower
floating-point instructions, FDIV, FMUL, and FSQRT.
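
One way to keep the per-chipset mess contained is to drive the scheduler
from a latency table selected by the switch.  The sketch below is only an
illustration of that idea: the target names "chip_a" and "chip_b", the
latencies in FP_LATENCY, and the function fill_stall() are all
hypothetical and are not figures for any real SPARC implementation.  It
also assumes the caller has already established that the filler
instructions are independent of both the producer and the consumer.

```python
# Hypothetical per-target FP latencies; NOT real data for any chip.
FP_LATENCY = {
    "chip_a": {"fdivd": 6, "fmuld": 4},
    "chip_b": {"fdivd": 3, "fmuld": 2},
}


def fill_stall(producer, consumer, independents, target):
    """Move independent instructions between a slow FP producer and its
    consumer, using as many as the target's latency leaves room for."""
    latency = FP_LATENCY[target][producer]
    gap = max(latency - 1, 0)          # slots between producer and consumer
    filler = independents[:gap]        # instructions that hide the stall
    rest = independents[gap:]          # anything left goes after the consumer
    return [producer] + filler + [consumer] + rest


work = ["add", "ld", "sub", "st"]
print(fill_stall("fdivd", "faddd", work, "chip_a"))
print(fill_stall("fdivd", "faddd", work, "chip_b"))
```

With the made-up numbers above, the same input is scheduled differently
for the two targets: all four independent instructions go ahead of the
consumer on "chip_a", but only two on "chip_b".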

Hope this helps.

Richard Gorton               rcg@lpi.liant.com  (508) 626-0006
Language Processors, Inc.    Framingham, MA 01760
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.