Path: utzoo!attcan!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!cornell!uw-beaver!uw-june!pardo
From: pardo@june.cs.washington.edu (David Keppel)
Newsgroups: comp.lang.misc
Subject: Re: Machine-independent intermediate languages
Summary: 3 of them, the key is where/how you throw away info.
Keywords: IR languages virtual machine
Message-ID: <6004@june.cs.washington.edu>
Date: 9 Oct 88 18:49:07 GMT
References: <e2uEl#2ORJzO=eric@snark.UUCP> <853@goofy.megatest.UUCP> <905@sword.bellcore.com> <5933@june.cs.washington.edu> <30151@oliveb.olivetti.com> <815@super.ORG>
Reply-To: pardo@cs.washington.edu (David Keppel)
Organization: U of Washington, Computer Science, Seattle
Lines: 88

>[ if IR doesn't support your langauge's feature, it hurts ]

( This is a slight formalization of something I've "known" for a while
but had never put together.  I hope this is useful to everybody).

My compilers prof introduced me to the ideas of "union IR" and
"intersection IR".  A union IR has primitive descriptions for every
primitive in all of the languages that it is supposed to support.  An
intersection IR has primitives for only those primitives common to all
languages.  There are a couple things to get out of this:

* The union...intersection IR describes a range of primitives to
  support langauges.  These in turn define a language.
* A machine-independent IR spec says nothing about the primitives
  supported by the machine.  Thus you may be throwing out efficient
  primitives available on some machine.
* A full union IR is hard to build and maintain and is probably quite
  inefficient.
* An instersection IR requires mapping complex operators (e.g., the
  APL "transpose" operator) down to a concatentaion of simple
  primitives (e.g., function call, nested loops).  Unless the simple
  operations can be re-recognized as the complex operation, many
  efficient compilation techniques cannot be used. 

Note that gcc, for instance, has 2 IR's: one is a machine-independent
parse-tree construction, and one is a *very* machine dependent
register transfer langauge.  The parse-tree IR is very source-language
dependent, while the machine-dependent IR depends very little on the
source language.  Thus we have (about) 3 languages (and thus 3 virtual
machine specifications):

    source

	source -> machine-independent translator
    
    parse trees

	semantic analysis
	machine-independent -> machine dependent IR translator
    
    register transfer language

	optimization
	code generation

In an ideal world, we create a machine-independent specification of a
machine-dependent RTL.  Note that GNU has chosen to ignore
segment-register segmented architectures in thiers.  We also need to
define a machine-independent but language-*dependent* parse tree
representation.

* To write a new language you must create a new machine-dependent
  tree representation or bend your language to fit an existing one
  (e.g., you can probably parse Pascal on to a Modula-(n)
  representation pretty easily, but the C representaion would need
  to be extended to support nested procedures).  The more information
  you need to pass on to the back end (e.g., value ranges for integer
  variables), the better a fit you need between the source and the
  parse language.  You could probably wind up with one parse language
  for Ada, C, Modula, Pascal, but would have a different language for
  Lisp, Scheme, T, etc.

* The source language to parse language translator and the semantic
  analyzer must be seperate for each language.

* Translators from parse language to register transfer language can
  be shared among languages that use a common parse language.

* A single port of the register transfer language ports all languages,
  independent of which (machine-independent) parse language is used.

This discussion originally came up in the context of "I want a
machine-independent `distribution' IR that can't recreate the
source."  Note that neither of the above languages (parse or RTL) fill
this bill.  The problem is that the parse language, by definition,
contains enough information to recreate the original code.  By the
time you've gotten to the machine-generation language, you've thrown
out a lot of this info, but (I assert, no proof is forthcoming) that
you can only (safely, and without throwing away information needed for
optimization) throw away information from the parse language in a
machine-dependent way.

I hope this is as enlightening to you as it was to me...

	;-D on  ( Machine-independent salad )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo