Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!pasteur!ames!ucsd!rutgers!aramis.rutgers.edu!paul.rutgers.edu!jac
From: jac@paul.rutgers.edu (J. A. Chandross)
Newsgroups: comp.arch
Subject: Re: Independent Architecture Compilers
Message-ID: <Apr.21.23.35.07.1989.2593@paul.rutgers.edu>
Date: 22 Apr 89 03:35:08 GMT
References: <10441@polyslo.CalPoly.EDU> <424@bnr-fos.UUCP> <21331@prls.UUCP>
Distribution: comp
Organization: Rutgers Univ., New Brunswick, N.J.
Lines: 114

cquenel@polyslo.CalPoly.EDU (34 more school days) writes:
>
>What if your machine only runs micro-code ?  (This is not an idle
>question).  
>

weaver@prls.UUCP (Michael Weaver) writes:
> If your machine runs only microcode, it will generally be much simpler 
> to generate code for it than a machine that uses microcode to implement
> an instruction set.

This is indeed the case.

Instruction sets are generally written once but executed many, many times.
To deliver the highest performance you will likely want to write the
microcode by hand. Besides, most microcoded instruction sets, even the
VAX's, are relatively simple compared to the features afforded by a true
VLIW (i.e., horizontally microcoded) machine.

However, if you want to generate user-customizable instruction sets, 
or have user programs written entirely in microcode, you will run into 
the problem of how to generate the microcode from a high-level language.
It is bad enough having to debug the hardware with hand-written programs;
forcing users to write in microcode means the top executives of your 
company are going to be selling real estate in six months.

However, programming disadvantages aside, high-performance microcoded 
machines are likely to be the wave of the future. Only with microcoded
machines can you take maximal advantage of your hardware.

The RISC machines have merely proven what microarchitects have known
since time immemorial: keep it single-cycle; don't put a feature in
if it will slow things down (even if your marketing people insist);
don't put it in if you can make better use of the hardware; use
parallelism to improve performance; keep the hardware busy all of
the time; etc. And the devil take anyone who wants to program it by
hand.

(Of course, there are additional issues for microprogrammed machines, such
as: leave out pipelining, because it makes it hard to write compilers for
the machine and introduces needless complexity; handle branches
intelligently; etc.)

I'll construct a hypothetical machine to show what sort of performance
gains it delivers and to demonstrate the demands it places on the compiler:

2 ALUs, conventional design, drivable in parallel
4 increment/decrement units; operations:
	add/subtract {1,2,4,nothing} to register
memory access unit:
	{read,write} {8,16,32} bits; offset is {register, constant, none}
branch unit:
	jump, call subroutine, return from subroutine
registers:
	64 always accessible
	64 accessible only through ALU A
	64 accessible only through ALU B
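One way to picture a single instruction word for this machine is as a record with a slot for every unit, any of which may sit idle in a given cycle. The sketch below is a minimal illustration in C, assuming a decoded (not bit-packed) layout; the enum names, field names, and slot types are all hypothetical and are not a real encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded instruction word for the example machine.
   Every field name and encoding here is illustrative only. */
enum alu_op   { ALU_NOP, ALU_ADD, ALU_SUB, ALU_CMP };
enum step     { STEP_NONE, STEP_1, STEP_2, STEP_4 };  /* add/sub 1, 2, 4, or nothing */
enum mem_op   { MEM_NOP, MEM_READ, MEM_WRITE };
enum mem_size { SIZE_8, SIZE_16, SIZE_32 };
enum offset   { OFF_NONE, OFF_REG, OFF_CONST };
enum br_op    { BR_NOP, BR_JUMP, BR_CALL, BR_RETURN };

struct alu_slot { enum alu_op op; uint8_t dst, a, b; };
struct inc_slot { enum step step; bool decrement; uint8_t reg; };
struct mem_slot {
    enum mem_op op;
    enum mem_size size;
    enum offset off;
    uint8_t base, index, data;   /* index is used when off == OFF_REG */
    int32_t disp;                /* used when off == OFF_CONST */
};
struct br_slot { enum br_op op; uint32_t target; };

/* One slot per functional unit; an idle unit holds its NOP/NONE value. */
struct word {
    struct alu_slot alu[2];   /* ALU A and ALU B, driven in parallel */
    struct inc_slot inc[4];   /* four increment/decrement units */
    struct mem_slot mem;      /* memory access unit */
    struct br_slot  br;       /* branch unit */
};

/* Count how many of the eight units this word keeps busy. */
static int busy_units(const struct word *w)
{
    int n = 0;
    for (int i = 0; i < 2; i++)
        n += w->alu[i].op != ALU_NOP;
    for (int i = 0; i < 4; i++)
        n += w->inc[i].step != STEP_NONE;
    n += w->mem.op != MEM_NOP;
    n += w->br.op != BR_NOP;
    return n;
}
```

Seen this way, a compiler for the machine is trying to push `busy_units` toward eight on every word it emits.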

The most efficient code will use all these resources at the same time.
Any compiler that generates code for such a machine will require some
sort of data flow analysis to determine how the various fields (i.e., an
ALU op, a branch, etc.) can be compacted together to produce optimal code.
For instance, the sequence:

	while(foo != NULL) {
		foo = foo->next;
		bar++;
		}

Could compile into code like: 

R0 = foo
R1 = offset for next
R2 = bar

	loop:	alu_1(compare(R0, NULL))
		branch(equal, done);
		R0 = read(R0 + R1, Long)
		increment(R2,1)
		goto loop;
	done:

But this is extremely inefficient. Instead, we can compact it into a
two-instruction loop:
	loop:	alu_1(compare(R0, NULL)) branch(equal, done);
		R0 = read(R0 + R1, Long) increment(R2,1) goto loop;
	done:
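The compaction step can be sketched as a greedy pass that walks the operations in order and packs each one into the current instruction word when its unit is free and it does not touch a register written earlier in the same word. This is a minimal illustration, not any particular compiler's method: the unit capacities, the `NO_REG` convention, and the rule that a branch closes the word are assumptions. Condition codes are not tracked, on the assumption (implied by the compacted loop) that the branch unit can test a compare issued in the same word. It is plain local compaction, much weaker than the global trace scheduling used on machines like the Multiflow.

```c
#include <stdbool.h>

/* Hypothetical units and how many of each one word can drive. */
enum unit { ALU_A, ALU_B, INCDEC, MEM, BRANCH, NUNITS };
static const int capacity[NUNITS] = { 1, 1, 4, 1, 1 };

#define NO_REG (-1)

struct op {
    const char *text;   /* printable form, for humans only */
    enum unit unit;
    int dst;            /* register written, or NO_REG */
    int src[2];         /* registers read, or NO_REG */
};

/* True if `later` reads or writes a register that `earlier` writes.
   Operands are assumed to be read at the start of the cycle, so a later
   op writing a register that an earlier op only reads is allowed. */
static bool depends(const struct op *earlier, const struct op *later)
{
    if (earlier->dst == NO_REG)
        return false;
    return later->dst == earlier->dst ||
           later->src[0] == earlier->dst ||
           later->src[1] == earlier->dst;
}

/* Greedy in-order compaction. word[i] receives the index of the
   instruction word that op i lands in; returns the number of words. */
static int compact(const struct op *ops, int n, int *word)
{
    int used[NUNITS] = {0};
    int start = 0;      /* first op in the current word */
    int w = -1;         /* index of the current word */
    bool open = false;  /* can the current word accept more ops? */

    for (int i = 0; i < n; i++) {
        bool fits = open && used[ops[i].unit] < capacity[ops[i].unit];
        for (int j = start; fits && j < i; j++)
            if (depends(&ops[j], &ops[i]))
                fits = false;
        if (!fits) {                     /* start a fresh word */
            w++;
            start = i;
            for (int u = 0; u < NUNITS; u++)
                used[u] = 0;
            open = true;
        }
        used[ops[i].unit]++;
        word[i] = w;
        if (ops[i].unit == BRANCH)       /* control transfer ends the word */
            open = false;
    }
    return w + 1;
}
```

Fed the five operations of the uncompacted loop in order, this pass puts the compare and conditional branch in word 0 and the read, increment, and `goto` in word 1, reproducing the two-instruction loop.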

Now when you add in the complexity of folding in the instructions before
and after the loop, the compiler must understand a great deal about the
target machine. After all, you now have scheduling problems. Recall
that some registers are only accessible on certain ALUs. (These would
be used to store commonly used constants.) You can also have resource
conflicts if various fields in your instruction are overlapped. For
instance, you might discover that you typically do either one ALU operation
and a memory operation, or two ALU operations. This would allow you to
overlap the field for a memory operation with one of the ALU fields. The
problem grows as you add hardware. However, you can get performance out of
this sort of machine that you couldn't get out of a RISC chip.
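The two constraints in the paragraph above, private register banks and overlapped fields, can be written as a legality check that a scheduler would run before accepting a word. This is a sketch under stated assumptions: the register numbering (0..63 shared, 64..127 ALU A only, 128..191 ALU B only) is hypothetical, and so is the choice that the memory field shares its bits with ALU B's field.

```c
#include <stdbool.h>

/* Hypothetical units and how many of each one word can hold. */
enum unit { ALU_A, ALU_B, MEM, INCDEC, BRANCH, NUNITS };
static const int slots[NUNITS] = { 1, 1, 1, 4, 1 };

struct op {
    enum unit unit;
    int regs[3];        /* register operands; -1 marks an unused operand */
};

/* Hypothetical numbering: 0..63 shared by every unit,
   64..127 reachable only by ALU A, 128..191 only by ALU B. */
static bool reachable(enum unit u, int reg)
{
    if (reg >= 0 && reg < 64)
        return true;
    if (reg >= 64 && reg < 128)
        return u == ALU_A;
    if (reg >= 128 && reg < 192)
        return u == ALU_B;
    return false;       /* no such register */
}

static bool word_is_legal(const struct op *ops, int n)
{
    int used[NUNITS] = {0};
    for (int i = 0; i < n; i++) {
        if (++used[ops[i].unit] > slots[ops[i].unit])
            return false;               /* unit already busy this cycle */
        for (int k = 0; k < 3; k++)
            if (ops[i].regs[k] >= 0 && !reachable(ops[i].unit, ops[i].regs[k]))
                return false;           /* register is in another unit's bank */
    }
    /* Assumed encoding: the memory op and ALU B share one field, so a word
       holds ALU A + memory or ALU A + ALU B, but never ALU B + memory. */
    if (used[MEM] && used[ALU_B])
        return false;
    return true;
}
```

Every rule of this kind is one more thing the code generator must consult, which is why the problem grows as hardware is added.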

While the compiler problems are large, they are not insurmountable.  
Compilers have been written that generate tolerable code for machines like
this.  You need look no farther than the Multiflow or ELI-512 for proof.

It is not clear to me exactly what model the current crop of commercial
retargetable microcode compilers use. The research ones (i.e., the only
ones that reveal their private parts to the world) tend to take a 
simplistic view of things. I suspect that the commercial ones are 
more hype than substance, although I would be delighted to be proven 
wrong.


Jonathan A. Chandross
Internet: jac@paul.rutgers.edu
UUCP: rutgers!paul.rutgers.edu!jac