Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!pasteur!ames!ucsd!rutgers!aramis.rutgers.edu!paul.rutgers.edu!jac
From: jac@paul.rutgers.edu (J. A. Chandross)
Newsgroups: comp.arch
Subject: Re: Independent Architecture Complilers
Message-ID:
Date: 22 Apr 89 03:35:08 GMT
References: <10441@polyslo.CalPoly.EDU> <424@bnr-fos.UUCP> <21331@prls.UUCP>
Distribution: comp
Organization: Rutgers Univ., New Brunswick, N.J.
Lines: 114

cquenel@polyslo.CalPoly.EDU (34 more school days) writes:
> >What if your machine only runs micro-code ?  (This is not an idle
> >question).
>
weaver@prls.UUCP (Michael Weaver)
> If your machine runs only microcode, it will generally be much simpler
> to generate code for it than a machine that uses microcode to implement
> an instruction set.

This is indeed the case.  Instruction sets are generally written once,
but executed many, many times.  To deliver the highest performance you
will likely want to write the microcode by hand.  Besides, most
microcoded instruction sets, even the VAX's, are relatively simple
compared to the features afforded by a true VLIW (i.e. horizontally
microcoded) machine.

However, if you want to generate user-customizable instruction sets, or
have user programs written entirely in microcode, you will run into the
problem of how to generate the microcode from a high-level language.
It is bad enough having to debug the hardware with hand-written
programs; forcing users to write in microcode means the top executives
of your company are going to be selling real estate in 6 months.

However, programming disadvantages aside, high-performance microcoded
machines are likely to be the wave of the future.  It is only with
microcoded machines that you can take maximal advantage of your hardware.
The RISC machines have merely proven what microarchitects have known
since time immemorial: keep it single cycle, don't put a feature in if
it will slow things down (even if your marketing people insist), don't
put it in if you can make better use of the hardware, use parallelism to
improve performance, keep the hardware busy all of the time, etc.  And
the devil take anyone who wants to program it by hand.  (Of course,
there are additional issues for microprogrammed machines: leave out
pipelining, because it makes it hard to write compilers for the machine
and introduces needless complexity; handle branches intelligently; etc.)

I'll construct a hypothetical machine to show what sort of performance
gains it delivers and to demonstrate the demands it places on the
compiler:

    2 ALUs, conventional design, driveable in parallel
    4 increment/decrement units
        operations: add/subtract {1,2,4,nothing} to register
    memory access unit:
        {read,write} {8,16,32} bits
        offset is {register, constant, none}
    branch unit:
        jump, call subroutine, return from subroutine
    registers:
        64 always accessible
        64 accessible only through ALU A
        64 accessible only through ALU B

The most efficient code will use all these resources at the same time.
Any compiler that generates code for such a machine will require some
sort of data flow analysis to determine how the various fields (i.e. an
ALU op, a branch, etc.) can be compacted together to produce optimal
code.  For instance, the sequence:

    while(foo->next != NULL) {
        foo = foo->next;
        bar++;
    }

could compile into code like:

        R0 = foo
        R1 = offset for next
    loop:
        alu_1(compare(R0, NULL))
        branch(equal, done);
        R0 = read(R0 + R1, Long)
        increment(R2,1)
        goto loop;
    done:

But this is extremely inefficient.
Instead, we can compact it to a 2 instruction loop:

    loop:
        alu_1(compare(R0, NULL))  branch(equal, done);
        R0 = read(R0 + R1, Long)  increment(R2,1)  goto loop;
    done:

Now when you add in the complexity of folding in the instructions before
and after the loop, the compiler must understand a great deal about the
target machine.  After all, you now have scheduling problems.  Recall
that some registers are only accessible on certain ALUs.  (These would
be used to store commonly used constants.)  You can also have resource
conflicts if various fields in your instruction are overlapped.  For
instance, you might discover that you typically do either 1 ALU
operation and a memory operation, or 2 ALU operations.  This would allow
you to overlap the field for a memory operation with one of the ALU
fields.

The problem grows as you add hardware.  However, you can get performance
with this sort of machine that you couldn't get out of a RISC chip.

While the compiler problems are large, they are not insurmountable.
Compilers have been written that generate tolerable code for machines
like this.  You need look no farther than the Multiflow or ELI-512 for
proof.

It is not clear to me exactly what model the current crop of commercial
retargetable microcode compilers use.  The research ones, i.e. the only
ones that reveal their private parts to the world, tend to take a
simplistic view of the world.  I suspect that the commercial ones are
more hype than substance, although I would be delighted to be proven
wrong.

Jonathan A. Chandross
Internet: jac@paul.rutgers.edu
UUCP: rutgers!paul.rutgers.edu!jac
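
As a footnote to the compaction example above, here is a minimal sketch
in C of the simplest form the compaction step could take: a greedy pass
that drops each micro-op into the earliest instruction word where its
dependences are satisfied and a functional unit of the right kind is
still free.  Everything named here (struct op, struct dep, compact,
loop_body, the distance values) is hypothetical and invented for
illustration.  In particular it assumes that a branch may test a
condition produced by an ALU in the same word, and that the read and the
increment must sit one word after the conditional exit.  It ignores the
register-bank restrictions and overlapped fields discussed above, so it
is nowhere near what a real microcode compiler has to do.

```c
#include <assert.h>

/* Functional-unit classes of the hypothetical machine in the post. */
enum unit { ALU, INCDEC, MEM, BRANCH, NUNITS };

/* Units available per instruction word:
   2 ALUs, 4 increment/decrement units, 1 memory unit, 1 branch unit. */
static const int unit_count[NUNITS] = { 2, 4, 1, 1 };

#define MAX_OPS  16
#define MAX_DEPS 4

/* A dependence on an earlier op: this op must be placed at least
   `distance` words after op number `on` (0 = same word is allowed).
   The distances are assumptions, not properties of any real machine. */
struct dep { int on; int distance; };

struct op {
    const char *text;           /* micro-op as written in the post */
    enum unit   unit;           /* which kind of unit executes it  */
    int         ndeps;          /* number of valid entries in deps */
    struct dep  deps[MAX_DEPS];
};

/* Greedy compaction: walk the ops in program order and put each one in
   the earliest word that satisfies its dependences and still has a free
   unit of the right class.  word_of[i] receives the word index of op i;
   the return value is the number of words used. */
int compact(const struct op *ops, int nops, int *word_of)
{
    int used[MAX_OPS][NUNITS] = { { 0 } };
    int nwords = 0;

    assert(nops <= MAX_OPS);
    for (int i = 0; i < nops; i++) {
        enum unit u = ops[i].unit;
        int w = 0;

        /* Earliest word allowed by the dependences. */
        for (int d = 0; d < ops[i].ndeps; d++) {
            const struct dep *dp = &ops[i].deps[d];
            assert(dp->on >= 0 && dp->on < i);
            if (word_of[dp->on] + dp->distance > w)
                w = word_of[dp->on] + dp->distance;
        }
        /* Slide forward until a unit of this class is free. */
        for (;;) {
            assert(w < MAX_OPS);
            if (used[w][u] < unit_count[u])
                break;
            w++;
        }
        used[w][u]++;
        word_of[i] = w;
        if (w + 1 > nwords)
            nwords = w + 1;
    }
    return nwords;
}

/* The five micro-ops of the loop body, in program order. */
const struct op loop_body[] = {
    { "alu_1(compare(R0, NULL))", ALU,    0, { { 0, 0 } } },
    { "branch(equal, done)",      BRANCH, 1, { { 0, 0 } } },   /* needs compare  */
    { "R0 = read(R0 + R1, Long)", MEM,    1, { { 1, 1 } } },   /* after the exit */
    { "increment(R2,1)",          INCDEC, 1, { { 1, 1 } } },   /* after the exit */
    { "goto loop",                BRANCH, 2, { { 2, 0 }, { 3, 0 } } },
};
```

Calling compact(loop_body, 5, word_of) under these assumptions puts the
compare and the conditional branch in word 0 and the read, the increment
and the goto in word 1, i.e. the same two-instruction loop as above.
The single branch unit is what keeps the goto out of word 0.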