Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!uunet!mstan!amull
From: amull@Morgan.COM (Andrew P. Mullhaupt)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Summary: I'd feel great Tom! Just Great!
Keywords: 80386 m88000 Everex Opus UNIX DOS
Message-ID: <661@s5.Morgan.COM>
Date: 10 Jan 90 05:13:11 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <1879@xyzzy.UUCP>
Organization: Morgan Stanley & Co. NY, NY
Lines: 80

In article <1879@xyzzy.UUCP>, wood@dg-rtp.dg.com (Tom Wood) writes:
> In article <648@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
> 
> >2. That ratio of Megaflops to MIPS sucks. Let me rephrase this. Given
> >that the 88000 is the only RISC chip with onboard floating support,
> >you've got to wonder why since it ends up being (relatively) so
> >slow. 
> 
> and later:
> 
> >		...Right now I'm of a
> >mind to get the 88000 if I can get good UNIX and some kind of 
> >floating point help. Otherwise, it's back to square one. Oh well.
> 
> I'd like to entertain a discussion on the FP performance of the 88k.
> I have yet to see a compiler that takes advantage of the pipeline
> on this machine to any extent.  Theoretically, you can have 5 FP adds
> and 6 FP multiplies going on at once (if I understand correctly, the total
> here is not 11, but 9: at most 5 FP adds or at most 6 FP multiplies and
> no more than 9 total).  So how would you feel if someone were able to
> boost Mflops by a factor of say 3 (or better) by improving the compiler 
> technology?
> 
> Here's a sample of what I'm talking about.  These are computed values
> for the Matrix multiply inner loop:
> 
> 	DO 10 J = 1,N
>     10	    A(I,J) = A(I,J) + B(I,K)*C(K,J)
> 
> Code Generation Technique      Cycles/iteration      Mflops
> 
>     Naive code                      19                 2.10
>     Naive code, 2 unrolls          35/2                2.28
>     Sophisticated, 4 unrolls       28/4                5.71
>     Sophisticated, 8 unrolls       48/8                6.67
> 
> Well, how 'bout it!?
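
For what it's worth, the Mflops column is consistent with counting two
floating point operations per iteration (one multiply, one add) at a
20 MHz clock. The clock rate is my assumption; the quoted post does not
state it. A minimal sketch of that arithmetic in C, with a hypothetical
helper name:

```c
/*
 * Mflops = flops per iteration * clock (MHz) / cycles per iteration.
 * The 20 MHz clock and the 2 flops per iteration used below are
 * assumptions that happen to reproduce the quoted table.
 */
double mflops(double clock_mhz, double flops_per_iter, double cycles_per_iter)
{
    return flops_per_iter * clock_mhz / cycles_per_iter;
}
```

Under those assumptions, mflops(20.0, 2.0, 19.0) comes to about 2.1 and
mflops(20.0, 2.0, 48.0 / 8.0) to about 6.67, in line with the table.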

A man after my own heart! I just finished bitching and moaning at
the local C experts because the Sun 4 cc compiler produces the
most stupid code I've (or, after they saw it, they've) ever seen for
the loop unrolling you've described. You actually give up a factor
of three for no known reason! On the same hardware, gcc takes full
advantage of unrolled loops (e.g. Duff's device).
Too bad there are situations which go the other way 'round.
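
To be concrete about the kind of unrolling I mean, here is a rough
sketch in C of Tom's inner loop unrolled four ways. The function name,
the stride arguments, and the use of restrict (which assumes a and c do
not overlap) are mine, for illustration only; this is not what any of
the compilers mentioned actually emit.

```c
#include <stddef.h>

/*
 * One pass of the inner loop: a[j*lda] += b * c[j*ldc] for j = 0..n-1.
 * lda and ldc are element strides (hypothetical parameters standing in
 * for the FORTRAN leading dimensions). restrict promises the compiler
 * that a and c do not overlap, which is an assumption of this sketch.
 */
void update_row_unrolled4(double *restrict a, size_t lda, double b,
                          const double *restrict c, size_t ldc, size_t n)
{
    size_t j = 0;

    /* Main body: all loads first, then four independent multiply-adds,
       so a pipelined FPU has work it can overlap. */
    for (; j + 4 <= n; j += 4) {
        double c0 = c[(j + 0) * ldc];
        double c1 = c[(j + 1) * ldc];
        double c2 = c[(j + 2) * ldc];
        double c3 = c[(j + 3) * ldc];
        double a0 = a[(j + 0) * lda];
        double a1 = a[(j + 1) * lda];
        double a2 = a[(j + 2) * lda];
        double a3 = a[(j + 3) * lda];

        a[(j + 0) * lda] = a0 + b * c0;
        a[(j + 1) * lda] = a1 + b * c1;
        a[(j + 2) * lda] = a2 + b * c2;
        a[(j + 3) * lda] = a3 + b * c3;
    }

    /* Cleanup: the zero to three iterations left over. */
    for (; j < n; j++)
        a[j * lda] += b * c[j * ldc];
}
```

Whether this buys anything depends entirely on how the compiler
schedules the loads and the multiply-adds, which is the whole complaint.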

Another case of local optimization where RISC is often vulnerable is
the inlining of memcpy (strncpy, etc.). You want to 'unroll' this guy
into int or even double transfers, but you've got to walk on eggs over
alignment to support the full semantics. The 386/486 boxes are pretty
good at this, and
the SCO UNIX compiler (cc) for the 386 inlines a handful of standard
library functions, and then generates some pretty smart assembler code.
(It is necessary to point out that the behavior can be switched in 
and out by command line argument and preprocessor pragma - so if you
depend on your own memcpy, etc., then you won't get hurt by an
overzealous optimizer...)
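
A rough sketch of the alignment dance, in C. This is my own
illustration, not the code SCO's cc generates: the name copy_aligned is
hypothetical, the word size is fixed at 32 bits, and the pointer casts
assume a compiler that tolerates this kind of type punning.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy n bytes from src to dst (non-overlapping), using 32-bit word
 * transfers where alignment allows and byte transfers elsewhere.
 * ASSUMPTION: the uint32_t casts rely on the compiler permitting this
 * type punning (e.g. strict aliasing analysis turned off); a library
 * memcpy is free to do this, portable user code strictly is not.
 */
void *copy_aligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    const size_t w = sizeof(uint32_t);

    /* Word copies only work if both pointers reach a word boundary
       at the same time, i.e. they share the same misalignment. */
    if ((uintptr_t)d % w == (uintptr_t)s % w) {
        /* Head: byte copies up to the first word boundary. */
        while (n > 0 && (uintptr_t)d % w != 0) {
            *d++ = *s++;
            n--;
        }

        /* Body: aligned word copies. */
        uint32_t *dw = (uint32_t *)d;
        const uint32_t *sw = (const uint32_t *)s;
        while (n >= w) {
            *dw++ = *sw++;
            n -= w;
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Tail, or the whole copy when the misalignments differ. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

The head and tail loops are the "eggs": skip either one and you get a
trap on a strict-alignment machine or the wrong bytes at the ends.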

Now consider this code running on the 486. It's well known that the
486 can run all the 386 code (well if you've got a non-broken step
6 486 at least), but it is also almost as well known that the code
sequences which are optimal for the 386 and 486 are sometimes different.
There is even the question of code generation for the Cyrix replacement
for the 80387 chip. It runs all the 80387 code unmodified, but there are
ways to get the Cyrix to go another factor of two faster by generating
different code. There are compilers and libraries that take advantage of
these situations, but I know of none for the 88000. 

On the other hand, I have heard that the 88000 is going to someday have
a wider data path to its floating point pipelines. Sounds like a good
idea to me.

So have you got a compiler which generates optimal code to get the
other factor of two or three out of my code? Remember - I've already
unrolled my loops, aligned my structures, and taken advantage of the
FORTRAN calling sequence. Just like the Linpack benchmarks.


Later,
Andrew Mullhaupt