Path: utzoo!utgpu!jarvis.csri.toronto.edu!clyde.concordia.ca!uunet!mstan!amull
From: amull@Morgan.COM (Andrew P. Mullhaupt)
Newsgroups: comp.sys.m88k
Subject: Re: Information wanted on m88000 Risc workstations
Summary: I'd feel great Tom! Just Great!
Keywords: 80386 m88000 Everex Opus UNIX DOS
Message-ID: <661@s5.Morgan.COM>
Date: 10 Jan 90 05:13:11 GMT
References: <641@s5.Morgan.COM> <25A64468.11498@paris.ics.uci.edu> <1879@xyzzy.UUCP>
Organization: Morgan Stanley & Co. NY, NY
Lines: 80

In article <1879@xyzzy.UUCP>, wood@dg-rtp.dg.com (Tom Wood) writes:
> In article <648@s5.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt) writes:
>
> >2. That ratio of Megaflops to MIPS sucks. Let me rephrase this. Given
> >that the 88000 is the only RISC chip with onboard floating support,
> >you've got to wonder why since it ends up being (relatively) so
> >slow.
>
> and later:
>
> > ...Right now I'm of a
> >mind to get the 88000 if I can get good UNIX and some kind of
> >floating point help. Otherwise, it's back to square one. Oh well.
>
> I'd like to entertain a discussion on the FP performance of the 88k.
> I have yet to see a compiler that takes advantage of the pipeline
> on this machine to any extent. Theoretically, you can have 5 FP adds
> and 6 FP multiplies going on at once (if I understand correctly, the total
> here is not 11, but 9: at most 5 FP adds or at most 6 FP multiplies and
> no more than 9 total). So how would you feel if someone were able to
> boost Mflops by a factor of say 3 (or better) by improving the compiler
> technology?
>
> Here's a sample of what I'm talking about. These are computed values
> for the Matrix multiply inner loop:
>
>       DO 10 J = 1,N
>    10 A(I,J) = A(I,J) + B(I,K)*C(K,J)
>
> Code Generation Technique     Cycles/iteration    Mflops
>
> Naive code                    19                  2.10
> Naive code, 2 unrolls         35/2                2.28
> Sophisticated, 4 unrolls      28/4                5.71
> Sophisticated, 8 unrolls      48/8                6.67
>
> Well, how 'bout it!?

A man after my own heart!
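
For anyone following along, here is a rough C sketch of what "4 unrolls"
means for the quoted inner loop. The function names and the flat-row
layout are assumptions made up for illustration; this is not Tom's code
or any compiler's actual output. It also assumes the two arrays do not
overlap. The point is only that the four multiplies in the unrolled body
are independent, so a scheduler is free to overlap them in the FP
pipeline. (As a side check, the Mflops column in the table is consistent
with 2 floating point operations per iteration at a 20 MHz clock, e.g.
40/19 = 2.10, though the quoted post does not state the clock rate.)

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward form of the inner loop: one multiply-add per trip.
 * a and c stand in for row I of A and row K of C; b stands in for B(I,K).
 * (Hypothetical names, chosen for this sketch only.) */
void row_update_naive(double *a, const double *c, double b, size_t n)
{
    assert(a != NULL && c != NULL);
    for (size_t j = 0; j < n; j++)
        a[j] += b * c[j];
}

/* The same loop unrolled by four. The four products do not depend on
 * each other, which is what lets a compiler keep several FP operations
 * in flight at once. Assumes a and c do not overlap. */
void row_update_unrolled4(double *a, const double *c, double b, size_t n)
{
    assert(a != NULL && c != NULL);
    size_t j = 0;

    for (; j + 4 <= n; j += 4) {
        double t0 = b * c[j];
        double t1 = b * c[j + 1];
        double t2 = b * c[j + 2];
        double t3 = b * c[j + 3];
        a[j]     += t0;
        a[j + 1] += t1;
        a[j + 2] += t2;
        a[j + 3] += t3;
    }

    /* Clean-up loop for the 0 to 3 elements left over. */
    for (; j < n; j++)
        a[j] += b * c[j];
}
```

Whether the unrolled form actually runs faster depends entirely on what
the code generator does with it, which is the whole argument here.
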
I just finished bitching and moaning at the local C experts because the
Sun 4 cc compiler produces the most stupid code I've (or, after they saw
it, they've) ever seen for the kind of loop unrolling you've described.
You actually give up a factor of three for no known reason! On the same
hardware, gcc takes full advantage of unrolled loops (e.g. Duff's
device). Too bad that there are situations which go the other way
'round.

Another case of local optimization where RISC is often vulnerable is
inlining memcpy (strncpy, etc.). You want to 'unroll' this guy into int
or even double transfers, but you've got to walk on eggs over alignment
to support the full semantics. The 386/486 boxes are pretty good at
this: the SCO UNIX compiler (cc) for the 386 inlines a handful of
standard library functions and then generates some pretty smart
assembler code. (I should point out that this behavior can be switched
on and off by command line argument and preprocessor pragma, so if you
depend on your own memcpy, etc., you won't get hurt by an overzealous
optimizer.)

Now consider this code running on the 486. It's well known that the 486
can run all the 386 code (well, if you've got a non-broken step 6 486
at least), but it is almost as well known that the code sequences which
are optimal for the 386 and for the 486 are sometimes different. There
is even the question of code generation for the Cyrix replacement for
the 80387 chip. It runs all the 80387 code unmodified, but there are
ways to get the Cyrix to go another factor of two faster by generating
different code. There are compilers and libraries which take advantage
of these situations, but I know of none for the 88000.

On the other hand, I have heard that the 88000 is someday going to have
a wider data path to its floating point pipelines. Sounds like a good
idea to me.
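
To make the memcpy point above concrete, here is a minimal C sketch of
the "walk on eggs for alignment" problem. The name copy_words and the
choice of a 32-bit word are assumptions for illustration only; this is
not what the SCO compiler or any real library emits. It uses word-sized
transfers only when source and destination share the same offset within
a word, and falls back to bytes otherwise. The fixed-size memcpy calls
are just the portable C way of writing a single word load and store; a
real inlined routine would use the machine's word instructions directly.
Like memcpy, it assumes the buffers do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical alignment-aware copy; same contract as memcpy
 * (non-overlapping buffers, returns dst). */
void *copy_words(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    unsigned char *d = dst;
    const unsigned char *s = src;
    const size_t w = sizeof(uint32_t);

    /* Word transfers are only usable when both pointers reach a word
     * boundary at the same time, i.e. they share the same offset. */
    if ((uintptr_t)d % w == (uintptr_t)s % w) {
        /* Head: copy bytes until the pointers are word aligned. */
        while (n > 0 && (uintptr_t)d % w != 0) {
            *d++ = *s++;
            n--;
        }
        /* Body: one aligned word per trip. */
        while (n >= w) {
            uint32_t word;
            memcpy(&word, s, w);   /* stands in for a word load  */
            memcpy(d, &word, w);   /* stands in for a word store */
            d += w;
            s += w;
            n -= w;
        }
    }

    /* Tail, or the whole copy when the offsets differ: bytes only. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Most of the code is head and tail handling rather than the fast path,
which is why getting this right inline is harder than it looks on a
machine that insists on aligned accesses.
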

So have you got a compiler which generates optimal code to get the
other factor of two or three out of my code? Remember - I've already
unrolled my loops, aligned my structures, and taken advantage of the
FORTRAN calling sequence. Just like the Linpack benchmarks.

Later,
Andrew Mullhaupt