Path: utzoo!censor!geac!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!know!news.cs.indiana.edu!ariel.unm.edu!ghostwheel.unm.edu!john
From: john@ghostwheel.unm.edu (John Prentice)
Newsgroups: comp.lang.fortran
Subject: Re: Missing the whole point (the Fortran vs. C debate)
Message-ID: <1990Dec6.193644.12920@ariel.unm.edu>
Date: 6 Dec 90 19:36:44 GMT
References: <28548@usc> <185951.1857@timbuk.cray.com> <TMB.90Dec5014523@bambleweenie57.ai.mit.edu> <1990Dec5.182145.2639@ariel.unm.edu> <9424@ncar.ucar.edu>
Sender: news@ariel.unm.edu (USENET News System)
Organization: University of New Mexico Math Dept., Albuquerque, NM
Lines: 123

In article <9424@ncar.ucar.edu> morreale@bierstadt.scd.ucar.edu (Peter Morreale) writes:
>In article <1990Dec5.182145.2639@ariel.unm.edu>, john@ghostwheel.unm.edu (John Prentice) writes:
>> 
>> Of couse, Cray Fortran has had constructs (the CDIR directives) for telling
>> the compiler to vectorize a loop since the beginning.  However, in general
>> I agree.  
>
>    The Cray Fortran compilers will vectorize *every* loop (which meets
>    vectorization criteria) by default.  The programmer doesn't need to
>    make any modifications to his code.  (although most do to obtain
>    increased performance, but non-portable constructs are not used or
>    needed)
>

This is true, but the problem is the vectorization criteria.  The Cray
compiler is much better at sensing when a loop is vectorizable than it
used to be, but one can still construct cases where it cannot resolve
what appear to it to be vector dependencies but in fact are not.  That
is why Cray provides the CDIR directives in the first place.  An
interesting flip side on the Cray is the short loop.  Often you need to
inhibit vectorization because the overhead exceeds the savings.  If you
loop with something like
             do 10 i=1,n
the compiler has no way to know that n is small and that the loop should
not be vectorized.  You have to go in and tell it what to do by hand.
The Convex compiler is a lot better at vectorizing than the Cray one, by
the way.  It can vectorize nested do-loops, for example, something Cray
has never been able to do.  But Cray has never been famous for its
software.
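
To make the short-loop trade-off concrete, here is a minimal sketch in
Python of a break-even estimate.  The cost model (a fixed vector startup
cost plus a per-element cost) and every cycle count in it are assumptions
chosen for illustration, not measured Cray figures.

```python
import math

def scalar_cycles(n, per_element=10.0):
    """Estimated cycles for n iterations in scalar mode (hypothetical cost)."""
    return n * per_element

def vector_cycles(n, startup=60.0, per_element=1.0):
    """Estimated cycles in vector mode: fixed startup plus per-element cost."""
    return startup + n * per_element

def break_even_length(startup=60.0, scalar_per_element=10.0,
                      vector_per_element=1.0):
    """Smallest trip count n at which the vector version is strictly faster."""
    gain = scalar_per_element - vector_per_element
    if gain <= 0:
        return None  # vectorizing never pays off under this model
    # Need startup + n*v < n*s, which is n > startup / gain.
    return math.floor(startup / gain) + 1
```

Under these made-up numbers the loop has to run at least 7 iterations
before vectorizing wins.  The compiler cannot see n at compile time, so
it cannot make this comparison; the programmer has to.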

>> With regard to expressing parallelism, the people at Myrias had
>> the easiest expression in Fortran that I have encountered yet.  If you
>> wanted a parallel do-loop, you said pardo instead of do.  That was it.  They
>
>    Sounds like a very non-portable construct.  The Cray method of
>    obtaining parallelism is to add directives which appear as Fortran
>    comment cards.   The directives are interpreted by source code
>    analyzers and translated into system calls.
>

No argument, the Myrias approach is not portable, but it is easy (of course
getting it to run efficiently may not be).  We handled the portability problem
by using the C preprocessor to select either do or pardo, depending on the
target computer.

>> 
>> I don't know how Cray does parallel constructs.  But I hardly know anyone
>> who tries to do serious parallel programming on the Cray.  Unless you are
>> among the chosen few who can get dedicated Cray time, it is not cost effective.
>
>     Hummm...  I do question this statement.  If I am executing a code
>     on a Cray which only runs for a few minutes, you bet, it's not cost
>     effective.
>
>     How about ocean and climate models which run for literally
>     *hundreds* of Cray CPU hours?  (say a thousand wallclock hours) If
>     I can reduce the turnaround time by a factor of 3 or 4, 6, is it
>     worth it?  If I can, perhaps I can increase the resolution of the
>     model in the first place.  Does better science result from
>     increased resolution?   In addition, I get 64bit results on the
>     Cray for every calculation in single precision mode.  Is this
>     important?
>
>     In the Cray, I can get subroutines, and/or do loop iterations 
>     passed across multiple CPUs.
>

I have no quarrel with the goal of using parallelism to reduce any measure
of the time required by a calculation.  It has just been my experience (and
that of my colleagues at Sandia) that you don't get there using a Cray unless
the system is idle but for your job.  I do know of people doing it at Los Alamos
and getting good results using the YMP, but again, these people virtually own
those systems.  My point is not whether there is an
advantage to exploiting parallelism (quite the opposite!), it is whether the
Cray is the system to do it on.  By the way, our applications take hundreds
of hours of Cray time also and our limitation is not wall clock time, it is
money.  Even using cheap DOE lab Cray computer time, these calculations
often cost us $50,000.  Cray parallelism has not helped there usually because
the only cost advantage it offers is reduction of the memory integral.  We
get charged per processor, so if we have N processors and the wall clock
time is now N times less (which it won't be obviously), the summed CPU time
is the same.  All that has been saved is the memory integral since it is
a shared memory system.  However, we have not been able to get all the 
processors at one time typically (on a crowded system), so we lose out there
too.  A final point: resolution is limited just as seriously by the
unavailability of memory and disk as by the time it takes to run a
calculation.  Our finite difference codes use
meshes with many million cells, each carrying 20 or so variables.  You
don't have to dump very many cycles before you exceed the available disk.
This is a problem facing all big computing that has not been adequately
addressed.
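
The accounting argument above can be sketched in a few lines of Python.
This is only an illustration: the charging scheme (CPU billed per
processor-hour, memory billed on memory held times wall-clock hours),
the rates, and the function names are my assumptions, not any lab's
actual algorithm.  The dump estimate assumes 8-byte (64-bit) words.

```python
def parallel_charge(cpu_hours, mem_mwords, n_procs, efficiency,
                    cpu_rate, mem_rate):
    """Return (wall_hours, cpu_charge, memory_charge) for a job under a
    hypothetical scheme: CPU is billed per processor-hour and memory is
    billed on the memory integral (memory held times wall-clock hours)."""
    speedup = n_procs * efficiency           # efficiency = 1.0 is ideal
    wall_hours = cpu_hours / speedup
    summed_cpu_hours = wall_hours * n_procs  # every processor is billed
    cpu_charge = summed_cpu_hours * cpu_rate
    memory_charge = mem_mwords * wall_hours * mem_rate
    return wall_hours, cpu_charge, memory_charge

def dump_bytes(cells, vars_per_cell=20, bytes_per_word=8):
    """Size in bytes of one dump of the whole mesh."""
    return cells * vars_per_cell * bytes_per_word
```

With ideal speedup on 4 processors the CPU charge is identical to the
serial run and only the memory charge drops; with any efficiency below
1.0 the CPU charge goes up.  For a hypothetical mesh of 5 million cells,
dump_bytes(5000000) gives 800,000,000 bytes per dump.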

    
>> We tried it on the Cray 2 and on the YMP a couple years ago.  We were rarely
>> able to get all the processors at one time.  Also, you don't have very
>> many processors.  
>
>     So?  On a Connection machine with 64k processors, you only get the
>     parallel region of the code executed on those processors.  The
>     serial portion of the code is excecuted on a front-end.  For a
>     highly parallel code, you get good results, for a "typical" user
>     code, you get front-end speeds.  
>

Again, no question about it.   However, the sorts of calculations you
quoted earlier (weather, etc...) ARE highly parallelizable.  This is the
same argument people use against vectorization, yet I don't see any
massive rush from the scientific community to abandon it.  For the
"typical" user code, yes, you do get front-end speeds.  But given that
workstations give floating point performance within a factor of 2 to 5 of
a single processor YMP (for non-vectorized code, which the "typical"
user code is), why does the "typical" user need a Cray?  
The Cray is dynamite for the really big calculation that vectorizes like crazy 
or that requires tons of memory, but that is the exception, not the rule for 
"typical" users (who after all are most of the people out there).  The old
comment we used to make at Los Alamos was that 10% of the people in the
lab used 90% of the computing resources. 

John Prentice
Amparo Corporation
Albuquerque, NM

john@unmfys.unm.edu