Xref: utzoo soc.culture.japan:5441 comp.sys.super:237
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!ncar!noao!arizona!rick
From: rick@cs.arizona.edu (Rick Schlichting)
Newsgroups: soc.culture.japan,comp.sys.super
Subject: Kahaner report -- 4th ISR Supercomputing Workshop, Hakone Japan.
Message-ID: <102@saguaro.cs.arizona.edu>
Date: 2 Oct 90 03:54:07 GMT
Followup-To: soc.culture.japan
Organization: U of Arizona CS Dept, Tucson
Lines: 404


  [Dr. David Kahaner is a numerical analyst visiting Japan for two years
   under the auspices of the Office of Naval Research-Far East (ONRFE).  
   The following is the professional opinion of David Kahaner and in no 
   way has the blessing of the US Government or any agency of it.  All 
   information is dated and of limited lifetime.  This disclaimer should 
   be noted on ANY attribution.]

To: Distribution
From: David K. Kahaner ONRFE [kahaner@xroads.cc.u-tokyo.ac.jp]
      Tony F. Chan UCLA [chan@math.ucla.edu]
Re: 4th ISR Supercomputing Workshop 29-31 August 1990, Hakone, Japan.  
Date: 27 Sept 1990


ABSTRACT. This report describes the 4th ISR Supercomputing Workshop: The 
Road to Parallel Applications, held from August 29 to 31, 1990 in Hakone, 
Japan.  In addition, some observations on the trends and characteristics 
of parallel supercomputing research in Japan are presented.  

Most of the text of this report was prepared by Professor T. F. Chan 
Dept. of Mathematics, Univ. of Calif. at Los Angeles, CA 90024.  In some 
places I have inserted references to earlier reports of mine (DKK) when 
these supplement Chan's comments. Chan's travel expenses were supported 
by ISR and some local expenses were supported by my office, ONRFE.  


INTRODUCTION.
The Institute for Supercomputing Research (ISR) is a private non-profit 
research institute established in 1987 to "conduct research on issues in 
supercomputing and parallel processing, ... , and to strengthen ties with 
universities and research centers in Japan".  It is funded by the Recruit 
Corporation, a multi-billion dollar Japanese company whose main business 
is recruiting college graduates for the major corporations, but which 
also has a division that sells computer services.  The director 
is Dr. Raul Mendez, who has a Ph.D. from U.C. Berkeley under Alexander  
Chorin and who is well-known for some of the earliest benchmark tests on 
the Japanese supercomputers in the early 80's.  

The ISR has been organizing a series of annual workshops on various 
topics in supercomputing.  Typically, both Japanese and US researchers 
are invited.  Last summer it was held in Hawaii and this year the venue 
was Hakone, a resort about 2 hours from Tokyo, famous for its hot 
springs and the view of Mt. Fuji.  There were about 40 registered 
participants, mostly Japanese, with three speakers from the US: Olaf 
Lubeck of Los Alamos, John Levesque of Pacific-Sierra Research and 
myself.  There were 13 talks total and a panel discussion on "The future 
and evolution of scientific computing".  A program is attached and an 
informal proceedings was available at the conference.  The atmosphere was 
relaxed but intimate, and there were many lively discussions both during 
and after the formal lectures.  

LECTURES.
Four main themes of the conference can be identified: parallel algorithms 
(with emphasis on PDEs), hardware (both general and special purpose) for 
scientific computing, dataflow, and computing  environments (languages, 
networks, programming tools).  This reflects the organizers' attempt to 
cover the main issues in parallel supercomputing and it mostly succeeded 
because there were many discussions during the workshop on how these 
areas should interact.  

Algorithms.
The numerical solution of partial differential equations (PDEs) 
represents a major demand for supercomputing resources.  PDEs are 
widely employed in many areas of science and engineering because most 
physical laws are expressed mathematically as PDEs.  It therefore makes 
sense to look at some of the basic 
PDE algorithms more carefully, especially in view of the advent of 
parallel computing.  Several speakers addressed this  issue.  Prof.  
Toshio Kawai of Keio University tried to convince the audience that 
nature is the best parallel supercomputer and it also provides a very 
powerful class of algorithms for these machines.  He calls these  
"natural algorithms" -- namely explicit in time algorithms which are 
based on local interactions in space.  He has produced a programming 
system called DISTRAN (written in PROLOG and publicly available), an 
ELLPACK-like system which allows the user to easily specify the PDE and 
obtain reliable results quickly. (See also my report 11 April 1990 in 
which this topic is also mentioned.  At that time I thought the idea was 
too good to be true.  Perhaps someone can request the program and perform 
a critical evaluation.  DKK) 
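
To make the idea of an explicit-in-time, locally interacting scheme 
concrete, here is a minimal Python sketch.  It is not DISTRAN and is not 
taken from Kawai's talk; the function name, the choice of the 1D heat 
equation u_t = u_xx, and the grid and time step values are all 
assumptions made for illustration.  

```python
def explicit_heat_step(u, dt, dx):
    """Advance u by one forward-Euler step of u_t = u_xx.

    End values are held fixed.  The scheme is stable only if
    dt / dx**2 <= 0.5.
    """
    r = dt / (dx * dx)
    new = u[:]  # copy, so the boundary values carry over unchanged
    for i in range(1, len(u) - 1):
        # Each point needs only its two nearest neighbours, so every
        # point could be updated by a different processor at once.
        new[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


n = 11
dx = 1.0 / (n - 1)
dt = 0.4 * dx * dx          # inside the stability limit
u = [0.0] * n
u[n // 2] = 1.0             # initial spike in the middle
for _ in range(50):
    u = explicit_heat_step(u, dt, dx)
```

The price of this locality is the stability limit on dt: information 
moves only one grid point per step, which is the slow convergence that 
the next talk set out to address.  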

On the other hand, Chan's talk tried to argue that the most appropriate 
class of algorithms for massively parallel computers are hierarchical 
(multilevel) ones.  He based his arguments on the observation that many 
natural problems are hierarchical (e.g. having many different 
scales in time and space) and therefore the most efficient algorithms 
require some form of global communication.  Hierarchical algorithms 
are a reasonable compromise between explicit algorithms, which are highly 
parallelizable but slowly convergent, and fully implicit algorithms, 
which converge quickly but are difficult to parallelize.  Besides, they can 
be implemented efficiently on hierarchical parallel computers, such as 
the CM-2, the hypercubes and clustered hierarchical shared memory 
systems.  
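
The trade-off can be seen in a small experiment.  The Python sketch below 
is a textbook two-grid cycle for the 1D Poisson problem -u'' = f with 
zero boundary values; it is not one of the algorithms from Chan's talk, 
and the problem, the function names and the parameter choices are 
assumptions.  It compares ten sweeps of weighted Jacobi (purely local) 
with five two-grid cycles that use the same number of fine-grid sweeps 
plus a coarse-grid correction.  

```python
def residual(u, f, h):
    """r = f - A u for A = tridiag(-1, 2, -1) / h^2, zero boundaries."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right) / (h * h))
    return r


def jacobi(u, f, h, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi: local and parallel, but slow on smooth error."""
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [ui + omega * (h * h / 2.0) * ri for ui, ri in zip(u, r)]
    return u


def solve_tridiag(rhs, h):
    """Exact coarse-grid solve of A x = rhs by the Thomas algorithm."""
    n = len(rhs)
    diag, off = 2.0 / (h * h), -1.0 / (h * h)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = off / diag, rhs[0] / diag
    for i in range(1, n):
        denom = diag - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def two_grid(u, f, h):
    """One cycle: smooth, correct on a grid of spacing 2h, smooth."""
    u = jacobi(u, f, h, 1)
    r = residual(u, f, h)
    nc = (len(u) - 1) // 2          # coarse point j is fine point 2j+1
    rc = [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
          for j in range(nc)]       # full-weighting restriction
    ec = solve_tridiag(rc, 2.0 * h)
    e = [0.0] * len(u)
    for j in range(nc):             # linear interpolation of the error
        e[2 * j + 1] += ec[j]
        e[2 * j] += 0.5 * ec[j]
        e[2 * j + 2] += 0.5 * ec[j]
    u = [ui + ei for ui, ei in zip(u, e)]
    return jacobi(u, f, h, 1)


n = 63                               # interior points, must be odd
h = 1.0 / (n + 1)
f = [1.0] * n
u_jac = jacobi([0.0] * n, f, h, 10)
u_tg = [0.0] * n
for _ in range(5):
    u_tg = two_grid(u_tg, f, h)
res_jac = max(abs(r) for r in residual(u_jac, f, h))
res_tg = max(abs(r) for r in residual(u_tg, f, h))
print("Jacobi residual:", res_jac, " two-grid residual:", res_tg)
```

The coarse-grid solve is the "global communication" step: it is the only 
part that is not nearest-neighbour, and applying the same idea recursively 
gives a multilevel method.  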

Very often, existing algorithms for a particular problem are not 
naturally parallelizable and one has to devise novel parallel algorithms.  
Prof. Yoshizo Takahashi of Tokushima University presented  several such 
algorithms for an automated wire-routing problem specifically adapted to 
the Coral parallel computer, a binary tree distributed memory MIMD 
machine based on the MC68000 chip.  These algorithms are particularly 
interesting because they are true MIMD algorithms for a realistic 
unstructured problem running on a real parallel machine and they 
outperform the best commercial software running on a SUN 3/260.  

A central issue in the design of parallel algorithms for MIMD computers 
is how to map the data into the processors so as to minimize data 
communication. George Abe of ISR presented results on comparing a ring 
mapping to a 2D mapping for a semiconductor device modelling problem on 
the iPSC/1. Comparisons with similar results on an Alliant FX/8-4 were 
also given. He concluded that in two dimensions the difference in 
performance for the two mappings can be large, with the two dimensional 
mapping being more efficient.  
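
A rough count of boundary traffic shows why the two mappings can differ.  
The Python sketch below is a generic halo-exchange estimate, not Abe's 
model: it assumes an n x n grid with a nearest-neighbour stencil, an 
interior processor, and it ignores message start-up cost.  The function 
names are hypothetical.  

```python
import math


def halo_ring(n, p):
    """Values one processor exchanges per sweep when the n x n grid is
    cut into p strips, each strip talking to its 2 ring neighbours."""
    if n % p != 0:
        raise ValueError("sketch assumes p divides n")
    return 2 * n                    # one full row to each neighbour


def halo_2d(n, p):
    """Values one processor exchanges per sweep when the grid is cut
    into sqrt(p) x sqrt(p) square blocks with 4 neighbours each."""
    q = math.isqrt(p)
    if q * q != p or n % q != 0:
        raise ValueError("sketch assumes p is a square and sqrt(p) divides n")
    return 4 * (n // q)             # one block edge to each neighbour


n = 256
for p in (4, 16, 64):
    print(p, halo_ring(n, p), halo_2d(n, p))
```

For n = 256 and p = 16 the count is 512 values per sweep for the ring 
against 256 for the 2D mapping, and the gap widens as p grows.  On a 
machine with a high per-message cost the 2D mapping's four smaller 
messages may offset part of this gain, which this sketch does not model.  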


Hardware.
With the advent of multiprocessor systems with a relatively large number 
of off-the-shelf inexpensive processors, it has become increasingly easy 
and cost-effective to build special purpose hardware for special 
applications, as an alternative to conventional mainframe general purpose 
supercomputers.  Prof. Yoshio Oyanagi of the University of Tsukuba calls 
these "multi-purpose" computers.  Japan, long recognized for its 
manufacturing prowess, especially in electronics and computers, is well 
placed to follow this approach.  

Physics seems to be the  primary field for which special purpose 
computers have been built.  Three machines of this kind were discussed at 
the conference.  The first is QCDPAX which is for QCD lattice 
simulations.  Apparently, the world-wide physics community has recognized 
the potential of parallel computing and several countries (including 
Italy, USA and Japan) have initiated projects to build special purpose 
hardware for this application.  QCDPAX is a MIMD machine with 432 
processing units, connected through a 2D nearest neighbor grid and a 
common bus.  Each processing element consists of a 32 bit microprocessor 
MC68020, a floating point chip L64133 and an LSI for vector operation, 2 
MB of fast memory and 4MB of slow memory.  Measured peak performance is 
12.25 Gflops.  For matrix vector multiplies, 5 Gflops is attainable.  For 
the QCD problem, a preconditioned conjugate gradient method is used.  The 
project was funded at a level of about two million US dollars for 
FY87 to FY89.  A commercial product is now being marketed by the Anritsu 
Corporation (model DSV 6450, 4 sold). (See also reports on PAX and 
Anritsu, 11, 12 April 1990, and 28 April 1990, DKK).  

Another special purpose machine discussed (by J. Makino of the Dept. of 
Earth Sciences and Astronomy of the Univ. of Tokyo and ISR) is the 
GRAPE-1 (GRAvitational PipE) developed at the University of Tokyo for 
gravitational N-body problems.  It is not really a computer in the usual 
sense because it is not programmable but instead is viewed as a backend 
computational processor for performing only the N-body force 
computations.  Effective performance of 120 Mflops has been achieved.  
The high performance derives from the use of three arithmetic pipelines 
corresponding to the three spatial co-ordinates.  An interesting feature 
is the use of variable precision: 8 bits for force calculations, 16 bits 
for positional data, and 48 bits for force additions.  A General Purpose 
Interface Bus (GPIB) connects the GRAPE-1 with the host (a Sony 
workstation).  This project is most impressive in its speed of 
completion.  The design started in March 89, the hardware was ready by 
September 89 and production runs began at the same time.  A follow-up 
GRAPE-2 project is now in progress, with parallel pipelines, and improved 
precisions (64/32 bits).  Makino estimates that a 50 board, 15 Gflops 
system can be built for US $100,000 and a 500 board, 150 Gflops system 
for US $300,000.  A GRAPE-3 system is also under design.  Following 
Makino, Junichi Ebisuzaki (Dept. of Earth Sciences and Astronomy of the 
Univ. of Tokyo) talked about adapting other many body simulations for the 
GRAPE system.  The basic modification needed is to accommodate the 
different forms of the force law.  He discussed applications in plasma 
physics and molecular dynamics.  
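
For readers unfamiliar with the computation that GRAPE offloads, the 
following Python sketch shows a generic direct-summation force kernel.  
It is not the GRAPE design: it runs in ordinary double precision rather 
than the reduced precisions described above, and the function name, the 
softening parameter eps and the unit choice G = 1 are assumptions.  

```python
def accelerations(pos, mass, eps=1e-3, G=1.0):
    """Gravitational acceleration on every body by direct summation.

    pos  : list of [x, y, z] positions
    mass : list of masses
    eps  : softening length that avoids division by zero at close range
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + eps * eps
            inv_r3 = r2 ** -1.5
            # The x, y and z components are independent of one another,
            # which is what allows one arithmetic pipeline per coordinate.
            for k in range(3):
                acc[i][k] += G * mass[j] * dx[k] * inv_r3
    return acc


# Two bodies one unit apart, unsoftened, as a quick check.
acc = accelerations([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [1.0, 2.0], eps=0.0)
```

The double loop costs O(N^2) operations per time step, while everything 
else in an N-body code is O(N); this is why a fixed-function backend for 
the force loop alone can carry most of the work.  Changing the force law, 
as Ebisuzaki discussed, amounts to replacing the inv_r3 factor.  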


Prof. Nobuyasu Ito of the Department of Physics at  the University of 
Tokyo gave a seemingly exciting and entertaining talk (judged only from 
the reaction of the audience, since it was given in Japanese!), in which 
he described the m-TIS (Mega spin per second University of Tokyo Ising 
Spin) computer for simulating the many body problem arising from Ising 
systems.  A successor m-TISII system has also been built.  

Lest you think the Japanese supercomputer field is only producing special 
purpose hardware, rest assured that the really big boys have also been 
doing their homework.  Akihiro Iwaya of NEC described the NEC SX-3 
computer, which was widely reported in the US press as the fastest 
general purpose supercomputer today.  He reported that the performance 
ranges from 0.68 to 22 Gflops, depending on the particular computation 
performed.  The machine has a SIMD  architecture (which he estimated is 
sufficient to handle more than 80% of all applications), with shared 
memory (because "FORTRAN is based on shared memory") and up to four 
processors (he estimated that 16-32 such processors is within practical 
limits) each with multiple pipelined arithmetic processors.  He also 
discussed several system issues such as synchronization primitives, 
ParallelDo and ParallelCase statements, and micro/macro-tasking.  All in 
all a very Cray-like machine with blazingly fast peak performance.  (See
also reports on SX-3 25 April 1990, and 19 Sept 1990, DKK.)

Finally, Shin Hashimoto of Fujitsu described the High Speed Parallel 
Processor (HPP),  which has been developed under a joint project between 
MITI and six computer companies (including Fujitsu, NEC and Hitachi) from 
1981 to 1990.  The main idea is to connect several conventional 
supercomputers (e.g. Fujitsu VP2000) via a Common Storage Unit (CSU) and 
a Large High-Speed Storage (LHS).  The data transfer rate between the HPP 
and the LHS is 1.5 Gbytes/sec.  The peak performance is over 10 Gflops.  
It comes with its own parallel language Phil, which has the usual 
parallel-do and lock and barrier statements, and a very user-friendly 
programming environment with execution viewers, cost analyzer and a 
parallel verifier.  Surprisingly, there has been no plan yet for turning 
it into a commercial product.  (See report of the highspeed project, 3 
July 1990, DKK.) 

Dataflow.
One of the most difficult tasks in designing parallel programming systems 
is the automatic detection and extraction of parallelism in programs.  
The dataflow approach has long been advocated as one model for achieving 
this goal and in a fundamental way  it is very attractive because it 
looks at the basic level of computation.  While the dataflow approach has 
not yet been demonstrated to be competitive in practice (practical 
dataflow machines are not exactly proliferating at this moment), we 
should aim for the ideal nonetheless, as Olaf Lubeck of the Computing 
Division at the Los Alamos National Laboratory implored us to do in his 
talk.  He has been working closely with both the  group led by Arvind at 
MIT and the SIGMA-1 group at Electrotechnical Laboratory (ETL) of Japan.  
He claims that the main advantages of dataflow are that it produces 
deterministic computations and it extracts maximum parallelism.  In 
addition to some general comments about dataflow, he also discussed a 
more technical problem concerning how to "throttle" loop activations so 
that loop statements do not generate a big demand on system resources 
(i.e. memory) in the early iterations in a dataflow model.  (See also 
reports on ETL projects, 2 July 1990, 16 August 1990, DKK).  
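
The throttling problem has a simple analogue in conventional threaded 
code.  The Python sketch below is only that analogue, not the dataflow 
mechanism Lubeck described: it uses a bounded semaphore so that at most 
k loop iterations are active at once, however many the loop could unfold 
in principle.  The function name run_throttled and the bound k are 
assumptions.  

```python
import threading


def run_throttled(n_iterations, k, body):
    """Run body(i) for every i, with at most k iterations active at once.

    Returns the peak number of simultaneously active iterations.
    """
    slots = threading.BoundedSemaphore(k)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}
    threads = []

    def activation(i):
        try:
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            body(i)
        finally:
            with lock:
                state["active"] -= 1
            slots.release()         # free the slot for a later iteration

    for i in range(n_iterations):
        slots.acquire()             # the spawner stalls when k are active
        t = threading.Thread(target=activation, args=(i,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return state["peak"]


results = []
peak = run_throttled(20, 4, lambda i: results.append(i * i))
```

Without the semaphore, all 20 iterations could be created immediately, 
each holding its own storage; with it, resource demand is capped at k 
activations at the cost of some lost parallelism.  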


Toshio Sekiguchi, also from ETL, described his efforts in designing the 
parallel dataflow language DFC II for the SIGMA-1 dataflow computer 
currently being developed.  The SIGMA-1 is an instruction-level dataflow 
machine, with 128 processing elements, 640 MIPS, 427 MFlops and 330 
Mbytes of memory.  DFC II is C based (functional languages were 
deliberately not chosen because they want the language to be useful "for 
practical problems") and allows synchronization, global variables and, of 
course, automatic detection of parallelism.  The motto is: "sequential 
description, parallel execution".  Applications that have been run 
include QCD, PIC, Keno and LINPACK.  

Environment.
It is widely recognized that one of the potential stumbling blocks on the 
road to the utopia of parallel computing for the masses is that parallel 
programming is an order of magnitude more difficult than vector 
programming, not to mention sequential programming.  Without 
user-friendly and yet powerful programming environments, parallel 
computing may never reach the promised land.  One of the main themes of 
the workshop was environments.  

John Levesque of the Pacific Sierra Research Corp. (PSR) was the main 
speaker on this issue.  John is one of the leaders in this field and he 
had just published a book on optimization techniques for supercomputers.  
He described the philosophy behind the FORGE and MIMDizer systems that 
have been  developed at PSR.  FORGE is an integrated environment 
consisting of program development modules, static and dynamic performance 
monitors, sequential and parallel debugging, memory mapping modules, 
automatic optimization and a menu driven interface.  John stressed the 
importance of building a database of information about the program and 
collecting both static and runtime statistics in order to optimize 
performance.  MIMDizer is a brand new system scheduled to be delivered 
this October.  As the name suggests, it is designed for easing the 
porting of programs to distributed memory MIMD machines.  The key idea is 
"array decomposition", i.e. the user specifies the mapping of data arrays 
and MIMDizer handles automatically all communication interfaces.  This 
appears to be a very practical middle ground between automatic parallel 
compilers and explicit data mapping and message passing by the user.  
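
As an illustration of what an array decomposition has to specify, the 
Python sketch below maps a 1D array of n elements onto p processors in 
contiguous blocks.  It does not reflect MIMDizer's actual interface, 
which is not described here; block_range and owner are hypothetical 
names.  

```python
def block_range(n, p, rank):
    """Half-open global index range [lo, hi) held by processor `rank`
    when n elements are split into p nearly equal contiguous blocks."""
    base, extra = divmod(n, p)      # the first `extra` blocks get one more
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def owner(n, p, i):
    """Rank of the processor that holds global index i (0 <= i < n)."""
    base, extra = divmod(n, p)
    cutoff = extra * (base + 1)     # indices below this lie in big blocks
    if i < cutoff:
        return i // (base + 1)
    return extra + (i - cutoff) // base


n, p = 10, 4
ranges = [block_range(n, p, r) for r in range(p)]
```

With n = 10 and p = 4 the ranges are (0, 3), (3, 6), (6, 8) and (8, 10).  
Once the mapping is fixed, a tool can work out mechanically which 
references cross a block boundary (for a nearest-neighbour stencil, 
indices lo - 1 and hi) and generate the corresponding messages.  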

Any of us who uses electronic mail realizes the importance of 
networks. But networks can also play a critical role in the computing 
environment for supercomputing in the near future, according to Raul 
Mendez in his banquet  talk.  His dream is "supercomputing from a laptop" 
--- and the way to achieve that is through networks.  He discussed the 
existing networks in the US and Europe, as well as the several networks 
being developed in Japan  and over the Pacific.  


PANEL DISCUSSION.
The most lively discussions of the whole workshop occurred during the 
panel discussion, which should come as no surprise when one considers 
that the theme was: "The Future and Evolution of Scientific Computing", 
obviously a subject matter very dear to every participant's heart.  The 
panelists were: Genki Yagawa (Dept. of Nuclear Eng., Univ. of Tokyo), 
Katsunobu Nishihara (Inst. of Laser Eng., Osaka Univ.), Kida (Kyoto 
Univ.), D. Sugimoto (Univ. of Tokyo), and four of the speakers: Lubeck, 
Levesque, Chan, and Oyanagi.  

Mendez led off with the three main topics for discussion:
 1.  What will computational requirements be like in the next decade?
 2.  What is the outlook for SIMD and MIMD architectures?
       Shared versus distributed memory?
 3.  What other trends will come to play a significant role: dedicated 
       machines, dataflow architectures, microprocessors, etc.?  

Concerning Question 1 above, it is clear from the discussions that 
everyone thinks that there is no foreseeable upper bound to the 
computational requirements for supercomputers; in fact the demand is 
limited by the current supercomputers at any one moment in time.  Even 
with a teraflop machine, practical engineering computations (100^3 grids, 
with 3 variables per point) could still require one hour of CPU time.  
And it will require an enormous amount of memory.  In fact, the cost of 
memory may be a major barrier to building a teraflop machine: assuming a 
scaling law of 1 Mbyte per Mflops, a teraflop machine will require 
about 20 billion dollars today just for the memory!  Developments in 
algorithm design will also have to follow the pace of hardware and 
architectural advances (as they have throughout the history of 
computing).  
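
The arithmetic behind these figures can be spelled out.  The short Python 
calculation below uses only the numbers quoted above; the implied price 
per Mbyte and the implied work per unknown are derived quantities that 
were not stated at the panel, and the variable names are arbitrary.  

```python
TERAFLOPS = 1.0e12                      # flops per second

# Memory: 1 Mbyte per Mflops, at the quoted total of 20 billion dollars.
mflops = TERAFLOPS / 1.0e6              # 1e6 Mflops in one teraflops
memory_mbytes = mflops * 1.0            # 1e6 Mbytes, i.e. about 1 Tbyte
memory_cost_dollars = 20.0e9
cost_per_mbyte = memory_cost_dollars / memory_mbytes   # 20000 dollars

# Work: a 100^3 grid with 3 variables per point, one hour at 1 Tflops.
unknowns = 3 * 100 ** 3                 # 3 million unknowns
flops_in_one_hour = TERAFLOPS * 3600.0  # 3.6e15 operations
flops_per_unknown = flops_in_one_hour / unknowns        # 1.2e9

print(cost_per_mbyte, flops_per_unknown)
```

So the 20 billion dollar estimate corresponds to about 20,000 dollars per 
Mbyte, and the one-hour estimate corresponds to roughly a billion 
operations per unknown over the whole run.  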

Concerning Question 2, some interesting consensus emerged.  While some 
panelists think that the SIMD architecture is sufficient for many 
problems (e.g. QCD), many personally prefer MIMD machines for their 
flexibility.  The most likely trend will be hybrid (or cluster, 
hierarchical) architectures, with MIMD at the higher levels and SIMD at 
the lower levels.  Concerning memory architecture (shared or 
distributed), many believe that hiding the storage structure of data will 
undoubtedly lead to performance degradation and therefore some user input 
is essential.  No one believes we'll see automatic and efficient 
compilers for parallel machines in the foreseeable future.  

Concerning Question 3, our representative from the dataflow camp (Lubeck) 
said that ignoring dataflow will be settling for second best and we 
should be  "going for the gold", even though that may take some time.  
Someone pointed out also that while current research has primarily 
focused on the solution techniques, other aspects of the scientific 
computing process, such as mesh generation and visualization, will be 
playing a more important role in the future.  And finally, while parallel 
machines are much more difficult to use than vector machines, users are 
willing to plunge in when given sufficient incentive (e.g. cost 
effectiveness of the CM-2).  


OBSERVATIONS (Chan).

As someone who works on parallel algorithms, I was most struck by the 
small number of talks on this topic.  I realize that this could be 
just a feature of this particular workshop, but in general I have not 
been aware of an active research community in parallel algorithms 
development in Japan.  

On the other hand, the hardware development in Japan has been truly 
impressive, both in terms of raw power and the speed and low cost at 
which special purpose machines are built.  However, I did not see much 
architectural innovation, and most of the designs follow trends 
already established in the industry.  During the banquet, I was informed 
by a Fujitsu engineer that the company is building Japan's first 
commercial distributed memory MIMD machine --- from the terse description 
it resembles the several US hypercubes (1K processors, SPARC chip, grid 
connection topology and "wormhole" routing.) 

Another observation that I made was that many of the talks were based on 
work by interdisciplinary teams, consisting of physical scientists who 
have real problems to solve and hardware and software computer designers.  
In fact, Japanese physicists seem to play a very active role in parallel 
computing --- all the special machines mentioned were built for physics 
problems.  Even though there were several academic engineers on the 
panel, I could not tell how big an influence they have had in this field 
in Japan.  

Overall, attending this workshop was a very pleasant experience for me.  
I met many interesting people (and everyone was very friendly and open) 
and my hosts Raul Mendez and Chris Eoyang were most gracious.  I just 
wish my knowledge of Japanese went beyond reading Kanji, so 
I could understand all the jokes during the few talks delivered in 
Japanese!  


(The observations above are Chan's. Nevertheless they mostly echo my own 
feelings and I have often made similar remarks in my reports.  In fact, 
readers should note that many of the presentations describe work very 
close to that published or presented elsewhere.  However, I do not agree 
entirely with the comment about architectural innovation. There are only 
a few really different computer organizations.  Innovation (as opposed to 
revelation) comes from figuring out how to design so that all the pieces 
work harmoniously. The Japanese researchers seem at least as capable as 
those in the west in finding methods to do this. DKK) 


PROGRAM: 4th ISR Supercomputing Workshop

Raul Mendez (ISR) Opening Remarks
Toshio Kawai (Keio University) "Standard Solutions to Partial 
   Differential Equations on Supercomputers"
Yoshizo Takahashi (Tokushima University) "Parallel Automated Wire-Routing 
   With a Number of Competing Processors"
George Abe (ISR) "Partial Differential Equation Solvers and Architectures 
   for Parallel Scientific Computing"
Toshio Sekiguchi (Electrotechnical Laboratory) "The Design of the 
   Practical Language DFC II and its Data Structures"
Olaf Lubeck (Los Alamos National Laboratory) "Resource Management in 
   Dataflow: A Case Study Using Two Numerical Applications"
Yoshio Oyanagi (University of Tsukuba) "QCD Lattice Simulations With the 
   QCDPAX"
Daiichiro Sugimoto & Junichi Ebisuzaki (University of Tokyo) "Project 
   GRAPE and The Development of a Specialized Computer for the N-body 
   Problem"
Nobuyasu Ito (University of Tokyo) "A Trial to Break Through the Many 
   Body Problem With a Computer"
Panel: G. Yagawa, K. Nishihara, D. Sugimoto, Y. Oyanagi, O. Lubeck, J. 
   Levesque, R. Mendez (moderator)
Akihiro Iwaya (NEC Corp) "Parallel Processing on the NEC SX-3 
   Supercomputer"
Shin Hashimoto (Fujitsu Ltd) "Parallel Application Development on the 
   HPP"
John Levesque (Pacific Sierra Research) "An Advanced Programming 
   Environment"
Raul Mendez (ISR) Closing Remarks

----------------------END REPORT-----------------------------------------