Xref: utzoo soc.culture.japan:5842 comp.sys.super:260
Path: utzoo!attcan!uunet!fernwood!apple!usc!elroy.jpl.nasa.gov!ncar!noao!arizona!rick
From: rick@cs.arizona.edu (Rick Schlichting)
Newsgroups: soc.culture.japan,comp.sys.super
Subject: Kahaner Report: Parallel Computing in Japan (Part 2)
Message-ID: <119@saguaro.cs.arizona.edu>
Date: 6 Nov 90 01:56:08 GMT
Followup-To: soc.culture.japan
Organization: U of Arizona CS Dept, Tucson
Lines: 728


  [Dr. David Kahaner is a numerical analyst visiting Japan for two years
   under the auspices of the Office of Naval Research-Far East (ONRFE).  
   The following is the professional opinion of David Kahaner and in no 
   way has the blessing of the US Government or any agency of it.  All 
   information is dated and of limited life time.  This disclaimer should 
   be noted on ANY attribution.]

  [Copies of previous reports written by Kahaner can be obtained from
   host cs.arizona.edu using anonymous FTP.]

To: Distribution
From: David Kahaner ONRFE [kahaner@xroads.cc.u-tokyo.ac.jp]
      H.T. Kung CMU [ht.kung@cs.cmu.edu]
Re: Aspects of Parallel Computing Research in Japan---NEC & Fujitsu.
Date: 6 Nov 1990

ABSTRACT. Some aspects of parallel computing research in Japan are
analyzed, based on the authors' visits to a number of Japanese universities
and industrial laboratories in October 1990. This portion of the report
deals with supercomputing and parallel computing at NEC and Fujitsu.

PART 2.

The following outline describes the topics that are discussed in the
various parts of this report.

PART 1 OUTLINE------------------------------------------------------------
  INTRODUCTION
  SUMMARY
  RECOMMENDATIONS 
  
PART 2 (this part) OUTLINE------------------------------------------------
  FUJITSU OVERVIEW
    Company profile and computer R&D activities
    VP2000 series supercomputer organization and performance
    PARALLEL PROCESSING ACTIVITIES
     SP (Logic Simulation Engine)
     AP1000 (Cellular Array Processor)
     RP (Routing Processor)
     ATM (Asynchronous Transfer Mode) Switch
    MISCELLANEOUS FUJITSU ACTIVITIES
     Neurocomputing
     HEMT

  NEC
    SX-3 series supercomputer organization and performance
      Benchmark data for SX-3, VP2000, and Cray.
      Comments
    MISCELLANEOUS NEC PARALLEL PROCESSING ACTIVITIES

PART 3 OUTLINE------------------------------------------------------------
  HITACHI CENTRAL RESEARCH LABORATORY
    HDTV
    PARALLEL AND VECTOR PROCESSING
      Hyper crossbar parallel processor, H2P
      Parallel Inference Machine, PIM/C
      Josephson-Junctions
      Molecular Dynamics

   JAPAN ELECTRONICS SHOW, 1990
     HDTV
     Flat Panel Displays

   MATSUSHITA ELECTRIC
     Company profile and computer R&D activities
     ADENA Parallel Processor
     MISCELLANEOUS ACTIVITIES
       HDTV
     Comments about Japanese industry

PART 4 OUTLINE-----------------------------------------------------------
    KYUSHU UNIVERSITY
      Profile of Information Science Department
      Reconfigurable Parallel Processor
      Superscalar Processor
      FIFO Vector Processor
      Comments

    ELECTROTECHNICAL LABORATORY
      Sigma-1 Dataflow Computer and EM-4
      Dataflow Comments
      CODA Multiprocessor

    NEW INFORMATION PROCESSING TECHNOLOGY
      Summary
      Comments

    UNIVERSITY OF TSUKUBA
      PAX

    SANYO ELECTRIC
      Company profile and computer R&D activities
      HDTV
END OF OUTLINE----------------------------------------------------------


FUJITSU OVERVIEW.
Currently about a $16Billion US corporation (based on 158Yen/$), with
sales and income growing about 10%/year.  As with most Japanese
companies, Fujitsu includes many subsidiaries (Fujitsu Laboratories,
Fujitsu Business Systems, Fujitsu America, etc.), and affiliates, and
has about 115,000 employees, about 50,000 in Fujitsu proper, the
remainder in associated companies. R&D expenses are about 12% of sales
and have been increasing more rapidly than sales growth.  Corporate
sales are divided as follows.

              Computers             66% 
              Communications        16 
              Electronic devices    14
              Other                  4 
The most important factor in sales growth was the rapid growth in
overseas (outside Japan) sales, now accounting for about one fourth of
the total. The company states that major strategic objectives are to
strengthen activities in information management, and further globalize
the company. Recently they purchased 80% of British based ICL
(International Computers Ltd). Global research and development,
including software development is mentioned as a specific goal.

The company develops and markets a wide range of computers and related
peripherals such as disk subsystems, ranging from a 32-bit workstation
with built-in CD-ROM and secretary-friendly video and sound, the FM-Towns
(apparently available only in Japan), to a large scale supercomputer,
the VP2000 series, whose deliveries began in spring 1990. A vast range of
semiconductor devices, memories, etc.  and other new technologies, are
sold outside the company and also used in Fujitsu specific products. For
example, Sun SPARC chips were originally purchased directly from
Fujitsu. The company is also very active in important areas of switching
and telecommunication technologies related to HDTV, digital switching
systems, etc.  Fujitsu is also researching high compression rate
encoding for visual telephones and TV conferencing, as well as encoding
methods for HDTV and variable rate encoding methods for future packet
communications.

The main research arm of Fujitsu is the Fujitsu Laboratories, a
subsidiary corporation that operates two laboratories, one in Kawasaki
and the other in Atsugi, both in suburban Tokyo. Total employment is
about 1500. The Atsugi lab, established in 1983, is responsible for
research in areas of electron devices, electronic systems, and advanced
materials. The Kawasaki lab, established in the mid 1960s, is on the
grounds of some other Fujitsu facilities, so that the total working
population there is over 12,000. The Kawasaki lab concentrates on
information processing, communication, space, and personal systems. The
overall educational background of the laboratories is interesting.

           Electronics             48% 
           Physics                 19 
           Computer Science        10
           Chemistry               10
           Mechanical Engineering   5
           All others               8
This is certainly one reason for the wealth of activities in hardware
relative to software.  Half of the staff have Masters degrees; only 10%
hold doctorates.

As mentioned above, Fujitsu is working hard to be a global corporation.
That means both R&D and manufacturing outside of Japan. For example,
Fujitsu signed a five year joint research agreement in October 1989 with
the Australian National University in Canberra.  Subjects include
advanced computers, both large scale supercomputers and more exotic
parallel computers, and computer vision using the visual mechanism of
insects.  Another global research project is with the German software
company Aris, to develop software for automatic translation of Japanese
technical materials and documents into German. When complete, the system
will contain a dictionary, syntax for generating German, and appropriate
development tools for both the dictionary and the syntax. Various
natural language processing and voice recognition systems are also under
study, as is a real-time fingerprint sensor system using holography, and
an on-line handwritten input system claimed to be able to correctly
recognize Kanji, Katakana and Hiragana Japanese characters.
Unfortunately we had no opportunity to see any of these last projects.

Fujitsu computers are heavily used in the mainframe world. The company's
efforts in large scale supercomputers are interesting.  More than 100
orders have been received for computers in the VP2000 series. The most
powerful model, the VP2600 has a maximum performance of about 5
gigaflops. According to Fujitsu at least one VP2000 has been installed
in Kodak headquarters in Rochester NY.

What follows is a brief summary of the Fujitsu VP2000 series supercomputers.
Fujitsu offers four models in this series, as follows.

VP2100 /10, /20 (peak performance 0.5 GFLOPS)
VP2200 /10, /20 (peak performance 1.0 GFLOPS), /40 (peak 2.0 GFLOPS)
VP2400 /10, /20 (peak performance 2.0 GFLOPS), /40 (peak 5.0 GFLOPS)
VP2600 /10, /20 (peak performance 5.0 GFLOPS)

Models designated as /10 have one scalar and one vector arithmetic unit.
Models designated as /20 have two scalar and one vector arithmetic units.
Models designated as /40 have four scalar and two vector arithmetic units.
The /10 and /20 systems are uniprocessor, the /40 is multiprocessor.
Their nomenclature is mildly confusing, as the designation /x0
corresponds to the number of scalar rather than vector units, even
though the latter determine peak performance.

Fujitsu is deeply interested in multiprocessing; one indication has been
their MITI-sponsored research jointly with NEC and Hitachi, called
informally the HPP project, involving four VP2600s each operating as a
uniprocessor attached to a very large shared buffer memory.  Fujitsu
claims that such a large multiprocessor was developed mainly to
demonstrate their success with room temperature HEMT devices (see below)
as the communications drivers between the computers and memory.
Nevertheless, using this, an NEC researcher was able to solve a very
large system of 32K linear equations in less than 11 hours.  For more
details see Kahaner's report 21 June 1990, "japgovt".

Fujitsu is probably experimenting with a /40 multiprocessor for the
VP2600, but has not released any public information about this.  Without
a /40 for the VP2600, Fujitsu's VP2000 series peak performance (however
unrelated to actual performance) will fall short of current competition
from NEC as well as new machines from Cray, and perhaps others. In the
meantime though, the VP2000 series comes in a variety of colors,
including Elegance Red, Future White, and Florence Green.

Peak performance of the /10 and /20 models in any line is the same, as
this is determined entirely by vector processing.  Peak performance can
easily be computed once the machine cycle time and the maximum possible
number of simultaneous floating point operations are known.  For
example, the VP2400/40 and VP2600 each have cycle times of 3.2
nanoseconds.  To achieve the advertised 5.0 GFLOPS peak implies 16
simultaneous floating point operations. For the VP2400/40 this requires
eight per vector unit, while for the VP2600/20 sixteen simultaneous
operations are required.  Each of Fujitsu's vector units is described as
having two arithmetic pipes, but in reality they are more complicated.
Each pipe is capable of simultaneously performing both an addition and a
multiplication. In addition the pipes effectively deliver twice
(VP2400/40) or four times (VP2600/20) as much data. Thus each pipe on
the VP2600/20 can produce the results of four floating point additions
and four floating point multiplications per cycle. This is similar to
the "superword" concept on the ill fated Cyber 205. Of course, if a
calculation is dyadic, that is, does not involve both a multiplication
and an addition, then the peak performance will be reduced by 50%.
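
The arithmetic in the preceding paragraph can be checked with a short
script. This is only a sketch: the function and parameter names are
ours, and the pipe counts and data width factors are simply the ones
quoted above, not figures taken from Fujitsu documentation.

```python
# Peak rate = simultaneous floating point operations per cycle / cycle time.
# One operation per nanosecond corresponds to exactly 1 GFLOPS.

def peak_gflops(cycle_ns, vector_units, pipes_per_unit, ops_per_pipe, width):
    """Return (operations per cycle, peak GFLOPS) for one configuration."""
    ops_per_cycle = vector_units * pipes_per_unit * ops_per_pipe * width
    return ops_per_cycle, ops_per_cycle / cycle_ns

# VP2400/40: two vector units, two pipes each, add + multiply, double width.
vp2400_40 = peak_gflops(3.2, vector_units=2, pipes_per_unit=2,
                        ops_per_pipe=2, width=2)

# VP2600/20: one vector unit, two pipes, add + multiply, quadruple width.
vp2600_20 = peak_gflops(3.2, vector_units=1, pipes_per_unit=2,
                        ops_per_pipe=2, width=4)

# Same VP2600/20 when only additions (or only multiplications) are issued.
add_only = peak_gflops(3.2, vector_units=1, pipes_per_unit=2,
                       ops_per_pipe=1, width=4)

for name, (ops, gflops) in [("VP2400/40", vp2400_40),
                            ("VP2600/20", vp2600_20),
                            ("VP2600/20, add only", add_only)]:
    print(f"{name}: {ops} ops/cycle, {gflops:.2f} GFLOPS peak")
```

Both full configurations come out at 16 operations per cycle, and the
add-only case at half that, which is the 50% reduction noted above.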

By studying the performance of VP2000 machines on typical job streams it
has been observed that when the scalar unit is 100% in use, the vector
unit is about 50% to 75% busy. Thus the addition of a second scalar unit
can significantly increase throughput, and was presumably Fujitsu's
reason for adding it.  However, for any single user problem it might not
be possible to keep the vector unit constantly busy. Thus the most
practical environment for such a setup would be a computing center or
other multi user job shop, where several user jobs can be run
simultaneously. Kyoto University, a typical busy university computing
center, will be getting a VP2600/10 soon. We asked why it has only one
scalar processor. Although the university made a very strong case for
two scalar processors, the Ministry of Education decided (based on
budgetary, or other, grounds) to only support the one scalar processor
system. However it is an easy field upgrade to add the second scalar
unit. The choice of a VP2600/10 rather than a VP2400/40 was a matter of
policy; Kyoto has always tried to purchase the fastest machine
available. It is also possible that they would like to upgrade
eventually to a multiprocessor 2600 when this is available.

As is the case with most of today's vector supercomputers, data to and
from the vector arithmetic units need to pass through vector registers.
In the VP2600 these registers have a capacity of 128KB (64 elements
times 256 registers times eight byte data) but can be concatenated in
various ways, for example as 2048 times 8 times eight byte instead. Thus
the organization of the registers is very flexible. To get data between
memory and the vector registers Fujitsu only provides two load/store
pipelines. This could be a bottleneck, although the register flexibility
may alleviate it to a certain extent. Memory to register bandwidth has
been criticised in the VP2000 series, but at least one new benchmark,
given below, suggests that Fujitsu has been making efforts to deal with
this.  The computation of interest is that of multiplying large matrices
A=B*C, each of which is 4096 by 4096, with real 64 bit floating point
components. The source program is written in 100% standard Fortran but
is organized to take advantage of the two pipe structure of the VP2000
architecture in a very clear way.  The essential segment of the source
program consists of first zeroing the target array.

        DO 4000 J=1,4096
         DO 4000 I=1,2048
           A(I,J)=0.0
           A(I+2048,J)=0.0
   4000 CONTINUE
     
Then the actual multiplication is as follows.

        DO 5000 L=0,1
         DO 5000 J=1,4096
          DO 5000 K=1,4096,4
           DO 5000 II=1,2048
             I=II+(2048*L)
             A(I,J)=A(I,J)+B(I,K)*C(K,J)+B(I,K+1)*C(K+1,J)
       *            +B(I,K+2)*C(K+2,J)+B(I,K+3)*C(K+3,J)
   5000 CONTINUE

In this case the matrices are large enough that there is significant
memory to register to memory traffic.  Nevertheless, Fujitsu's FORT77/VP
compiler is able to vectorize this effectively and generate 4.8 GFLOPS,
96% of peak performance.
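
To make the loop structure easier to follow, here is a small Python
sketch of the same organization at reduced size. It is an illustration
only, not the benchmark code. We assume the row offset is 2048*L in the
original (half*l here), so that the two passes cover the lower and
upper halves of the rows, and we use 0-based indices. The function
names are ours.

```python
def split_unrolled_matmul(B, C, n):
    """A = B*C with rows handled in two halves and the K loop unrolled by 4."""
    assert n % 4 == 0
    half = n // 2
    A = [[0.0] * n for _ in range(n)]
    for l in range(2):                      # first half of the rows, then second
        for j in range(n):
            for k in range(0, n, 4):        # K loop unrolled by four
                for ii in range(half):      # innermost loop: the vectorized one
                    i = ii + half * l
                    A[i][j] += (B[i][k] * C[k][j]
                                + B[i][k + 1] * C[k + 1][j]
                                + B[i][k + 2] * C[k + 2][j]
                                + B[i][k + 3] * C[k + 3][j])
    return A

def naive_matmul(B, C, n):
    """Reference triple loop, used only to check the result."""
    return [[sum(B[i][k] * C[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

n = 8
# Small integer-valued entries, so floating point sums are exact.
B = [[float(i + 2 * k) for k in range(n)] for i in range(n)]
C = [[float(k - j) for j in range(n)] for k in range(n)]
same = split_unrolled_matmul(B, C, n) == naive_matmul(B, C, n)
print("matches reference:", same)

# Work in the full 4096 by 4096 problem: one multiply and one add per term.
flops = 2 * 4096 ** 3
seconds_at_reported_rate = flops / 4.8e9
print(f"{flops} operations, about {seconds_at_reported_rate:.1f} s at 4.8 GFLOPS")
```

Each inner iteration performs four multiplications and four additions
on vectors of half the column length, which matches the add-plus-multiply
pipes described earlier.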

One comment is worth making here. At the InfoJapan 90 meeting a lecture
was presented by Nobuo Uchida, from the Mainframe division of Fujitsu,
on the architecture of the VP2000 series computers. We found it
particularly interesting that his paper made no mention of the /40
series in the VP2000 lineup.  The English product announcement about the
/40 had been distributed shortly before the meeting, and the Japanese
announcement was available weeks before that.  Because the /40 is a
multiprocessor, it represents a most important addition to their product
line. The characteristics and properties of new advanced computers are
of real interest to the research community, especially those who travel
long distances to hear about them.  Perhaps there was a manuscript
revision that we did not notice. Nevertheless, it was disappointing that
this new system was not included in his discussion. Perhaps it is
related to Fujitsu's silence about a VP2600 multiprocessor.


FUJITSU'S ACTIVITIES IN PARALLEL PROCESSING. 

In our recent visit to Fujitsu Laboratories, we saw the following
three parallel processing projects.
(1) SP (Logic Simulation Engine). This is a special purpose 64 processor
event driven parallel computer designed to test the logic design of VLSI
chips before they are built.  It is claimed that it has larger capacity
than any other simulator and that simulation times are about 30 times
faster than using Fujitsu's 780 mainframe. Testing a 1M-gate chip takes
about 4 hours on the SP, and this is 1000 times faster than the 780.
The SP is implemented in TTL, with gate arrays for the ECC
implementation. (Fujitsu can build 200K gate, 331-pin arrays currently.)
Ten SP machines have been built, and 2 are in use by Amdahl in the U.S.
The others are for internal use.  Fujitsu claims that partly due to its
use of event driven simulation, SP is 100 times faster than the IBM
Yorktown Simulation Engine and feels that the SP is a successful effort.
(NEC Corp also has a logic simulator, Hal II and TDHal.) It seems that
most computer companies in Japan have developed their own special
purpose parallel engines for logic simulation for their internal use.

(2) AP1000, renamed from older CAP (Cellular Array Processor).  This is
composed of up to 1024 cells or processors.  Each cell is composed of a
SPARC chip (for ease of software development), Weitek floating point
unit and gate array router running at 25MHz, and 16MB of memory.  Cells
can communicate using wormhole routing in a two dimensional mesh with
25MB/sec channels.  The standard structured buffer pool is used to avoid
deadlocks.  The network also supports row and column broadcasting.  The
router and SPARC connection is 40 MBytes/sec.  Since the connection is
also shared by the CPU cache, the actual available bandwidth is still
under evaluation.  In addition, a special frame buffer can read out from
each cell so that image data can be partitioned up among cells
efficiently. Maximum performance is 12.5MFLOPS/cell, and 12.8GFLOPS for
a fully configured 1024 cell system.  AP1000 has good (but not
spectacular) communication and good numerical performance potential.
Fujitsu expects that it will typically be connected to a Sun-4 as a host
via a VME bus. This project has been going on for a number of years
under the old name CAP.  (CAP is also the name of the Cellular Array
Processor developed by Mitsubishi Electric for satellite image
processing. As far as we know there is no relation between these
projects.) A team of about 10 people has been at work on the AP1000 for
two years.  The new AP1000 system is much more powerful, primarily
because of the use of SPARC chips and Weitek floating-point chips.  In
contrast, the old system used Intel 80186 chips.  Present plans are to
begin production this fall with installation of 7 or 8 machines in
spring of 1991.  Of these, most are to be 64-cell systems and one is to
be a 512-cell or 1024-cell system.  (A 1,024-cell system is scheduled to
be built in April 1991.) Currently a 16-cell system is running. The
64-cell system, with about 800 MFLOPS peak performance should cost the
company about $300K U.S.
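
The aggregate figures quoted above follow directly from the per-cell
numbers. The sketch below simply multiplies them out for the
configurations mentioned; the names are ours and nothing here comes
from Fujitsu beyond the 12.5 MFLOPS and 16MB per-cell values.

```python
MFLOPS_PER_CELL = 12.5   # peak per cell, as quoted above
MBYTES_PER_CELL = 16     # memory per cell, as quoted above

def ap1000_totals(cells):
    """Return (peak MFLOPS, total memory in MB) for a given cell count."""
    return cells * MFLOPS_PER_CELL, cells * MBYTES_PER_CELL

for cells in (16, 64, 512, 1024):
    mflops, mbytes = ap1000_totals(cells)
    print(f"{cells:5d} cells: {mflops / 1000:6.2f} GFLOPS peak, {mbytes} MB")
```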

We were shown a straightforward ray-tracing example which is a perfect
candidate for data parallelism.  The system currently has a home-made
run-time system, and no parallelizing compiler for either C or Fortran.
We were told that in addition to scientific computing, visualization,
and CAD, one potential application was for design rule checking, but in
that case it isn't clear why floating point is necessary.  The
Australian National University will get a 128-node AP1000 system and
will help with software development and evaluation.  (Contact: Prof. M.
McRobbie [mam@arp.anu.edu.au]).

As with the earlier CAP project, Fujitsu has a nice color sales brochure
about the AP1000, but this is still considered an experimental machine.
Probably its most important uses will be internal to Fujitsu, similar
to the SP model. We feel that the project is probably a few years behind
similar work at leading research places in the U.S., primarily because
of the differences in software and interprocessor communications
capabilities.

Two contacts for this project are given below.
             Mitsui Ishii [mishi@flab.fujitsu.co.jp]
             Hiroyuki Sato [hsat@flab.fujitsu.co.jp]
             Fujitsu Laboratories
             1015 Kamikodanaka Nakahara-ku
             Kawasaki 211, Japan             
               Tel: (044) 777-1111, -2327
             
(3) RP (Routing Processor).  This is a special-purpose SIMD machine that
implements maze routing.  A performance goal is to route large (e.g.,
100K-gate) gate arrays in approximately one hour.  To implement the
machine, bit-serial PEs (Processing Elements) are used.  A 4K-PE system
is operational.  We saw a successful demonstration of the system in
doing a difficult switch box routing.  Since in maze routing only PEs on
the wave front are active at a given time, the system will typically
multiplex four "logical" PEs onto each "physical" PE to ensure efficient
utilization of physical PEs.  Approximately 5 people have been working
on the RP project for two years.  They are currently building a 16K-PE
RP.

A challenge of using special purpose CAD engines such as the RP is their
graceful integration with the rest of the CAD system.  Also, it is not
clear how the RP can take advantage of hierarchical information
available in a design.  Fujitsu researchers are looking at these issues.
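
For readers unfamiliar with maze routing, the following is a minimal
sequential sketch of the classic wavefront (Lee-style) method on a
small grid. It is our own illustration, not Fujitsu's implementation:
the RP expands the whole wavefront at once on bit-serial PEs, whereas
this code visits wavefront cells one at a time. The grid encoding and
function name are assumptions made for the example.

```python
from collections import deque

def maze_route(grid, src, dst):
    """Shortest rectilinear path from src to dst avoiding '#' cells, or None."""
    rows, cols = len(grid), len(grid[0])

    def neighbors(r, c):
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != '#':
                yield nr, nc

    # Wave expansion: label each reachable cell with its distance from src.
    dist = {src: 0}
    frontier = deque([src])
    while frontier:
        cell = frontier.popleft()
        if cell == dst:
            break
        for nb in neighbors(*cell):
            if nb not in dist:
                dist[nb] = dist[cell] + 1
                frontier.append(nb)
    if dst not in dist:
        return None

    # Backtrace: step from dst to any neighbor whose label is one smaller.
    path = [dst]
    while path[-1] != src:
        here = path[-1]
        for nb in neighbors(*here):
            if dist.get(nb) == dist[here] - 1:
                path.append(nb)
                break
    path.reverse()
    return path

grid = ["....#",
        ".##.#",
        ".#...",
        ".#.#.",
        "...#."]
route = maze_route(grid, (0, 0), (4, 4))
print(route)
```

Only the cells on the current wavefront do any work at each step, which
is the property that motivates multiplexing several logical PEs onto
each physical PE.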

ATM Switch. 
In addition to the three parallel processing projects described above,
we also visited a major project on the development of an ATM
(Asynchronous Transfer Mode) switch.  The basic idea is that data is
divided into cells which are 53 byte packets and then transmitted along
the transmission path without synchronization. The application area here
is ISDN and HDTV. Such a switching system will be able to handle
multi-media communication of voice, data, video, etc.  Fujitsu has been
working on this project for several years, and claims to have
prototyped the world's first ATM switch.  Built out of a special IC
using a BI-CMOS RAM and logic gate array, the current system is a 16 by
16 switch, of three stages with two 8 by 8 crossbar switches per stage.
Each port is 78 MHz and 16-bit wide, allowing for 1.2 Gbits/second per
port.  The 16 by 16 switch, housed in one cabinet, can therefore handle
128 channels of 150 Mbits/second each.  There is a 128-cell buffer at each
output port of every crossbar.  Switch routing is based on the
destination tag, corresponding to the virtual circuit identifier (VCI)
number. Cell sequencing is maintained, but cells may be lost if there
is congestion.  Presently, two 16 by 16 prototypes have been built and
are being used to evaluate cell loss characteristics. Eventually a
SONET interface will be installed, but this is not supported yet.
Instead a proprietary interface is being used during the testing phase
of the project.
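
The port and channel figures above can be reproduced as follows. This
is a sketch with our own variable names. The 48-byte payload (5-byte
header) is the standard ATM cell layout and is not stated in the
description we were given; everything else is taken from the numbers
above.

```python
CELL_BYTES = 53              # cell size quoted above
PAYLOAD_BYTES = 48           # assumption: standard ATM cell, 5-byte header
PORT_CLOCK_HZ = 78_000_000   # 78 MHz per port
PORT_WIDTH_BITS = 16
PORTS = 16
CHANNEL_BPS = 150_000_000    # one 150 Mbits/second channel

port_bps = PORT_CLOCK_HZ * PORT_WIDTH_BITS       # raw bits/second per port
channels_per_port = port_bps // CHANNEL_BPS      # whole channels that fit
total_channels = PORTS * channels_per_port
cell_time_ns = CELL_BYTES * 8 / port_bps * 1e9   # time to clock one cell through
payload_fraction = PAYLOAD_BYTES / CELL_BYTES

print(f"port rate: {port_bps / 1e9:.3f} Gbits/second")
print(f"channels: {channels_per_port} per port, {total_channels} in total")
print(f"one cell occupies a port for about {cell_time_ns:.0f} ns")
print(f"payload fraction of each cell: {payload_fraction:.1%}")
```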

In parallel processing the company's research effort emphasizes more
special-purpose machines such as SP and RP than we would expect from a
U.S. company. The best research projects, such as the ATM switch, SP, and RP,
are completely driven by development needs.  The strongest efforts seem
to be related to switching and the CAD related issues.  Projects more to
do with basic research such as AP1000 do not seem to be as advanced
compared to work in the U.S.

MISCELLANEOUS FUJITSU COMPUTING ACTIVITIES.
Neurocomputing. The usual metric here is the number of changes to the
weight matrix that are possible each second. Earliest research in
neurocomputing used traditional computers to simulate the architecture
of a neural network. The next step is to implement some aspects of the
network in hardware. By using special purpose digital signal processor
chips Fujitsu has demonstrated more than 500 million connection changes
per second.  A longer range goal is to use biological elements as part
of the architecture, but we have seen no substantial results yet.
 
Associated with neuro computers are various forms of inference engines
that are often implemented with robot applications in mind. Fujitsu has
also been working in these areas with particular emphasis on robot
vision. This again relies on special purpose hardware. They have also
used fuzzy logic to study driverless vehicles and obstacle avoidance.
They have developed the Idaten color image processing system which can
be used to distinguish objects moving at different speeds, and so, for
example, to do real time scanning of a runner, determine speed and
stride and then estimate the time to finish line. This particular
research has applications in many other areas and should be followed.

Another neural net research project has been joint with Nikko Securities
to investigate how well neural nets can predict the buy/sell times for
stock transactions and to rate convertible bonds by looking at various
financial indices.
    Takashi Kimoto
    Computer Based Systems Lab
    Fujitsu Laboratories, Kawasaki
    1015 Kamikodanaka, Nakahara-ku, Kawasaki 211, Japan

Electrical devices, including an 8 bit Josephson digital signal
processor, and room temperature HEMTs (High Electron Mobility
Transistors).  In 1980 Fujitsu developed the HEMT. At liquid nitrogen
temperatures, -196C, electrons move about 200 times as fast as they do
in silicon.  As part of the government sponsored "high speed computing"
project Fujitsu has now developed a 4K-bit static RAM that operates at
room temperature with 500 pico second clock (fastest memory operations
yet reported), and a 4.1K-gate gate array.  Further developments have
resulted in a chip with 3335 HEMTs with 490ps data propagation time.
Fujitsu claims that they will use this in a new version of a
supercomputer they will soon build.   Presently, several prototype
system components at the LSI level have been built. These are a
1.1K-gate bus driver, a 3.3K-gate random number generator (1.6GHz), and
an 8-bit digital-to-analog converter (1.2GHz).  This technology, which
is almost completely proprietary to Fujitsu, may be significantly useful
in future computing systems.  However, since the HPP project is over, it
will not be easy for Fujitsu to build this kind of experimental
supercomputers unless they can be supported by some new government
programs.

Our overall host for this visit was
           Mr. Shigeru Sato
           Board Director
           Fujitsu Laboratories
           1015 Kamikodanaka Nakahara-ku
           Kawasaki 211, Japan
           Tel: (044) 777-1111
Mr. Sato spent many years in one of Fujitsu's development "works" before
moving to the laboratory. We were impressed with his basic grasp of
technical issues and understanding of the role that research plays in
the development cycle. We asked him if the efforts of other Japanese
companies (such as NEC) to establish research laboratories outside of
Japan had any parallel at Fujitsu. He explained that Fujitsu had several
active research collaborations including at the Australian National
University, mentioned above, and it was also looking into the
possibility of having closer contacts with some U.S. universities such
as Carnegie Mellon, in Pittsburgh. Although he was remarkably frank with
us, we didn't have time to discuss strategic issues with Sato. We did
ask about the success of technology transfer, and he suggested that one
reason for its success is that researchers define the research project
with development groups before the project actually begins.

Two days after this initial visit, Kung with T. W. Kang (General
manager, Systems Group of Intel Japan) went back to visit the Fujitsu
Laboratories again for a meeting with their researchers.  The purpose of
the meeting was to discuss applications areas for iWarp-like distributed
memory parallel machines.  We identified several potential areas and had
some lively discussions.  It was generally felt that some CAD areas and
neural net learning can make the best use of parallel machines.  In
the CAD area, we predicted that the expected speed up ratio due to
parallel processing will be 100,000 for logic simulation, 1,000 for
test-pattern generation and for placement and routing, and 100 for
design rule check and circuit simulation.  The fruitful discussion
meeting was organized by:
      Fumiyasu Hirose,  Senior Researcher 
      Artificial Intelligence Laboratory 
      Fujitsu Laboratories LTD. 
      1015, Kamikodanaka
      Nakahara-Ku Kawasaki 211 
      Tel: (044) 754-2663 FAX: (044) 754-2580 
      Email: hirose@yugao.stars.flab.fujitsu.co.jp


NEC.
Kahaner visited this factory in March 90 and reported on the SX-3 at
that time. Then the only running system had one processor. Now, several
one processor machines are being tested prior to shipment and a two
processor system has been set up and is being debugged.  Chief designer
Watanabe stated that a one processor system depending upon peripheral
options would cost in the neighborhood of $10 million U.S. He claimed
that the 4 processor system will be up in a few months, and we have
heard estimates that it will cost roughly $25 million.

Peak performance of a uniprocessor system is 5.5 GFLOPS, based on a
cycle time of 2.9 nanoseconds and 16 simultaneous operations
(16/2.9=5.5).  The vector unit in such a system consists of one, two, or
four sets of vector pipelines. Each vector pipeline set consists of two
add/shift and two multiply/logical functional pipelines. Each of the
functional pipelines can be operated simultaneously; thus the arithmetic
processor in a uniprocessor system with four vector pipeline sets can
execute up to 16 floating point operations per machine cycle.  To get
near peak performance all 16 pipes must be kept busy.  Data are fed to
and exit from the arithmetic pipes to vector registers, with a maximum
capacity of 144KB.  It is unlikely that an SX-3 system would be
purchased without all four vector pipeline sets in each processor.

The four processor system is thus capable of 22 GFLOPS peak, although
this assumes that all the data can be kept in the vector registers. To
the extent that data must be brought from main memory to the registers
performance may degrade. The bandwidth between memory and the registers
depends on the memory hardware technology, and on how the data is
arranged in the memory banks, but serious applications must keep data in
registers to get good performance.  Further, 22 GFLOPS requires 64
simultaneous operations, and this will mean that different operations
have to occur simultaneously.  Also, unless the user program can be
divided up into simultaneous, independent tasks that use the same data
in the vector registers, arrays will have to be quite long to absorb the
startup penalty of being parcelled out to several processors.  The most
effective environment for such multiprocessors is a busy multiuser
computer center, similar to that for other large multiprocessors. Most
computer centers will charge a penalty for single users who want to grab
all four processors.  Yoshihara also discussed some aspects of this in
benchmark calculations earlier this year, see Kahaner's distribution 1
May 1990 "yosh".
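
The SX-3 peak figures can be derived the same way as for any vector
machine: count the functional pipelines and divide by the cycle time.
The sketch below uses only the numbers given above; the function name
is ours.

```python
CYCLE_NS = 2.9        # SX-3 machine cycle
PIPES_PER_SET = 4     # two add/shift plus two multiply/logical pipelines

def sx3_peak(processors, pipeline_sets):
    """Return (simultaneous operations per cycle, peak GFLOPS)."""
    ops_per_cycle = processors * pipeline_sets * PIPES_PER_SET
    return ops_per_cycle, ops_per_cycle / CYCLE_NS

uni = sx3_peak(processors=1, pipeline_sets=4)
quad = sx3_peak(processors=4, pipeline_sets=4)

print(f"uniprocessor, four sets: {uni[0]} ops/cycle, {uni[1]:.1f} GFLOPS")
print(f"four processors:         {quad[0]} ops/cycle, {quad[1]:.1f} GFLOPS")
```

A processor with fewer than four pipeline sets scales down
proportionally, which is one reason we expect most buyers to take all
four.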

At least three or four uniprocessor systems have been sold, in Europe.
We were not told about sales of two or four processor systems.

Users can write Fortran without any special directives. NEC provides an
automatic parallelizing and vectorizing compiler option. We had no
opportunity to test this. Watanabe showed us results of running 100 by
100 LINPACK (all Fortran) giving performance on the SX-3 Model 14
(uniprocessor) and several other supercomputers as follows. He also
showed some corresponding figures for 1000 by 1000 linear system and for
1024 by 1024 matrix multiplication given below.  The last two columns
correspond to what Dongarra calls "best effort".  There are no
restrictions on the method used or its implementation.  Matrix
multiplication runs almost at theoretical peak speed. The large linear
system runs at slightly less than 70% of peak, while on the Cray the
same calculation runs at just above 80%. The differences are probably
associated with bandwidth from memory to the vector registers.
Nevertheless, at 3.8 GFLOPS the SX-3 is 80% faster than the Cray.
 
                             Ax=b           Ax=b           A=B*C
                             LINPACK        Best Effort    Matrix Mult
                             100 x 100      1000 x 1000    1024 x 1024
                             (Fortran)
                             MFLOPS         GFLOPS         GFLOPS
NEC SX-3/14                  216            3.8            5.1
Fujitsu VP2600               147            2.9            4.8 (4096 x 4096)
Hitachi S-820/80             107
Cray Y-MP8 (8 processors)    275            2.1
Cray Y-MP1 (1 processor)      90
Cray X-MP4                                  0.8

(Note: VP2600 model was not specified for the Ax=b figures, and was /10
for A=B*C, but both 2600/10 and /20 have the same peak performance, 5
GFLOPS.) To the best of our knowledge, figures for the NEC and Fujitsu
machines are new.  We asked Watanabe if the SX-3 four processor
performance would scale up, and he only exclaimed "God knows".
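
The percentages quoted above can be checked from the rounded table
entries.  The short sketch below does this for the two machines whose
peak rates we can state; the 5.5 GFLOPS single-processor peak used for
the SX-3/14 is an assumption on our part (it is not given in this
section), while the 5 GFLOPS VP2600 peak is taken from the note above.

```python
# Fraction of peak and relative speed, from the rounded figures in the table.
PEAK_GFLOPS = {
    "SX-3/14": 5.5,         # assumed single-processor peak, not stated in this section
    "Fujitsu VP2600": 5.0,  # peak as given in the note accompanying the table
}

MEASURED_GFLOPS = {
    "SX-3/14": {"linear system": 3.8, "matrix multiply": 5.1},
    "Fujitsu VP2600": {"linear system": 2.9, "matrix multiply": 4.8},
}

def fraction_of_peak(machine, benchmark):
    """Measured rate divided by the (assumed) peak rate for that machine."""
    return MEASURED_GFLOPS[machine][benchmark] / PEAK_GFLOPS[machine]

def relative_speed(fast_gflops, slow_gflops):
    """How many times faster the first rate is than the second."""
    return fast_gflops / slow_gflops

if __name__ == "__main__":
    for machine, results in MEASURED_GFLOPS.items():
        for benchmark in results:
            pct = 100.0 * fraction_of_peak(machine, benchmark)
            print(f"{machine:15s} {benchmark:16s} {pct:5.1f}% of peak")
    # SX-3/14 versus Cray Y-MP8 on the 1000 x 1000 best-effort linear system
    print(f"SX-3/14 vs Y-MP8: {relative_speed(3.8, 2.1):.2f}x")
```

Because the table entries are rounded to two digits, the results are
only good to about a percentage point.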

NEC's chip technology is very good. Using ECL, they have crammed 20,000
gates with 70 picosecond switching time onto one chip.  We think that
this is better than in the U.S.  A 1,200-pin multi-chip package can hold
100 such chips and dissipate 3 kilowatts.  Packaging, carrier, and cooling
technology is about as good as in the U.S.  NEC claims that they have
taken extra care to design in error testing capability and that about
30% of their chip area is associated with diagnostic functions. (This is
certainly different from some U.S. manufacturers.)  The memory system
uses 20ns 256Kbit SRAMs.  A memory card can hold 32 MBytes.  Thus a
memory cabinet with 32 memory cards has 1 GByte.  Two peripherals are
worth noting. NEC makes a cartridge tape unit (IBM compatible tapes),
fully automated, with 1.2 terabyte capacity.  NEC also makes a disk
array made of eight byte-interleaved disks.  Used as a single disk
drive, the disk array has a 5.5 gigabyte capacity.  The burst transfer
rate is 19.6 MBytes/sec, whereas the sustained transfer rate is 15.7
MBytes/sec.
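
A few back-of-the-envelope quantities follow directly from the figures
in the preceding paragraph.  The sketch below uses only those figures;
the two assumptions are that capacities are in binary units (1 MByte =
1024 KBytes) and that the SRAM count ignores any extra chips used for
error checking.

```python
# Derived quantities from the chip, memory, and disk figures quoted above.
GATES_PER_CHIP = 20_000
CHIPS_PER_PACKAGE = 100
PACKAGE_WATTS = 3_000
CARD_MBYTES = 32
CARDS_PER_CABINET = 32
SRAM_KBITS = 256
DISKS_IN_ARRAY = 8
BURST_MBYTES_PER_SEC = 19.6
SUSTAINED_MBYTES_PER_SEC = 15.7

# Logic: gates and power per multi-chip package
gates_per_package = GATES_PER_CHIP * CHIPS_PER_PACKAGE
watts_per_chip = PACKAGE_WATTS / CHIPS_PER_PACKAGE

# Memory: cabinet capacity and data SRAMs per card
cabinet_mbytes = CARD_MBYTES * CARDS_PER_CABINET
sram_kbytes = SRAM_KBITS // 8                        # capacity of one SRAM chip
srams_per_card = CARD_MBYTES * 1024 // sram_kbytes   # assumes no check-bit chips

# Disk array: per-disk share of the burst rate, and sustained/burst ratio
burst_per_disk = BURST_MBYTES_PER_SEC / DISKS_IN_ARRAY
sustained_fraction = SUSTAINED_MBYTES_PER_SEC / BURST_MBYTES_PER_SEC

if __name__ == "__main__":
    print(f"gates per package:     {gates_per_package:,}")
    print(f"watts per chip:        {watts_per_chip:.0f}")
    print(f"cabinet capacity:      {cabinet_mbytes} MBytes")
    print(f"data SRAMs per card:   {srams_per_card}")
    print(f"burst rate per disk:   {burst_per_disk:.2f} MBytes/sec")
    print(f"sustained/burst ratio: {sustained_fraction:.2f}")
```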

NEC has begun publication of a newsletter about the SX-3, SX World.
Interested readers can obtain a copy by writing NEC, 1st Product
Planning Department, EDP Product Planning Division, 7-1 Shiba 5-chome,
Minato-ku, Tokyo 108-01, Japan. In it, their view of supercomputing is
stated explicitly: "the actual performance of a supercomputer is
determined by its scalar performance...NEC's approach to supercomputer
architecture is clear. Our first priority is to provide high-speed
single processor systems which have vector processing functions and are
driven by the fastest technologies, while giving due consideration to
ease of programming and ease of use; we also seek to provide shared
memory multiprocessor systems to further improve performance."

The SX-3 looks like an exciting machine that is on a par with the best
currently available U.S. products.  There is a new U.S.  supercomputer
from Cray Research nearly ready to be released, as well as perhaps
models from Cray Computer Corporation and others,  but we have no
concrete information about their performance.  In its four processor
version, the SX-3 might be the fastest large scale supercomputer, but
this will be entirely dependent on the application and the skill of the
compiler writers. Fujii and Tamura ("Capability of Current
Supercomputers for Computational Fluid Dynamics", Inst of Space and
Astronautical Sci, Yoshinodai 3-1-3, Sagamihara, Kanagawa, 229 Japan),
note that "Basically the speed of the computations simply depend on when
the machines were introduced into the market. Newer machines show better
performance, and companies selling older machines are going to introduce
new machines."

NEC develops software expertise by use of in-house training; they have a
"college" for their employees. For example, Watanabe is in charge of
courses related to machine design. They also have a long history of
vector computing experience, as NEC mainframes have had vector pipes for
many years. They do not have experience in large scale multiprocessors
as far as we know, except through the HPP project, which was never
commercialized.  To develop software, NEC relies on 30 or so of its
subsidiaries in various parts of Japan, so software is often developed
in a distributed manner.

Watanabe told us that NEC did not have any plans to develop a smaller
general purpose multiprocessor, as they felt that the market would not
support the volume that would be required for profitability. Watanabe
has moved from the SX-3 factory to the corporate headquarters as a
strategic product planner.  The headquarters building is one of the
largest in central Tokyo and is shaped exactly like the U.S. space
shuttle, except for a huge gaping hole in its center to reduce wind
loading. It is said
to be the "world's smartest building."

Watanabe's move illustrates the remark made earlier about
senior research people moving into other corporate functions.

              Dr. Tadashi Watanabe
              Assistant General Manager
              EDP Product Planning Division
              NEC Corporation
              7-1, Shiba 5-chome
              Minato-ku, Tokyo 108-01
              Tel: (03) 798-6830 (Direct), (03) 454-1111
              Fax: (03) 798-6838

As far as innovative architectures are concerned, the SX-3 does not seem
to represent a substantial leap from state-of-the-art supercomputers.
Researchers in parallel computing are not excited by shared memory
machines, which they feel cannot scale up to make the kind of quantum
increases in computing speed that they are seeking.  But for an engine
that solves complicated scientific and engineering problems, a factor
of two, or even an improvement of a few percent, translates into real
money and new science. What is significant is how far NEC has come in
relatively few years.  NEC now has state-of-the-art capabilities in all aspects of
supercomputers, except perhaps in some software applications.  They do
not give away any area and make every effort to build everything
themselves. Customers seem quite loyal in Japan. Software compatibility
with existing systems, and personal relationships between vendor and
customer are important here, perhaps taking the edge off price
differences or delivery dates.

It is difficult to accurately judge how the SX-3 will compare with the
new U.S. supercomputers that should be delivered within the next year,
but it is clear that it should be at least competitive with them.  It
would be very useful for western researchers to have an opportunity to
study, test, and use this computer.  We have not had any chance to run
on the SX-3 although potential customers have had a few of their
important programs benchmarked.  One system, probably a two or four
processor version, will be installed in the NEC HNSX facility in
Houston.  We do not know about access to that system; however, in the past the
most impressive learning has occurred when supercomputers were "on
site".  Real benchmarking can only occur when a computer is used day in,
day out, and all aspects of its capabilities, problems and reliability
are uncovered.  If it isn't practical to get an SX-3 into a major U.S.
laboratory, we should consider the possibility of sending computational
scientists to Japan for several months or even a year, in order to
thoroughly evaluate the machine. NEC should be interested in these
efforts too.

MISCELLANEOUS NEC PARALLEL PROCESSING ACTIVITIES.
Other than the SX-3 series supercomputer, NEC has been involved in at
least four other parallel processing activities.  These are: (1) FMP
(Fingerprint Machine Computer) with 28 processors, (2) VSP-4 (Video
Signal Processor) with 128 processors, (3) HAL (Parallel Processing
Logic Simulator) with 64 processors, and (4) CENJU (Parallel Processing
Circuit Simulator) with 64 processors.  FMP is a commercial product and
about 180 FMP systems have been shipped worldwide.  The VSP
effort has influenced the NEC Visualink-1000 which is commercially
available.  HAL has been used in designing NEC SX-3, SX-3 Series and
other general purpose computers since 1985.  CENJU is an experimental
machine being used for design of DRAM, and Kahaner reported on CENJU in
a 2 July 1990 report "spice".

-----------END OF PART 2-------------------------------------------------