Xref: utzoo soc.culture.japan:5842 comp.sys.super:260
Path: utzoo!attcan!uunet!fernwood!apple!usc!elroy.jpl.nasa.gov!ncar!noao!arizona!rick
From: rick@cs.arizona.edu (Rick Schlichting)
Newsgroups: soc.culture.japan,comp.sys.super
Subject: Kahaner Report: Parallel Computing in Japan (Part 2)
Message-ID: <119@saguaro.cs.arizona.edu>
Date: 6 Nov 90 01:56:08 GMT
Followup-To: soc.culture.japan
Organization: U of Arizona CS Dept, Tucson
Lines: 728

[Dr. David Kahaner is a numerical analyst visiting Japan for two-years under the auspices of the Office of Naval Research-Far East (ONRFE). The following is the professional opinion of David Kahaner and in no way has the blessing of the US Government or any agency of it. All information is dated and of limited life time. This disclaimer should be noted on ANY attribution.]

[Copies of previous reports written by Kahaner can be obtained from host cs.arizona.edu using anonymous FTP.]

To: Distribution
From: David Kahaner ONRFE [kahaner@xroads.cc.u-tokyo.ac.jp]
      H.T. Kung CMU [ht.kung@cs.cmu.edu]
Re: Aspects of Parallel Computing Research in Japan---NEC & Fujitsu.
Date: 6 Nov 1990

ABSTRACT. Some aspects of parallel computing research in Japan are analyzed, based on authors' visits to a number of Japanese universities and industrial laboratories in October 1990. This portion of the report deals with supercomputing and parallel computing at NEC and Fujitsu.

PART 2.

The following outline describes the topics that are discussed in the various parts of this report.

PART 1 OUTLINE------------------------------------------------------------
 INTRODUCTION
 SUMMARY
 RECOMMENDATIONS

PART 2 (this part) OUTLINE------------------------------------------------
 FUJITSU OVERVIEW
    Company profile and computer R&D activities
    VP2000 series supercomputer organization and performance
 PARALLEL PROCESSING ACTIVITIES
    SP (Logic Simulation Engine)
    AP1000 (Cellular Array Processor)
    RP (Routing Processor)
    ATM (Asynchronous Transfer Mode) Switch
 MISCELLANEOUS FUJITSU ACTIVITIES
    Neurocomputing
    HEMT
 NEC
    SX-3 series supercomputer organization and performance
    Benchmark data for SX-3, VP2000, and Cray.
    Comments
 MISCELLANEOUS NEC PARALLEL PROCESSING ACTIVITIES

PART 3 OUTLINE------------------------------------------------------------
 HITACHI CENTRAL RESEARCH LABORATORY
    HDTV
 PARALLEL AND VECTOR PROCESSING
    Hyper crossbar parallel processor, H2P
    Parallel Inference Machine, PIM/C
    Josephson-Junctions
    Molecular Dynamics
 JAPAN ELECTRONICS SHOW, 1990
    HDTV
    Flat Panel Displays
 MATSUSHITA ELECTRIC
    Company profile and computer R&D activities
    ADENA Parallel Processor
 MISCELLANEOUS ACTIVITIES
    HDTV
    Comments about Japanese industry

PART 4 OUTLINE-----------------------------------------------------------
 KYUSHU UNIVERSITY
    Profile of Information Science Department
    Reconfigurable Parallel Processor
    Superscalar Processor
    FIFO Vector Processor
    Comments
 ELECTROTECHNICAL LABORATORY
    Sigma-1 Dataflow Computer and EM-4
    Dataflow Comments
    CODA Multiprocessor
 NEW INFORMATION PROCESSING TECHNOLOGY
    Summary
    Comments
 UNIVERSITY OF TSUKUBA
    PAX
 SANYO ELECTRIC
    Company profile and computer R&D activities
    HDTV

END OF OUTLINE----------------------------------------------------------

FUJITSU OVERVIEW.

Currently about a US$16 billion corporation (based on 158 yen/$), with sales and income growing about 10%/year.
As with most Japanese companies, Fujitsu includes many subsidiaries (Fujitsu Laboratories, Fujitsu Business Systems, Fujitsu America, etc.) and affiliates, and has about 115,000 employees, about 50,000 in Fujitsu proper and the remainder in associated companies. R&D expenses are about 12% of sales and have been increasing more rapidly than sales. Corporate sales are divided as follows.

     Computers            66%
     Communications       16
     Electronic devices   14
     Other                 4

The most important factor in sales growth was the rapid growth in overseas (outside Japan) sales, now accounting for about one fourth of the total. The company states that its major strategic objectives are to strengthen activities in information management and to further globalize the company. Recently they purchased 80% of British based ICL (International Computers Ltd). Global research and development, including software development, is mentioned as a specific goal.

The company develops and markets a wide range of computers and related peripherals such as disk subsystems. The computers range from a 32-bit workstation with built-in CD-ROM and secretary-friendly video and sound, the FM-Towns (apparently available only in Japan), to a large scale supercomputer, the VP2000 series, whose deliveries began in spring 1990.

A vast range of semiconductor devices, memories, etc., and other new technologies are sold outside the company and also used in Fujitsu specific products. For example, Sun SPARC chips were originally purchased directly from Fujitsu. The company is also very active in important areas of switching and telecommunication technologies related to HDTV, digital switching systems, etc. Fujitsu is also researching high compression rate encoding for visual telephones and TV conferencing, as well as encoding methods for HDTV and variable rate encoding methods for future packet communications.

The main research arm of Fujitsu is Fujitsu Laboratories, a subsidiary corporation that operates two laboratories, one in Kawasaki and the other in Atsugi, both in suburban Tokyo. Total employment is about 1500. The Atsugi lab, established in 1983, is responsible for research in areas of electron devices, electronic systems, and advanced materials. The Kawasaki lab, established in the mid 1960s, is on the grounds of some other Fujitsu facilities, so that the total working population there is over 12,000. The Kawasaki lab concentrates on information processing, communication, space, and personal systems. The overall educational background of the laboratories is interesting.

     Electronics             48%
     Physics                 19
     Computer Science        10
     Chemistry               10
     Mechanical Engineering   5
     All others               8

This is certainly one reason for the wealth of activities in hardware relative to software. Half of the staff have Masters degrees; only 10% hold doctorates.

As mentioned above, Fujitsu is working hard to be a global corporation. That means both R&D and manufacturing outside of Japan. For example, Fujitsu signed a five year joint research agreement in October 1989 with the Australian National University in Canberra. Subjects include advanced computers, both large scale supercomputers and more exotic parallel computers, and computer vision using the visual mechanism of insects. Another global research project is with the German software company Aris, to develop software for automatic translation of Japanese technical materials and documents into German. When complete, the system will contain a dictionary, syntax for generating German, and appropriate development tools for both the dictionary and the syntax.

Various natural language processing and voice recognition systems are also under study, as is a real-time fingerprint sensor system using holography, and an on-line handwritten input system claimed to be able to correctly recognize Kanji, Katakana and Hiragana Japanese characters.
Unfortunately we had no opportunity to see any of these last projects.

Fujitsu computers are heavily used in the mainframe world. The company's efforts in large scale supercomputers are interesting. More than 100 orders have been received for computers in the VP2000 series. The most powerful model, the VP2600, has a maximum performance of about 5 gigaflops. According to Fujitsu at least one VP2000 has been installed in Kodak headquarters in Rochester NY.

What follows is a brief summary of the Fujitsu VP2000 series supercomputers. Fujitsu offers four models in this series, as follows.

     VP2100  /10, /20 (peak performance 0.5 GFLOPS)
     VP2200  /10, /20 (peak performance 1.0 GFLOPS), /40 (peak 2.0 GFLOPS)
     VP2400  /10, /20 (peak performance 2.0 GFLOPS), /40 (peak 5.0 GFLOPS)
     VP2600  /10, /20 (peak performance 5.0 GFLOPS)

Models designated as /10 have one scalar and one vector arithmetic unit. Models designated as /20 have two scalar and one vector arithmetic units. Models designated as /40 have four scalar and two vector arithmetic units. The /10 and /20 systems are uniprocessor, the /40 is multiprocessor. Their nomenclature is mildly confusing, as the designation /x0 corresponds to the number of scalar rather than vector units, even though the latter determine peak performance.

Fujitsu is deeply interested in multiprocessing; one indication has been their MITI-sponsored research jointly with NEC and Hitachi, called informally the HPP project, involving four VP2600s each operating as a uniprocessor attached to a very large shared buffer memory. Fujitsu claims that such a large multiprocessor was developed mainly to demonstrate their success with room temperature HEMT devices (see below) as the communications drivers between the computers and memory. Nevertheless, using this, an NEC researcher was able to solve a very large system of 32K linear equations in less than 11 hours. For more details see Kahaner's report 21 June 1990, "japgovt".

Fujitsu is probably experimenting on a /40 multiprocessor for the VP2600, but has not released any public information about this. Without a /40 for the VP2600, Fujitsu's VP2000 series peak performance (however unrelated to actual performance) will fall short of current competition from NEC as well as new machines from Cray, and perhaps others. In the meantime though, the VP2000 series comes in a variety of colors, including Elegance Red, Future White, and Florence Green.

Peak performance of the /10 and /20 models in any line is the same, as this is determined entirely by vector processing. Peak performance can easily be computed once the machine cycle time and the maximum possible number of simultaneous floating point operations are known. For example, the VP2400/40 and VP2600 each have cycle times of 3.2 nanoseconds. To achieve the advertised 5.0 GFLOPS peak implies 16 simultaneous floating point operations. For the VP2400/40 this requires eight per vector unit, while for the VP2600/20 sixteen simultaneous operations are required of its single vector unit. Each of Fujitsu's vector units is described as having two arithmetic pipes, but in reality they are more complicated. Each pipe is capable of simultaneously performing both an addition and a multiplication. In addition the pipes effectively deliver twice (VP2400/40) or four times (VP2600/20) as much data. Thus each pipe on the VP2600/20 can produce the results of four floating point additions and four floating point multiplications per cycle. This is similar to the "superword" concept on the ill fated Cyber 205. Of course, if a calculation is dyadic, that is, it does not involve both a multiplication and an addition, then the peak performance will be reduced by 50%.

By studying the performance of VP2000 machines on typical job streams it has been observed that when the scalar unit is 100% in use, the vector unit is about 50% to 75% busy.
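
The peak figures quoted above follow from simple arithmetic: peak rate is the number of floating point results per cycle divided by the cycle time. The short Python sketch below reproduces the 5.0 GFLOPS figure from the 3.2 nanosecond cycle time. The function and variable names are ours, chosen for illustration, and the per-pipe breakdown for the VP2400/40 (two additions and two multiplications per cycle) is our reading of the "twice as much data" remark, not a Fujitsu specification.

```python
# Illustrative sketch only. Names are hypothetical; the numbers are those
# quoted in the report (3.2 ns cycle, two pipes per vector unit).

def flops_per_cycle(vector_units, pipes_per_unit, adds_per_pipe, mults_per_pipe):
    """Floating point results produced per machine cycle."""
    return vector_units * pipes_per_unit * (adds_per_pipe + mults_per_pipe)


def peak_gflops(cycle_ns, results_per_cycle):
    """Results per nanosecond is the same number as GFLOPS."""
    return results_per_cycle / cycle_ns


CYCLE_NS = 3.2

# VP2600/20: one vector unit, each pipe does 4 adds and 4 multiplies per cycle.
vp2600_20 = flops_per_cycle(1, 2, 4, 4)

# VP2400/40: two vector units; 2 adds and 2 multiplies per pipe is ASSUMED
# from the "twice as much data" description.
vp2400_40 = flops_per_cycle(2, 2, 2, 2)

# Add-only work on the VP2600/20 uses half of each pipe.
vp2600_add_only = flops_per_cycle(1, 2, 4, 0)

for name, results in (("VP2600/20", vp2600_20),
                      ("VP2400/40", vp2400_40),
                      ("VP2600/20, additions only", vp2600_add_only)):
    print(name, results, "results/cycle,",
          round(peak_gflops(CYCLE_NS, results), 2), "GFLOPS peak")
```

Both full configurations come out at 16 results per cycle, which is the count needed for the advertised peak, and the add-only case shows the 50% reduction mentioned for calculations that cannot pair a multiplication with an addition.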
Because the vector unit is idle a quarter to half of the time when a single scalar unit is saturated, the addition of a second scalar unit can significantly increase throughput, and this was presumably Fujitsu's reason for adding it. However, for any single user problem it might not be possible to keep the vector unit constantly busy. Thus the most practical environment for such a setup would be a computing center or other multi user job shop, where several user jobs can be run simultaneously.

Kyoto University, a typical busy university computing center, will be getting a VP2600/10 soon. We asked why it will have only one scalar processor. Although the university made a very strong case for two scalar processors, the Ministry of Education decided (based on budgetary, or other, grounds) to only support the one scalar processor system. However it is an easy field upgrade to add the second scalar unit. The choice of a VP2600/10 rather than a VP2400/40 was a matter of policy; Kyoto has always tried to purchase the fastest machine available. It is also possible that they would like to upgrade eventually to a multiprocessor 2600 when this is available.

As is the case with most of today's vector supercomputers, data to and from the vector arithmetic units need to pass through vector registers. In the VP2600 these registers have a capacity of 128KB (64 elements times 256 registers times eight byte data) but can be concatenated in various ways, for example as 2048 times 8 times eight byte instead. Thus the organization of the registers is very flexible. To get data between memory and the vector registers Fujitsu only provides two load/store pipelines. This could be a bottleneck, although the register flexibility may alleviate it to a certain extent.

Memory to register bandwidth has been criticised in the VP2000 series, but at least one new benchmark, given below, suggests that Fujitsu has been making efforts to deal with this. The computation of interest is that of multiplying large matrices A=B*C, each of which is 4096 by 4096, with real 64 bit floating point components.
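
As a rough check on the sizes involved, the Python sketch below confirms that both register organizations quoted above give the same 128KB capacity, and compares that with the storage needed for one 4096 by 4096 matrix of 64 bit values. The names are ours, for illustration only, and the operation count of 2*N**3 is the conventional figure for a straightforward matrix multiplication, which we assume and do not take from Fujitsu.

```python
# Illustrative sketch only. Names are hypothetical; sizes are those in the text.

BYTES_PER_ELEMENT = 8  # 64 bit floating point data


def register_file_bytes(elements_per_register, registers):
    """Capacity of the vector register file for a given organization."""
    return elements_per_register * registers * BYTES_PER_ELEMENT


default_layout = register_file_bytes(64, 256)   # 64 elements x 256 registers
long_layout = register_file_bytes(2048, 8)      # 2048 elements x 8 registers

N = 4096
matrix_bytes = N * N * BYTES_PER_ELEMENT
matrix_to_register_ratio = matrix_bytes // default_layout

# ASSUMPTION: one multiply and one add per inner-product term, N**3 terms.
multiply_flops = 2 * N ** 3

print("register file:", default_layout // 1024, "KB in either layout")
print("one matrix:", matrix_bytes // (1024 * 1024), "MB")
print("matrix / register file:", matrix_to_register_ratio)
print("floating point operations:", multiply_flops)
```

Each matrix is roughly a thousand times larger than the vector register file, so the operands must stream through the two load/store pipelines throughout the calculation.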
The source program is written in 100% standard Fortran but is organized to take advantage of the two pipe structure of the VP2000 architecture in a very clear way. The essential segment of the source program consists of first zeroing the target array.

      DO 4000 J=1,4096
      DO 4000 I=1,2048
         A(I,J)=0.0
         A(I+2048,J)=0.0
 4000 CONTINUE

Then the actual multiplication is as follows.

      DO 5000 L=0,1
      DO 5000 J=1,4096
      DO 5000 K=1,4096,4
      DO 5000 II=1,2048
         I=II+(2048*L)
         A(I,J)=A(I,J)+B(I,K)*C(K,J)+B(I,K+1)*C(K+1,J)
     *          +B(I,K+2)*C(K+2,J)+B(I,K+3)*C(K+3,J)
 5000 CONTINUE

In this case the matrices are large enough that there is significant memory to register to memory traffic. Nevertheless, Fujitsu's FORT77/VP compiler is able to vectorize this effectively and generate 4.8 GFLOPS, 96% of peak performance.

One comment is worth making here. At the InfoJapan 90 meeting a lecture was presented by Nobuo Uchida, from the Mainframe division of Fujitsu, on the architecture of the VP2000 series computers. We found it particularly interesting that his paper made no mention of the /40 series in the VP2000 lineup. The English product announcement about the /40 had been distributed shortly before the meeting, and the Japanese announcement was available weeks before that. Because the /40 is a multiprocessor, it represents a most important addition to their product line. The characteristics and properties of new advanced computers are of real interest to the research community, especially those who travel long distances to hear about them. Perhaps there was a manuscript revision that we did not notice. Nevertheless, it was disappointing that this new system was not included in his discussion. Perhaps it is related to Fujitsu's silence about a VP2600 multiprocessor.

FUJITSU'S ACTIVITIES IN PARALLEL PROCESSING.

In our recent visit to Fujitsu Laboratories, we visited the following three parallel processing projects.

(1) SP (Logic Simulation Engine).
This is a special purpose 64 processor event driven parallel computer designed to test the logic design of VLSI chips before they are built. It is claimed that it has larger capacity than any other simulator and that simulation times are about 30 times faster than using Fujitsu's 780 mainframe. Testing a 1MB gate chip takes about 4 hours on the SP, and this is 1000 times faster than the 780. The SP is implemented in TTL, with gate arrays for the ECC implementation. (Fujitsu can build 200K gate, 331-pin arrays currently.) Ten SP machines have been built, and 2 are in use by Amdahl in the U.S. The others are for internal use. Fujitsu claims that partly due to its use of event driven simulation, SP is 100 times faster than the IBM Yorktown Simulation Engine and feels that the SP is a successful effort. (NEC Corp also has a logic simulator, Hal II and TDHal.) It seems that most computer companies in Japan have developed their own special purpose parallel engines for logic simulation for their internal use. (2) AP1000, renamed from older CAP (Cellular Array Processor). This is composed of up to 1024 cells or processors. Each cell is composed of a SPARC chip (for ease of software development), Weitek floating point unit and gate array router running at 25MHz, and 16MB of memory. Cells can communicate using wormhole routing in a two dimensional mesh using 25MB/sec channel. The standard structured buffer pool is used to avoid deadlocks. The network also supports row and column broadcasting. The router and SPARC connection is 40 MBytes/sec. Since the connection is also shared by the CPU cache, the actual available bandwidth is still under evaluation. In addition, a special frame buffer can read out from each cell so that image data can be partitioned up among cells efficiently. Maximum performance is 12.5MFLOPS/cell, and 12.8GFLOPS for a fully configured 1024 cell system. AP1000 has good (but not spectacular) communication and good numerical performance potential. 
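
The system-level figures above are simple multiples of the per-cell numbers, as the following Python sketch shows. The names are ours, for illustration only; the per-cell peak, the memory size, and the maximum cell count are those quoted in the text.

```python
# Illustrative sketch only. Names are hypothetical; numbers are from the text.

MFLOPS_PER_CELL = 12.5     # quoted maximum performance per cell
MEMORY_MB_PER_CELL = 16    # quoted memory per cell
MAX_CELLS = 1024           # fully configured system


def system_peak_gflops(cells):
    """Aggregate peak, assuming every cell runs at its quoted maximum."""
    return cells * MFLOPS_PER_CELL / 1000.0


def system_memory_gb(cells):
    """Aggregate distributed memory (1 GB taken as 1024 MB)."""
    return cells * MEMORY_MB_PER_CELL / 1024.0


for cells in (64, MAX_CELLS):
    print(cells, "cells:",
          system_peak_gflops(cells), "GFLOPS peak,",
          system_memory_gb(cells), "GB memory")
```

The aggregate peak is an upper bound that ignores communication; how closely a real program approaches it depends on how much traffic crosses the 25MB/sec mesh channels.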
Fujitsu expects that it will typically be connected to a Sun-4 as a host via a VME bus. This project has been going on for a number of years under the old name CAP. (CAP is also the name of the Cellular Array Processor developed by Mitsubishi Electric for satellite image processing. As far as we know there is no relation between these projects.) A team of about 10 people have been at work on the AP1000 for two years. The new AP1000 system is much more powerful, primarily because of the use of SPARC chips and Weitek floating-point chips. In contrast, the old system used Intel 80186 chips. Present plans are to begin production this fall with installation of 7 or 8 machines in spring of 1991. Of these, most are to be 64-cell systems and one is to be a 512-cell or 1024-cell system. (A 1,024-cell system is scheduled to be built in April 1991.) Currently a 16-cell system is running. The 64-cell system, with about 800 MFLOPS peak performance should cost the company about $300K U.S. We were shown a straightforward ray-tracing example which is a perfect candidate for data parallelism. The system currently has a home-made run-time system, and no parallelizing compiler for either C or Fortran. We were told that in addition to scientific computing, visualization, and CAD, one potential application was for design rule checking, but in that case it isn't clear why floating point is necessary. The Australian National University will get a 128-node AP1000 system and will help with software development and evaluation. (Contact: Prof. M. McRobbie [mam@arp.anu.edu.au]). As with the earlier CAP project, Fujitsu has a nice color sales brochure about the AP1000, but this is still considered an experimental machine. Probably its most important uses will be internal to Fujitsu, similar to the SP model. 
We feel that the project is probably a few years behind similar work at leading research places in the U.S., primarily because of the differences in software and interprocessor communications capabilities. Two contacts for this project are given below.

     Mitsui Ishii [mishi@flab.fujitsu.co.jp]
     Hiroyuki Sato [hsat@flab.fujitsu.co.jp]
     Fujitsu Laboratories
     1015 Kamikodanaka Nakahara-ku
     Kawasaki 211, Japan
     Tel: (044) 777-1111, -2327

(3) RP (Routing Processor). This is a special-purpose SIMD machine to implement maze routing. A performance goal is to route large (e.g., 100K-gate) gate arrays in approximately one hour. To implement the machine, bit-serial PEs (Processing Elements) are used. A 4K-PE system is operational. We saw a successful demonstration of the system doing a difficult switch box routing. Since in maze routing only PEs on the wave front are active at a given time, the system will typically multiplex four "logical" PEs onto each "physical" PE to ensure efficient utilization of physical PEs. Approximately 5 people have been working on the RP project for two years. They are currently building a 16K-PE RP.

A challenge of using special purpose CAD engines such as the RP is graceful integration with the rest of the CAD system. Also, it is not clear how the RP can take advantage of hierarchical information available in a design. Fujitsu researchers are looking at these issues.

ATM Switch. In addition to the three parallel processing projects described above, we also visited a major project on the development of an ATM (Asynchronous Transfer Mode) switch. The basic idea is that data is divided into cells, which are 53 byte packets, and then transmitted along the transmission path without synchronization. The application area here is ISDN and HDTV. Such a switching system will be able to handle multi-media communication of voice, data, video, etc. Fujitsu has been working on this project for several years, and claims to have prototyped the world's first ATM switch. Built out of a special IC using a BI-CMOS RAM and logic gate array, the current system is a 16 by 16 switch, of three stages with two 8 by 8 crossbar switches per stage. Each port runs at 78 MHz and is 16 bits wide, allowing for 1.2 Gbits/second per port. The 16 by 16 switch, housed in one cabinet, therefore can handle 128 channels of 150 Mbits/second each. There is a 128-cell buffer at each output port of every crossbar. Switch routing is based on the destination tag, corresponding to the virtual circuit identifier (VCI) number. Cell sequencing is maintained, but cells may lose data if there is congestion. Presently, two 16 by 16 prototypes have been built and are being used to evaluate cell loss characteristics. Eventually a SONET interface will be installed, but this is not supported yet. Instead a proprietary interface is being used during the testing phase of the project.

In parallel processing the company's research effort emphasizes special-purpose machines such as SP and RP more than we would expect from a U.S. company. The best research projects, such as the ATM switch, SP, and RP, are completely driven by development needs. The strongest efforts seem to be related to switching and CAD related issues. Projects more to do with basic research, such as AP1000, do not seem to be as advanced compared to work in the U.S.

MISCELLANEOUS FUJITSU COMPUTING ACTIVITIES.

Neurocomputing. The usual metric here is the number of changes to the weight matrix that are possible each second. Earliest research in neurocomputing used traditional computers to simulate the architecture of a neural network. The next step is to implement some aspects of the network in hardware. By using special purpose digital signal processor chips Fujitsu has demonstrated more than 500 million connection changes per second.
A longer range goal is to use biological elements as part of the architecture, but we have seen no substantial results yet. Associated with neuro computers are various forms of inference engines that are often implemented with robot applications in mind. Fujitsu has also been working in these areas with particular emphasis on robot vision. This again relies on special purpose hardware. They have also used fuzzy logic to study driverless vehicles and obstacle avoidance. They have developed the Idaten color image processing system which can be used to distinguish objects moving at different speeds, and so, for example, to do real time scanning of a runner, determine speed and stride and then estimate the time to the finish line. This particular research has applications in many other areas and should be followed.

Another neural net research project has been joint with Nikko Securities to investigate how well neural nets can predict the buy/sell times for stock transactions and to rate convertible bonds by looking at various financial indices.

     Takashi Kimoto
     Computer Based Systems Lab
     Fujitsu Laboratories, Kawasaki
     1015 Kamikodanaka, Nakahara-ku,
     Kawasaki 211, Japan

Electrical devices, including an 8 bit Josephson digital signal processor, and room temperature HEMTs (High Electron Mobility Transistors). In 1980 Fujitsu developed the HEMT. At liquid nitrogen temperatures, -196C, electrons move about 200 times as fast as they do in silicon. As part of the government sponsored "high speed computing" project Fujitsu has now developed a 4K-bit static RAM that operates at room temperature with a 500 picosecond clock (the fastest memory operations yet reported), and a 4.1K-gate gate array. Further developments have resulted in a chip with 3335 HEMTs with 490ps data propagation time. Fujitsu claims that they will use this in a new version of a supercomputer they will soon build. Presently, several prototype system components at the LSI level have been built.
These are a 1.1K-gate bus driver, a 3.3K-gate random number generator (1.6GHz), and an 8-bit digital-to-analog converter (1.2GHz). This technology, which is almost completely proprietary to Fujitsu, may be significantly useful in future computing systems. However, since the HPP project is over, it will not be easy for Fujitsu to build these kinds of experimental supercomputers unless they can be supported by some new government programs.

Our overall host for this visit was

     Mr. Shigeru Sato
     Board Director
     Fujitsu Laboratories
     1015 Kamikodanaka Nakahara-ku
     Kawasaki 211, Japan
     Tel: (044) 777-1111

Mr. Sato spent many years in one of Fujitsu's development "works" before moving to the laboratory. We were impressed with his basic grasp of technical issues and understanding of the role that research plays in the development cycle. We asked him if the efforts of other Japanese companies (such as NEC) to establish research laboratories outside of Japan had any parallel at Fujitsu. He explained that Fujitsu had several active research collaborations, including the one at the Australian National University mentioned above, and that it was also looking into the possibility of having closer contacts with some U.S. universities such as Carnegie Mellon, in Pittsburgh. Although he was remarkably frank with us, we didn't have time to discuss strategic issues with Sato. We did ask about the success of technology transfer, and he suggested that one reason for its success is that researchers define the research project with development groups before the project actually begins.

Two days after this initial visit, Kung, with T. W. Kang (General manager, Systems Group of Intel Japan), went back to visit the Fujitsu Laboratories again for a meeting with their researchers. The purpose of the meeting was to discuss applications areas for iWarp-like distributed memory parallel machines. We identified several potential areas and had some lively discussions.
It was generally felt that some CAD areas and neural net learning can make the best use of parallel machines. In the CAD area, we predicted that the expected speed up ratio due to parallel processing will be 100,000 for logic simulation, 1,000 for test-pattern generation and for placement and routing, and 100 for design rule check and circuit simulation. The fruitful discussion meeting was organized by:

     Fumiyasu Hirose, Senior Researcher
     Artificial Intelligence Laboratory
     Fujitsu Laboratories LTD.
     1015, Kamikodanaka Nakahara-Ku
     Kawasaki 211
     Tel: (044) 754-2663
     FAX: (044) 754-2580
     Email: hirose@yugao.stars.flab.fujitsu.co.jp

NEC.

Kahaner visited this factory in March 90 and reported on the SX-3 at that time. At that time the only running system had one processor. Now, several one processor machines are being tested prior to shipment and a two processor system has been set up and is being debugged. Chief designer Watanabe stated that a one processor system, depending upon peripheral options, would cost in the neighborhood of $10 million U.S. He claimed that the 4 processor system will be up in a few months, and we have heard estimates that it will cost roughly $25 million.

Peak performance of a uniprocessor system is 5.5 GFLOPS, based on a cycle time of 2.9 nanoseconds and 16 simultaneous operations (16/2.9=5.5). The vector unit in such a system consists of one, two, or four sets of vector pipelines. Each vector pipeline set consists of two add/shift and two multiply/logical functional pipelines. Each of the functional pipelines can be operated simultaneously; thus the arithmetic processor in a uniprocessor system with four vector pipeline sets can execute up to 16 floating point operations per machine cycle. To get near peak performance all 16 pipes must be kept busy. Data are fed to the arithmetic pipes from, and returned to, vector registers with a maximum capacity of 144KB. It is unlikely that an SX-3 system would be purchased without all four pipeline sets in each processor.
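
The peak figure can be reproduced from the pipeline counts just described. The Python sketch below is illustrative only; the names are ours, and it assumes, as the text does, that every functional pipeline delivers one floating point result per cycle.

```python
# Illustrative sketch only. Names are hypothetical; numbers are from the text.

CYCLE_NS = 2.9                       # quoted SX-3 cycle time
ADD_SHIFT_PIPES_PER_SET = 2          # per vector pipeline set
MULTIPLY_LOGICAL_PIPES_PER_SET = 2   # per vector pipeline set


def flops_per_cycle(pipeline_sets):
    """Simultaneous floating point operations in one processor."""
    pipes_per_set = ADD_SHIFT_PIPES_PER_SET + MULTIPLY_LOGICAL_PIPES_PER_SET
    return pipeline_sets * pipes_per_set


def peak_gflops(pipeline_sets, processors=1):
    """Operations per nanosecond, which is numerically GFLOPS."""
    return processors * flops_per_cycle(pipeline_sets) / CYCLE_NS


for sets in (1, 2, 4):
    print(sets, "pipeline set(s):", flops_per_cycle(sets), "ops/cycle,",
          round(peak_gflops(sets), 2), "GFLOPS peak")

print("four processors, four sets each:",
      round(peak_gflops(4, processors=4), 1), "GFLOPS peak")
```

With four pipeline sets the quotient is 16/2.9, which rounds to the 5.5 GFLOPS quoted for the uniprocessor; four such processors give about 22 GFLOPS.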
The four processor system is thus capable of 22 GFLOPS peak, although this assumes that all the data can be kept in the vector registers. To the extent that data must be brought from main memory to the registers performance may degrade. The bandwidth between memory and the registers depends on the memory hardware technology, and on how the data is arranged in the memory banks, but serious applications must keep data in registers to get good performance. Further, 22 GFLOPS requires 64 simultaneous operations, and this will mean that different operations have to occur simultaneously. Also, unless the user program can be divided up into simultaneous, independent tasks that use the same data in the vector registers, arrays will have to be quite long to absorb the startup penalty of being parcelled out to several processors. The most effective environment for such multiprocessors is a busy multiuser computer center, similar to that for other large multiprocessors. Most computer centers will charge a penalty for single users who want to grab all four processors. Yoshihara also discussed some aspects of this in benchmark calculations earlier this year, see Kahaner's distribution 1 May 1990 "yosh". At least three or four uniprocessor systems have been sold, in Europe. We were not told about sales of two or four processor systems. Users can write Fortran without any special directives. NEC provides an automatic parallelizing and vectorizing compiler option. We had no opportunity to test this. Watanabe showed us results of running 100 by 100 LINPACK (all Fortran) giving performance on the SX-3 Model 13 (uniprocessor) and several other supercomputers as follows. He also showed some corresponding figures for 1000 by 1000 linear system and for 1024 by 1024 matrix multiplication given below. The last two columns correspond to what Dongarra calls "best effort". There are no restrictions on the method used or its implementation. 
Matrix multiplication runs almost at theoretical peak speed. The large linear system runs at slightly less than 70% of peak, while on the Cray the same calculation runs at just above 80%. The differences are probably associated with bandwidth from memory to the vector registers. Nevertheless, at 3.8 GFLOPS the SX-3 is 80% faster than the Cray.

                             Ax=b          Ax=b          A=B*C
                             LINPACK       Best Effort   Matrix Mult
                             100 x 100     1000 x 1000   1024 by 1024
                             Fortran
 SX-3/14                     216 MFLOPS    3.8 GFLOPS    5.1 GFLOPS
 Fujitsu VP2600              147           2.9           4.8 (4096 by 4096)
 Hitachi S-820/80            107
 Cray Y-MP8 (8 processors)   275           2.1
 Cray Y-MP1 (1 processor)     90
 Cray X-MP4                                0.8

(Note: VP2600 model was not specified for the Ax=b figures, and was /10 for A=B*C, but both 2600/10 and /20 have the same peak performance, 5 GFLOPS.)

To the best of our knowledge, figures for the NEC and Fujitsu machines are new. We asked Watanabe if the SX-3 four processor performance would scale up, and he only exclaimed "God knows".

NEC's chip technology is very good. Using ECL, they have crammed 20,000 gates with 70 picosecond switching time onto one chip. We think that this is better than in the U.S. A 1,200-pin multi-chip package can hold 100 such chips and dissipate 3K watts. Packaging, carrier, and cooling technology is about as good as in the U.S. NEC claims that they have taken extra care to design in error testing capability and that about 30% of their chip area is associated with diagnostic functions. (This is certainly different from some U.S. manufacturers.) The memory system uses 20ns 256Kbit SRAMs. A memory card can hold 32 MBytes. Thus a memory cabinet with 32 memory cards has 1 Gbyte.

Two peripherals are worth noting. NEC makes a cartridge tape unit (IBM compatible tapes), fully automated, with 1.2 terabyte capacity. NEC also makes a disk array made of eight byte-interleaved disks. Used as a single disk drive, the disk array has a 5.5 gigabyte capacity.
The burst transfer rate is 19.6 MBytes/sec, whereas the sustained transfer rate is 15.7 MBytes/sec.

NEC has begun publication of a newsletter about the SX-3, SX World. Interested readers can obtain a copy by writing NEC, 1st Product Planning Department, EDP Product Planning Division, 7-1 Shiba 5-chome, Minato-ku, Tokyo 108-01, Japan. In it, their view of supercomputing is stated explicitly: "the actual performance of a supercomputer is determined by its scalar performance...NEC's approach to supercomputer architecture is clear. Our first priority is to provide high-speed single processor systems which have vector processing functions and are driven by the fastest technologies, while giving due consideration to ease of programming and ease of use; we also seek to provide shared memory multiprocessor systems to further improve performance."

The SX-3 looks like an exciting machine that is on a par with the best currently available U.S. products. There is a new U.S. supercomputer from Cray Research nearly ready to be released, as well as perhaps models from Cray Computer Corporation and others, but we have no concrete information about their performance. In its four processor version, the SX-3 might be the fastest large scale supercomputer, but this will be entirely dependent on the application and the skill of the compiler writers. Fujii and Tamura ("Capability of Current Supercomputers for Computational Fluid Dynamics", Inst of Space and Astronautical Sci, Yoshinodai 3-1-3, Sagamihara, Kanagawa, 229 Japan) note that "Basically the speed of the computations simply depend on when the machines were introduced into the market. Newer machines show better performance, and companies selling older machines are going to introduce new machines."

NEC develops software expertise by use of in-house training; they have a "college" for their employees. For example, Watanabe is in charge of courses related to machine design.
They also have a long history of vector computing experience, as NEC mainframes have had vector pipes for many years. They do not have experience in large scale multiprocessors as far as we know, except through the HPP project, which was never commercialized. To develop software, NEC relies on 30 or so of its subsidiaries in various parts of Japan, so software is often developed in a distributed manner.

Watanabe told us that NEC did not have any plans to develop a smaller general purpose multiprocessor, as they felt that the market would not support the volume that would be required for profitability.

Watanabe has moved from the SX-3 factory to the corporate headquarters as a strategic product planner. The latter is one of the largest buildings in central Tokyo, and is shaped exactly like the U.S. space shuttle except for a huge gaping hole in its center to reduce wind loading. It is said to be the "world's smartest building." Watanabe illustrates the remark made earlier about senior research people moving into other corporate functions.

     Dr. Tadashi Watanabe
     Assistant General Manager
     EDP Product Planning Division
     NEC Corporation
     7-1, Shiba 5-chome
     Minato-ku, Tokyo 108-01
     Tel: (03) 798-6830 (Direct), (03) 454-1111
     Fax: (03) 798-6838

As far as innovative architectures are concerned, the SX-3 does not seem to represent a substantial leap from state-of-the-art supercomputers. Researchers in parallel computing are not excited by shared memory machines, which they feel cannot scale up to make the kind of quantum increases in computing speed that they are seeking. But as an engine for solving complicated scientific and engineering problems, a factor of two, or even a percentage improvement, translates into real money and new science.

What is significant is how far NEC has come in relatively few years. Now NEC has state-of-the-art capabilities in all aspects of supercomputers, except perhaps in some software applications.
They do not give away any area and make every effort to build everything themselves.

Customers seem quite loyal in Japan. Software compatibility with existing systems, and personal relationships between vendor and customer are important here, perhaps taking the edge off price differences or delivery dates.

It is difficult to accurately judge how the SX-3 will compare with the new U.S. supercomputers that should be delivered within the next year, but it is clear that it should be at least competitive with them. It would be very useful for western researchers to have an opportunity to study, test, and use this computer. We have not had any chance to run on the SX-3, although potential customers have had a few of their important programs benchmarked. One system, probably a two or four processor version, will be installed in the NEC HNSX facility in Houston. We do not know about access to that; however, in the past the most impressive learning has occurred when supercomputers were "on site". Real benchmarking can only occur when a computer is used day in, day out, and all aspects of its capabilities, problems and reliability are uncovered. If it isn't practical to get an SX-3 into a major U.S. laboratory, we should consider the possibility of sending computational scientists to Japan for several months or even a year, in order to thoroughly evaluate the machine. NEC should be interested in these efforts too.

MISCELLANEOUS NEC PARALLEL PROCESSING ACTIVITIES.

Other than the SX-3 series supercomputer, NEC has been involved in at least four other parallel processing activities. These are:
   (1) FMP (Fingerprint Machine Computer) with 28 processors,
   (2) VSP-4 (Video Signal Processor) with 128 processors,
   (3) HAL (Parallel Processing Logic Simulator) with 64 processors, and
   (4) CENJU (Parallel Processing Circuit Simulator) with 64 processors.

FMP is a commercial product, and about 180 FMP systems have been shipped worldwide.
The VSP effort has influenced the NEC Visualink-1000, which is commercially available. HAL has been used in designing NEC SX-3, SX-3 Series and other general purpose computers since 1985. CENJU is an experimental machine being used for design of DRAM; Kahaner reported on it in a 2 July 1990 report, "spice".

-----------END OF PART 2-------------------------------------------------