Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!zaphod.mps.ohio-state.edu!brutus.cs.uiuc.edu!ux1.cso.uiuc.edu!ux1.cso.uiuc.edu!aglew
From: aglew@oberon.csg.uiuc.edu (Andy Glew)
Newsgroups: comp.arch
Subject: QCDPAX (repost from comp.sys.super)
Message-ID:
Date: 14 Jun 90 16:44:39 GMT
Sender: usenet@ux1.cso.uiuc.edu (News)
Distribution: comp
Organization: University of Illinois, Computer Systems Group
Lines: 111

Path: ux1.cso.uiuc.edu!brutus.cs.uiuc.edu!apple!sun-barr!ccut!titcca!etlcom!gama!oyanagi
From: oyanagi@gama.is.tsukuba.ac.jp (Yoshio Oyanagi)
Newsgroups: comp.sys.super
Subject: QCDPAX attained 12.25 GFLOPS peak speed.
Message-ID: <5074@gama.is.tsukuba.ac.jp>
Date: 31 May 90 06:41:07 GMT
Reply-To: oyanagi@gama.is.tsukuba.JUNET (Yoshio Oyanagi)
Organization: Info Sci & Elec, Univ of Tsukuba, Tsukuba-City, Ibaraki 305, JAPAN
Lines: 99

===QCDPAX attained 12.25 GFLOPS peak speed===

The parallel computer QCDPAX has reached what is probably the world's fastest effective speed in scientific calculations. If any computer can exceed the speed of QCDPAX, please let us know.

QCDPAX was made public on April 6, 1990, at the University of Tsukuba, Tsukuba Science City, Japan. QCDPAX is a parallel computer with 432 PUs (Processing Units). Each PU runs at a peak speed of 28.7 MFLOPS, and the system at about 12.38 GFLOPS peak. The 12.25 GFLOPS figure was measured for the summation of squares of 500,000 elements within each PU.

The machine is a torus-shaped PU array (a 2-D nearest-neighbor mesh with end-around connections), enhanced by a global hardware barrier synchronizer, wide (32-bit) nearest-neighbor links, broadcast from any PU to all PUs, and feedback of the logical AND of the status registers of all PUs to all PUs.

T. Shirakawa et al., "QCDPAX -- An MIMD array of vector processors for the numerical simulation of quantum chromodynamics," in Proceedings of Supercomputing '89, Nov. 13-17, 1989, Reno, Nevada, pp. 495-504.

T. Hoshino, "PAX Computer: High-Speed Parallel Processing and Scientific Computing," Addison-Wesley, 1989.

Each PU is a single-board vector processor. It employs an M68020 (25 MHz) as the CPU; an L64133 (a 60 ns scalar floating-point processor with ALU and multiplier, from LSI Logic Corp., actually run with a 69.8 ns clock); a 50K-gate ASIC controller for the L64133 (also LSI Logic's); 2 MB of SRAM for vector data storage (35 ns, Japanese); and 4 MB of DRAM for program and archival data storage (100 ns, Japanese).

QCDPAX was designed by us at the University of Tsukuba and manufactured by Anritsu Corporation. The project is funded by the Ministry of Education, Science and Culture of the Japanese Government under the Grant-in-Aid for Specially Promoted Research (#62060001). QCDPAX is dedicated to Quantum Chromodynamics simulation (lattice gauge theory), as the budget required, though its functions are not restricted to that purpose. It is a direct extension of the past four prototype PAX machines, in the sense that QCDPAX is of wide use in scientific applications.

The machine was benchmarked with the QCD model. In the most time-consuming part, the 3 by 3 unitary matrix product, QCDPAX with 432 PUs recorded a speed nearly 4 times as fast as that of the CM-2 (the CM-2 measurement was reported in the Supercomputing '89 proceedings by C. F. Baillie, pp. 2-9). The single-link update time for the subspace heat bath method with 8 hits is 1.8 microseconds, three times faster than the HITAC S820/80 at KEK (peak 3 GFLOPS). (A rough C sketch of the matrix-product kernel appears near the end of this post.)

The benchmark persistently used throughout past PAX development is the Poisson equation solved by the Red-Black point-SOR method. This is a typical but quite communication-intensive scientific calculation. We believe that a parallel computer that cannot handle this point-SOR well is of no use in scientific applications. The following is the measurement for the biggest size that QCDPAX can solve.

Definition: a 3-D Poisson equation in a pillar-shaped region of size 408 (in X), 414 (in Y), and 408 (in Z). The mesh spacing is 1. Periodic boundary conditions are imposed in the X and Y directions, and a Dirichlet boundary condition of zero in the Z direction. Two point sources of intensity +1 and -1 are located at (102, 103, 102) and (306, 311, 306), respectively.

Measurement: a single update sweep of all points (both red and black) took 175 msec, which is equivalent to 8.99 MFLOPS/PU and 3.88 GFLOPS for the system. The nearest-neighbor communication of the boundary points between a PU and its 4 neighboring PUs took 158 msec. The efficiency, defined as (update)/(update + communication), is 52.46%. The overall effective speed is 2.04 GFLOPS.

The program was coded in the compiler language "psc". Communication was performed by calling a function coded in assembly language. The M68020's cache was disabled. Time was measured by the hardware timer installed in each PU, and the MFLOPS values were obtained from the total number of +, -, and * operations divided by the measured time.
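To make the benchmark concrete, here is a rough serial C sketch of one Red-Black point-SOR sweep for the problem defined above. It is illustrative only: the grid is shrunk to 8 x 8 x 8 (the real problem is 408 x 414 x 408 spread across 432 PUs), the over-relaxation factor OMEGA is a placeholder, the source locations are stand-ins, and the inter-PU boundary exchange (assembly-coded on QCDPAX, as noted above) is omitted; the actual program was written in psc, not C.

/* One Red-Black point-SOR sweep for the 3-D Poisson equation,
 * with periodic boundaries in X and Y and Dirichlet u = 0 at the
 * Z faces, as in the benchmark above.  Serial, single-PU sketch. */
#include <stdio.h>

#define NX 8
#define NY 8
#define NZ 8
#define OMEGA 1.5               /* placeholder over-relaxation factor */

static double u[NX][NY][NZ];    /* potential   */
static double f[NX][NY][NZ];    /* source term */

/* Update all points of one color; color is 0 (red) or 1 (black). */
static void sor_sweep(int color)
{
    for (int i = 0; i < NX; i++) {
        int ip = (i + 1) % NX, im = (i + NX - 1) % NX;      /* periodic X */
        for (int j = 0; j < NY; j++) {
            int jp = (j + 1) % NY, jm = (j + NY - 1) % NY;  /* periodic Y */
            for (int k = 1; k < NZ - 1; k++) {              /* u = 0 at Z faces */
                if (((i + j + k) & 1) != color)
                    continue;
                double nb = u[ip][j][k] + u[im][j][k]
                          + u[i][jp][k] + u[i][jm][k]
                          + u[i][j][k+1] + u[i][j][k-1];
                double gs = (nb - f[i][j][k]) / 6.0;        /* Gauss-Seidel value */
                u[i][j][k] += OMEGA * (gs - u[i][j][k]);    /* SOR update */
            }
        }
    }
}

int main(void)
{
    f[2][2][2] = +1.0;          /* stand-in point sources */
    f[5][5][5] = -1.0;
    for (int iter = 0; iter < 100; iter++) {
        sor_sweep(0);           /* red points   */
        sor_sweep(1);           /* black points */
    }
    printf("u at first source: %g\n", u[2][2][2]);
    return 0;
}

Note that every red point depends only on black points and vice versa, so all PUs can update their subgrids in parallel and exchange only the subgrid boundaries between the two half-sweeps; that exchange is the 158 msec communication term in the efficiency figure above.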
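Likewise, here is a rough C sketch of the 3 by 3 complex (unitary) matrix product that dominates the QCD benchmark mentioned above. The data layout and loop order are illustrative guesses, not the actual psc/assembly coding on QCDPAX; the point is the arithmetic density of the kernel: 27 complex multiplications and 18 complex additions, i.e. 198 real floating-point operations, against only 18 complex elements of input and 9 of output.

#include <stdio.h>

typedef struct { double re, im; } cplx;

/* c = a * b for 3 x 3 complex matrices: 27 complex multiplies and
 * 18 complex additions, i.e. 198 real floating-point operations. */
static void mat3_mult(cplx c[3][3], cplx a[3][3], cplx b[3][3])
{
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            double re = 0.0, im = 0.0;
            for (int k = 0; k < 3; k++) {
                re += a[i][k].re * b[k][j].re - a[i][k].im * b[k][j].im;
                im += a[i][k].re * b[k][j].im + a[i][k].im * b[k][j].re;
            }
            c[i][j].re = re;
            c[i][j].im = im;
        }
    }
}

int main(void)
{
    static cplx a[3][3], b[3][3], c[3][3];  /* zero-initialized */
    for (int i = 0; i < 3; i++) {
        a[i][i].re = 1.0;                   /* a = identity     */
        b[i][i].im = 1.0;                   /* b = i * identity */
    }
    mat3_mult(c, a, b);                     /* expect c = i * identity */
    printf("c[0][0] = %g + %gi\n", c[0][0].re, c[0][0].im);
    return 0;
}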
If any computer can exceed this speed, please let us know. We would like to know whether our machine is really the fastest in the world or not.

T. Hoshino   (hoshino@qcdpax.kz.tsukuba.ac.jp)
Y. Iwasaki   (iwasaki@quark.ph.tsukuba.ac.jp)
Y. Oyanagi   (oyanagi@gama.is.tsukuba.ac.jp)
T. Shirakawa (shirakaw@qcdpax.kz.tsukuba.ac.jp)
K. Kanaya    (kanaya@quark.ph.tsukuba.ac.jp)
T. Yoshie    (yoshie@quark.ph.tsukuba.ac.jp)
S. Ichii     (ichii@kek.ac.jp)
T. Kawai     (kawai@kz.phys.keio.ac.jp)
--
Andy Glew, aglew@uiuc.edu