Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!europa.asd.contel.com!gatech!hubcap!fpst
From: rick@cs.arizona.edu (Rick Schlichting)
Newsgroups: comp.parallel
Subject: Kahaner Report: JSPP '91 (Addendum)
Message-ID: <1991Jun18.194021.5218@hubcap.clemson.edu>
Date: 17 Jun 91 15:05:34 GMT
Sender: fpst@hubcap.clemson.edu (Steve Stevenson)
Organization: U of Arizona CS Dept, Tucson
Lines: 300
Approved: parallel@hubcap.clemson.edu (installment one)

 [Dr. David Kahaner is a numerical analyst visiting Japan for two years
  under the auspices of the Office of Naval Research-Asia (ONR/Asia).
  The following is the professional opinion of David Kahaner and in no
  way has the blessing of the US Government or any agency of it.  All
  information is dated and of limited life time.  This disclaimer should
  be noted on ANY attribution.]

 [Copies of previous reports written by Kahaner can be obtained from
  host cs.arizona.edu using anonymous FTP.]

To: Distribution
From: David K. Kahaner, ONR Asia [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Joint Symposium on Parallel Processing '91, Kobe Japan,
    14-16 May 1991 -- Addendum
17 June 1991

Shortly after I distributed my report on JSPP.91 I received extended
English abstracts of six papers from Kyushu University.  The titles of
these papers are given below.  [I have also included the
abstracts--rick.]

LIST OF TITLES:

 (1) KRPP: Kyushu University Reconfigurable Parallel Processor
     --- Design of Processing-Element ---
 (2) IPC: Integrated Parallelizing Compiler
     --- Network Synthesizer ---
 (3) A Parallel Rendering Machine for High Speed Ray-Tracing
     --- Instruction-Level Parallelism in Macropipeline Stages ---
 (4) A Single-Chip Vector Processor Prototype Based on Streaming/FIFO
     Architecture
     --- Evaluation of Macro Operation, Vector-Scalar Cooperation, and
         Terminating Vector Operations ---
 (5) DSNS (Dynamically hazard resolved, Statically code scheduled,
     Nonuniform Superscalar) Processor Prototype
     --- Evaluation of the Architecture and the Effect of Static Code
         Scheduling ---
 (6) Hyperscalar Processor Architecture
     --- The Fifth Approach to Instruction-Level Parallel Processing ---


A Single-Chip Vector-Processor Prototype Based on Streaming/FIFO
Architecture - Evaluation of Macro Operation, Vector-Scalar Cooperation
and Terminating Vector Operations

   Takashi Hashimoto, Keizou Okazaki, Tetsuo Hironaka, Kazuaki Murakami
     (Interdisciplinary Graduate School of Engineering Sciences,
      Kyushu University)
   Shinji Tomita (Kyoto University)
   E-mail: {hashimot,keizo,hironaka,murakami}@is.kyushu-u.ac.jp

A single-chip vector processor prototype based on the streaming/FIFO
architecture has been developed at Kyushu University.  The goal of the
streaming/FIFO architecture is to achieve flexible chaining ability and
to support vector-scalar cooperative execution by means of FIFO vector
registers.  In addition, a virtual-pipeline mechanism, which is a
derivative of pipeline-shared MIMD, is adopted to achieve high pipeline
utilization.

To increase the performance of a vector processor, the vectorization
ratio must be improved.  To this end, macro operations and DO-loops with
conditionally executed statements, which are usually difficult to
vectorize, must be vectorized.  Vector operations such as summation of
vector elements, inner product, finding maximum/minimum values, and
first-order recurrence carry data dependences, which are essentially
unsuitable for vectorization.  In conventional vector processors, these
operations are treated as macro operations with special extra hardware.
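 [A minimal C sketch of two such macro operations may make the problem
  concrete.  It is illustrative only and is not taken from the paper;
  the function and array names are invented for the example.  The
  accumulators carry a value from one iteration to the next, and that
  loop-carried dependence is what keeps an ordinary vector unit from
  issuing the iterations as independent element operations.

     /* Scalar forms of "summation of vector elements" and "inner
        product".  Both accumulators depend on the previous iteration,
        so without support such as flexible chaining or dedicated
        reduction hardware the loop cannot be split into independent
        vector element operations. */
     double sum_and_dot(const double *a, const double *b, int n,
                        double *dot_out)
     {
         double s = 0.0, dot = 0.0;
         int i;

         for (i = 0; i < n; i++) {
             s   += a[i];          /* summation of vector elements */
             dot += a[i] * b[i];   /* inner product                */
         }
         *dot_out = dot;
         return s;
     }
 ]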
On the other hand, with the streaming/FIFO architecture such operations
can be vectorized without special extra hardware, because of the
flexible chaining ability of the architecture.

DO-loops with conditionally executed statements can be classified into:
(i) completely vectorizable loops, (ii) partially vectorizable loops,
and (iii) unvectorizable loops, such as loops with jump-outs.  [Scalar
sketches of these loop classes appear after the abstracts.]  The
prototype processor handles these loops with the following methods.

(1) Completely vectorizable loops: (i) vector-mask control,
    (ii) scatter-gather, (iii) index-vector, and (iv) split-merge,
    which is a specialized form of the scatter-gather method.
(2) Partially vectorizable loops: vector-scalar cooperative execution
    is applied.  It resolves the data dependency between the vectorized
    loop and the unvectorized loop and allows the two loops to overlap.
(3) Unvectorizable loops: the vector operation is terminated when a
    condition for branching out of the loop is detected.

This paper describes these mechanisms and reports simulation results.


A Parallel Rendering Machine for High Speed Ray-Tracing
- Instruction-Level Parallelism in the Macropipeline Stages

   Seiji Murata, Oubong Gwun, Kazuaki Murakami
     (Interdisciplinary Graduate School of Engineering Sciences,
      Kyushu University)
   Shinji Tomita (Kyoto University)
   E-mail: {murata,gwun,murakami}@is.kyushu-u.ac.jp

A parallel rendering machine for high-speed ray tracing has been built
at Kyushu University.  The machine exploits parallelism at multiple
levels, as follows.

(1) Multiprocessing, by dividing image space among the processors.
(2) Macropipelining of every ray within each processor.
(3) Instruction-level parallelism, with VLIW architectures.

A ray-tracing task is usually composed of two subtasks: intersection
and shading.  The overall processing speed is usually bounded by the
speed of the intersection subtask, so it is important to speed that
subtask up.  Using 3D grid subdivision, the intersection subtask can be
divided further into two subtasks: object search and intersection
calculation.  The task of processing each ray can therefore be divided
into three subtasks: (i) object search, (ii) intersection calculation,
and (iii) shading calculation.

Ray processing is implemented as a three-stage macropipeline, each
stage corresponding to one subtask: (i) the object-search stage,
(ii) the intersection-calculation stage, and (iii) the shade-calculation
stage.  The stages are connected cyclically by FIFO buffers and are
executed concurrently.

To achieve high macropipeline utilization, it is necessary to balance
the load of every stage.  Load measurements showed that the
object-search stage and the intersection-calculation stage each carry
four times the load of the shade-calculation stage.  To balance the
object-search and intersection-calculation stages with the
shade-calculation stage, we have provided instruction-level parallel
processing for the former two stages, as follows.

(1) Object-search stage: its main job is to search for a voxel that
    contains primitives, using the 3DDDA algorithm.  The algorithm
    repeatedly updates index coefficients to step to a neighboring
    voxel and accesses memory, until primitives are found; software
    pipelining can be applied to such a loop.  There are four
    functional units: two ALUs and two FPUs.  [A sketch of this kind of
    traversal loop appears after the abstracts.]
(2) Intersection-calculation stage: the main calculation is
    three-dimensional floating-point arithmetic.
    This stage has multiple operations of similar form that can be
    processed simultaneously, so it is well suited to a VLIW
    architecture.  There are four functional units: one ALU and three
    FPUs.  [A sketch of such an intersection calculation appears after
    the abstracts.]

As a result, the performance with instruction-level parallel processing
and macropipelining is 5 to 9 times higher than with sequential
processing.
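 [The loop classes discussed in the streaming/FIFO abstract can be
  illustrated with two minimal C sketches.  These are illustrative
  only; the loop bodies, names, and the threshold parameter are
  invented for the example and are not taken from the paper.  The first
  loop contains a conditionally executed statement but no dependence
  between iterations, so it is completely vectorizable, for instance
  under vector-mask control; the second branches out of the loop, which
  the prototype handles by terminating the vector operation once the
  exit condition is detected.

     /* Completely vectorizable: the if-statement only selects which
        elements are updated, something a vector mask (or
        scatter-gather) can express. */
     void clip_negatives(double *a, int n)
     {
         int i;

         for (i = 0; i < n; i++)
             if (a[i] < 0.0)
                 a[i] = 0.0;
     }

     /* Loop with a jump-out: processing must stop at the first element
        that exceeds the threshold, so a vector version has to
        terminate the vector operation when the branch condition is
        detected. */
     int find_first_over(const double *a, int n, double threshold)
     {
         int i;

         for (i = 0; i < n; i++)
             if (a[i] > threshold)
                 return i;         /* branch out of the loop */
         return -1;                /* no such element */
     }
 ]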
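 [The object-search stage's traversal loop has roughly the following
  shape.  This is a simplified sketch of a 3DDDA-style grid walk, with
  data structures and names invented for the illustration; it is not
  the machine's code.  Each iteration performs a small index update to
  step to the neighboring voxel along the ray plus a memory access to
  test whether that voxel holds primitives, which is the repetitive
  pattern that software pipelining spreads over the stage's two ALUs
  and two FPUs.

     typedef struct {
         int nx, ny, nz;                 /* grid resolution            */
         const unsigned char *occupied;  /* 1 if voxel holds primitives */
     } Grid;

     static int voxel_occupied(const Grid *g, int x, int y, int z)
     {
         return g->occupied[(z * g->ny + y) * g->nx + x];
     }

     /* Walk the grid from voxel (x,y,z); stepx/y/z are the per-axis
        step directions of the ray, tmax[] the distances to the next
        voxel boundary on each axis, tdelta[] the boundary-to-boundary
        distances.  Returns 1 and the coordinates of the first occupied
        voxel in hit[], or 0 if the ray leaves the grid. */
     int object_search(const Grid *g, int x, int y, int z,
                       int stepx, int stepy, int stepz,
                       double tmax[3], const double tdelta[3],
                       int hit[3])
     {
         for (;;) {
             if (voxel_occupied(g, x, y, z)) {       /* memory access */
                 hit[0] = x;  hit[1] = y;  hit[2] = z;
                 return 1;
             }
             /* Index update: advance along the axis whose voxel
                boundary the ray crosses next. */
             if (tmax[0] <= tmax[1] && tmax[0] <= tmax[2]) {
                 x += stepx;  tmax[0] += tdelta[0];
             } else if (tmax[1] <= tmax[2]) {
                 y += stepy;  tmax[1] += tdelta[1];
             } else {
                 z += stepz;  tmax[2] += tdelta[2];
             }
             if (x < 0 || x >= g->nx || y < 0 || y >= g->ny ||
                 z < 0 || z >= g->nz)
                 return 0;                           /* left the grid */
         }
     }
 ]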
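 [Finally, the kind of instruction-level parallelism the
  intersection-calculation stage exploits can be seen in a ray-sphere
  test, sketched below in C.  The sphere test is a stand-in chosen for
  brevity -- the abstract does not say which primitives the machine
  intersects -- and the code is not taken from the paper.  The point is
  that the x, y and z component products and the two dot products have
  no mutual dependences, so a VLIW compiler can schedule them across
  the stage's three FPUs.

     #include <math.h>

     /* Ray-sphere intersection; the ray direction d is assumed to be a
        unit vector.  Returns 1 and the ray parameter *t of the nearest
        hit, or 0 if the ray misses.  Most of the multiplies and adds
        below are independent and can be issued in parallel. */
     int ray_sphere(const double o[3], const double d[3],
                    const double c[3], double r, double *t)
     {
         double lx = c[0] - o[0], ly = c[1] - o[1], lz = c[2] - o[2];
         double b    = lx * d[0] + ly * d[1] + lz * d[2];  /* L . D */
         double ll   = lx * lx + ly * ly + lz * lz;        /* L . L */
         double disc = b * b - (ll - r * r);

         if (disc < 0.0)
             return 0;                 /* ray misses the sphere */
         *t = b - sqrt(disc);          /* nearest intersection  */
         return (*t >= 0.0);
     }
 ]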