Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!europa.asd.contel.com!gatech!hubcap!fpst
From: rick@cs.arizona.edu (Rick Schlichting)
Newsgroups: comp.parallel
Subject: Kahaner Report: JSPP '91 (Addendum)
Message-ID: <1991Jun18.194021.5218@hubcap.clemson.edu>
Date: 17 Jun 91 15:05:34 GMT
Sender: fpst@hubcap.clemson.edu (Steve Stevenson)
Organization: U of Arizona CS Dept, Tucson
Lines: 300
Approved: parallel@hubcap.clemson.edu (installment one)

 [Dr. David Kahaner is a numerical analyst visiting Japan for two years
  under the auspices of the Office of Naval Research-Asia (ONR/Asia).
  The following is the professional opinion of David Kahaner and in no
  way has the blessing of the US Government or any agency of it.  All
  information is dated and of limited life time.  This disclaimer should
  be noted on ANY attribution.]

 [Copies of previous reports written by Kahaner can be obtained from
  host cs.arizona.edu using anonymous FTP.]

To: Distribution
From: David K. Kahaner, ONR Asia [kahaner@xroads.cc.u-tokyo.ac.jp]
Re: Joint Symposium on Parallel Processing '91, Kobe Japan,
    14-16 May 1991 -- Addendum
17 June 1991

Shortly after I distributed my report on JSPP.91 I received extended
English abstracts of six papers from Kyushu University.  The titles of
these papers are given below.  [I have also included the
abstracts--rick.]

LIST OF TITLES:

 (1) KRPP: Kyushu University Reconfigurable Parallel Processor
     --- Design of Processing-Element ---
 (2) IPC: Integrated Parallelizing Compiler
     --- Network Synthesizer ---
 (3) A Parallel Rendering Machine for High Speed Ray-Tracing
     --- Instruction-Level Parallelism in Macropipeline Stages ---
 (4) A Single-Chip Vector Processor Prototype Based on Streaming/FIFO
     Architecture
     --- Evaluation of Macro Operation, Vector-Scalar Cooperation, and
         Terminating Vector Operations ---
 (5) DSNS (Dynamically hazard resolved, Statically code scheduled,
     Nonuniform Superscalar) Processor Prototype
     --- Evaluation of the Architecture and the Effect of Static Code
         Scheduling ---
 (6) Hyperscalar Processor Architecture
     --- The Fifth Approach to Instruction-Level Parallel Processing ---


A Single-Chip Vector-Processor Prototype Based on Streaming/FIFO
Architecture - Evaluation of Macro Operation, Vector-Scalar Cooperation
and Terminating Vector Operations

   Takashi Hashimoto, Keizou Okazaki, Tetsuo Hironaka, Kazuaki Murakami
     (Interdisciplinary Graduate School of Engineering Sciences,
      Kyushu University)
   Shinji Tomita (Kyoto University)
   E-mail: {hashimot,keizo,hironaka,murakami}@is.kyushu-u.ac.jp

A single-chip vector processor prototype based on the streaming/FIFO
architecture has been developed at Kyushu University.  The goal of the
streaming/FIFO architecture is to achieve flexible chaining ability and
to support vector-scalar cooperative execution by means of FIFO vector
registers.  In addition, a virtual-pipeline mechanism, which is a
derivative of pipeline-shared MIMD, is adopted to achieve high pipeline
utilization.

To increase the performance of a vector processor, the vectorization
ratio must be improved.  To this end, macro operations and DO-loops with
conditionally executed statements, which are usually difficult to
vectorize, must be vectorized.  Vector operations such as summation of
vector elements, inner product, finding maximum/minimum values, and
first-order recurrence carry data dependences, which are essentially
unsuitable for vectorization.  In conventional vector processors, these
operations are treated as macro operations with special extra hardware.
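 [A minimal C sketch of two such macro operations may make the problem
  concrete.  It is illustrative only and is not taken from the paper;
  the function and array names are invented for the example.  The
  accumulators carry a value from one iteration to the next, and that
  loop-carried dependence is what keeps an ordinary vector unit from
  issuing the iterations as independent element operations.

     /* Scalar forms of "summation of vector elements" and "inner
        product".  Both accumulators depend on the previous iteration,
        so without support such as flexible chaining or dedicated
        reduction hardware the loop cannot be split into independent
        vector element operations. */
     double sum_and_dot(const double *a, const double *b, int n,
                        double *dot_out)
     {
         double s = 0.0, dot = 0.0;
         int i;

         for (i = 0; i < n; i++) {
             s   += a[i];          /* summation of vector elements */
             dot += a[i] * b[i];   /* inner product                */
         }
         *dot_out = dot;
         return s;
     }
 ]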
On the other hand, with the streaming/FIFO architecture such operations
can be vectorized without special extra hardware, because of the
flexible chaining ability of the architecture.

DO-loops with conditionally executed statements can be classified into:
(i) completely vectorizable loops, (ii) partially vectorizable loops,
and (iii) unvectorizable loops, such as loops with jump-outs.  [Scalar
sketches of these loop classes appear after the abstracts.]  The
prototype processor handles these loops with the following methods.

(1) Completely vectorizable loops: (i) vector-mask control,
    (ii) scatter-gather, (iii) index-vector, and (iv) split-merge,
    which is a specialized form of the scatter-gather method.
(2) Partially vectorizable loops: vector-scalar cooperative execution
    is applied.  It resolves the data dependency between the vectorized
    loop and the unvectorized loop and allows the two loops to overlap.
(3) Unvectorizable loops: the vector operation is terminated when a
    condition for branching out of the loop is detected.

This paper describes these mechanisms and reports simulation results.


A Parallel Rendering Machine for High Speed Ray-Tracing
- Instruction-Level Parallelism in the Macropipeline Stages

   Seiji Murata, Oubong Gwun, Kazuaki Murakami
     (Interdisciplinary Graduate School of Engineering Sciences,
      Kyushu University)
   Shinji Tomita (Kyoto University)
   E-mail: {murata,gwun,murakami}@is.kyushu-u.ac.jp

A parallel rendering machine for high-speed ray tracing has been built
at Kyushu University.  The machine exploits parallelism at multiple
levels, as follows.

(1) Multiprocessing, by dividing image space among the processors.
(2) Macropipelining of every ray within each processor.
(3) Instruction-level parallelism, with VLIW architectures.

A ray-tracing task is usually composed of two subtasks: intersection
and shading.  The overall processing speed is usually bounded by the
speed of the intersection subtask, so it is important to speed that
subtask up.  Using 3D grid subdivision, the intersection subtask can be
divided further into two subtasks: object search and intersection
calculation.  The task of processing each ray can therefore be divided
into three subtasks: (i) object search, (ii) intersection calculation,
and (iii) shading calculation.

Ray processing is implemented as a three-stage macropipeline, each
stage corresponding to one subtask: (i) the object-search stage,
(ii) the intersection-calculation stage, and (iii) the shade-calculation
stage.  The stages are connected cyclically by FIFO buffers and are
executed concurrently.

To achieve high macropipeline utilization, it is necessary to balance
the load of every stage.  Load measurements showed that the
object-search stage and the intersection-calculation stage each carry
four times the load of the shade-calculation stage.  To balance the
object-search and intersection-calculation stages with the
shade-calculation stage, we have provided instruction-level parallel
processing for the former two stages, as follows.

(1) Object-search stage: its main job is to search for a voxel that
    contains primitives, using the 3DDDA algorithm.  The algorithm
    repeatedly updates index coefficients to step to a neighboring
    voxel and accesses memory, until primitives are found; software
    pipelining can be applied to such a loop.  There are four
    functional units: two ALUs and two FPUs.  [A sketch of this kind of
    traversal loop appears after the abstracts.]
(2) Intersection-calculation stage: the main calculation is
    three-dimensional floating-point arithmetic.
    This stage has multiple operations of similar form that can be
    processed simultaneously, so it is well suited to a VLIW
    architecture.  There are four functional units: one ALU and three
    FPUs.  [A sketch of such an intersection calculation appears after
    the abstracts.]

As a result, the performance with instruction-level parallel processing
and macropipelining is 5 to 9 times higher than with sequential
processing.
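 [The loop classes discussed in the streaming/FIFO abstract can be
  illustrated with two minimal C sketches.  These are illustrative
  only; the loop bodies, names, and the threshold parameter are
  invented for the example and are not taken from the paper.  The first
  loop contains a conditionally executed statement but no dependence
  between iterations, so it is completely vectorizable, for instance
  under vector-mask control; the second branches out of the loop, which
  the prototype handles by terminating the vector operation once the
  exit condition is detected.

     /* Completely vectorizable: the if-statement only selects which
        elements are updated, something a vector mask (or
        scatter-gather) can express. */
     void clip_negatives(double *a, int n)
     {
         int i;

         for (i = 0; i < n; i++)
             if (a[i] < 0.0)
                 a[i] = 0.0;
     }

     /* Loop with a jump-out: processing must stop at the first element
        that exceeds the threshold, so a vector version has to
        terminate the vector operation when the branch condition is
        detected. */
     int find_first_over(const double *a, int n, double threshold)
     {
         int i;

         for (i = 0; i < n; i++)
             if (a[i] > threshold)
                 return i;         /* branch out of the loop */
         return -1;                /* no such element */
     }
 ]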
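 [The object-search stage's traversal loop has roughly the following
  shape.  This is a simplified sketch of a 3DDDA-style grid walk, with
  data structures and names invented for the illustration; it is not
  the machine's code.  Each iteration performs a small index update to
  step to the neighboring voxel along the ray plus a memory access to
  test whether that voxel holds primitives, which is the repetitive
  pattern that software pipelining spreads over the stage's two ALUs
  and two FPUs.

     typedef struct {
         int nx, ny, nz;                 /* grid resolution            */
         const unsigned char *occupied;  /* 1 if voxel holds primitives */
     } Grid;

     static int voxel_occupied(const Grid *g, int x, int y, int z)
     {
         return g->occupied[(z * g->ny + y) * g->nx + x];
     }

     /* Walk the grid from voxel (x,y,z); stepx/y/z are the per-axis
        step directions of the ray, tmax[] the distances to the next
        voxel boundary on each axis, tdelta[] the boundary-to-boundary
        distances.  Returns 1 and the coordinates of the first occupied
        voxel in hit[], or 0 if the ray leaves the grid. */
     int object_search(const Grid *g, int x, int y, int z,
                       int stepx, int stepy, int stepz,
                       double tmax[3], const double tdelta[3],
                       int hit[3])
     {
         for (;;) {
             if (voxel_occupied(g, x, y, z)) {       /* memory access */
                 hit[0] = x;  hit[1] = y;  hit[2] = z;
                 return 1;
             }
             /* Index update: advance along the axis whose voxel
                boundary the ray crosses next. */
             if (tmax[0] <= tmax[1] && tmax[0] <= tmax[2]) {
                 x += stepx;  tmax[0] += tdelta[0];
             } else if (tmax[1] <= tmax[2]) {
                 y += stepy;  tmax[1] += tdelta[1];
             } else {
                 z += stepz;  tmax[2] += tdelta[2];
             }
             if (x < 0 || x >= g->nx || y < 0 || y >= g->ny ||
                 z < 0 || z >= g->nz)
                 return 0;                           /* left the grid */
         }
     }
 ]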
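 [Finally, the kind of instruction-level parallelism the
  intersection-calculation stage exploits can be seen in a ray-sphere
  test, sketched below in C.  The sphere test is a stand-in chosen for
  brevity -- the abstract does not say which primitives the machine
  intersects -- and the code is not taken from the paper.  The point is
  that the x, y and z component products and the two dot products have
  no mutual dependences, so a VLIW compiler can schedule them across
  the stage's three FPUs.

     #include <math.h>

     /* Ray-sphere intersection; the ray direction d is assumed to be a
        unit vector.  Returns 1 and the ray parameter *t of the nearest
        hit, or 0 if the ray misses.  Most of the multiplies and adds
        below are independent and can be issued in parallel. */
     int ray_sphere(const double o[3], const double d[3],
                    const double c[3], double r, double *t)
     {
         double lx = c[0] - o[0], ly = c[1] - o[1], lz = c[2] - o[2];
         double b    = lx * d[0] + ly * d[1] + lz * d[2];  /* L . D */
         double ll   = lx * lx + ly * ly + lz * lz;        /* L . L */
         double disc = b * b - (ll - r * r);

         if (disc < 0.0)
             return 0;                 /* ray misses the sphere */
         *t = b - sqrt(disc);          /* nearest intersection  */
         return (*t >= 0.0);
     }
 ]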