Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!europa.asd.contel.com!gatech!hubcap!fpst
From: rick@cs.arizona.edu (Rick Schlichting)
Newsgroups: comp.parallel
Subject: Kahaner Report: JSPP '91 (Addendum)
Message-ID: <1991Jun19.115509.6512@hubcap.clemson.edu>
Date: 19 Jun 91 11:55:09 GMT
Sender: fpst@hubcap.clemson.edu (Steve Stevenson)
Organization: Clemson University
Lines: 173
Approved: parallel@hubcap.clemson.edu

(installment 2)

DSNS Processor Prototype - Evaluation of the Architecture and the
Effect of Static Code Schedule
  Akira Noudomi, Morihiro Kuga, Kazuaki Murakami
    (Interdisciplinary Graduate School of Engineering Sciences,
     Kyushu University)
  Tetsuya Hara (Mitsubishi Electric Co.)
  Shinji Tomita (Kyoto University)
  E-mail: {noudomi,kuga,murakami}@is.kyushu-u.ac.jp

DSNS is a superscalar processor architecture with the following
features: (1) hazards such as data dependencies and functional-unit
conflicts are resolved dynamically; (2) a superscalar processor needs
code scheduling to extract much parallelism, and in the DSNS
architecture code scheduling is performed not dynamically but
statically, by the compiler; (3) DSNS has nonuniform functional units.

A prototype processor based on the DSNS architecture and an optimizing
compiler, called the DSNS compiler, are being built at Kyushu
University.  The DSNS compiler performs static code scheduling,
supporting both global and local scheduling to make the most of the
DSNS processor's capacity for parallel execution.  Local code
scheduling is based on list scheduling that takes the functional units
into account: their types, numbers, and execution latencies.  For
global code scheduling, the DSNS compiler employs percolation
scheduling for global code motion, and loop restructurings such as
loop unrolling and software pipelining for loop-structured programs.
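[As an illustration of the list scheduling the abstract refers to, here
is a minimal sketch; the operation names, latencies, and unit counts
are invented for illustration, and units are assumed fully pipelined
(one issue per unit per cycle), which need not match the DSNS design.]

```python
from collections import defaultdict

# Hypothetical data-flow graph: op -> (unit type, latency, predecessors).
OPS = {
    "ld1": ("mem", 2, []),
    "ld2": ("mem", 2, []),
    "mul": ("fpu", 3, ["ld1", "ld2"]),
    "add": ("alu", 1, ["mul"]),
    "st":  ("mem", 2, ["add"]),
}
UNITS = {"mem": 1, "fpu": 1, "alu": 1}   # functional units per type

def critical_path(ops):
    """Priority of an op = longest latency path from it to the DAG end."""
    memo = {}
    def cp(op):
        if op not in memo:
            succs = [s for s in ops if op in ops[s][2]]
            memo[op] = ops[op][1] + max((cp(s) for s in succs), default=0)
        return memo[op]
    return {op: cp(op) for op in ops}

def list_schedule(ops, units):
    """Each cycle, issue ready ops (highest priority first) to free
    units of the matching type; a result appears after the op's latency."""
    prio = critical_path(ops)
    finish = {}                    # op -> cycle its result is available
    schedule = defaultdict(list)   # cycle -> ops issued that cycle
    cycle = 0
    while len(finish) < len(ops):
        free = dict(units)
        ready = [o for o in ops if o not in finish and
                 all(p in finish and finish[p] <= cycle
                     for p in ops[o][2])]
        for op in sorted(ready, key=prio.get, reverse=True):
            typ, lat, _ = ops[op]
            if free.get(typ, 0) > 0:
                free[typ] -= 1
                finish[op] = cycle + lat
                schedule[cycle].append(op)
        cycle += 1
    return dict(schedule)
```

[With only one memory unit the two loads are serialized: the sketch
issues ld1 at cycle 0, ld2 at cycle 1, mul at cycle 3, add at cycle 6,
and st at cycle 7, showing how unit types and latencies shape the
schedule.]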
This paper outlines the DSNS architecture and the static code
scheduling supported by the DSNS compiler, and reports simulation
results on the performance of both.  The simulation confirms two facts
about the relation between the degree of superscalar (i.e., the
instruction-supply width) and the effect of static code scheduling:
first, a high degree of superscalar is useless without advanced static
code scheduling; second, the effect of static code scheduling is most
pronounced at a high degree of superscalar.  From these facts the
paper concludes that the degree of superscalar chosen for the
prototype processor, four, is reasonable.  The effectiveness of local
code scheduling and of loop-restructuring techniques such as loop
unrolling and software pipelining is also confirmed.

Hyperscalar Processor Architecture - The Fifth Approach to
Instruction-Level Parallel Processing
  Kazuaki Murakami
    (Interdisciplinary Graduate School of Engineering Sciences,
     Kyushu University)
  E-mail: murakami@is.kyushu-u.ac.jp

A new processor architecture, called the hyperscalar processor
architecture, is proposed and discussed.  Hyperscalar processor
architecture encompasses superscalar, VLIW, and vector architectures,
and has the following major features: (1) the instruction size and the
instruction-fetch bandwidth are the same as those of superscalar
processors; (2) a VLIW instruction can be self-created by loading
several short instructions into instruction registers, each associated
with a functional unit; (3) the self-created VLIW program can take the
form of either vectorized loops or software-pipelined loops.  This
paper presents the principles of operation and examples of vectorized
loops and software-pipelined loops.
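[Feature (2), the self-created VLIW instruction, can be pictured with a
toy model: one instruction register per functional unit is loaded with
a short instruction, and the assembled word is then issued once per
iteration of a vectorized loop.  Everything below -- the class name,
the unit names, the callable "short instructions" -- is invented for
illustration and is not the paper's notation.]

```python
class HyperscalarToy:
    """Toy model of a self-created VLIW word: one instruction register
    (IR) per functional unit; short instructions are loaded first, and
    the assembled word is then issued once per loop iteration."""

    def __init__(self, unit_names):
        self.iregs = {u: None for u in unit_names}

    def load(self, unit, short_insn):
        """Load one short instruction into the unit's IR."""
        self.iregs[unit] = short_insn

    def run_loop(self, n, state):
        """Issue the self-created VLIW word for n iterations."""
        for i in range(n):
            for insn in self.iregs.values():   # units fire together
                if insn is not None:
                    insn(i, state)
        return state

# Self-create a VLIW word computing c[i] = a[i] + b[i] on the ALU and
# d[i] = a[i] * b[i] on the FPU, then run it as a vectorized loop.
cpu = HyperscalarToy(["alu", "fpu"])
cpu.load("alu", lambda i, s: s["c"].__setitem__(i, s["a"][i] + s["b"][i]))
cpu.load("fpu", lambda i, s: s["d"].__setitem__(i, s["a"][i] * s["b"][i]))
state = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [0, 0, 0], "d": [0, 0, 0]}
cpu.run_loop(3, state)
```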
An Approach to Realizing a Reconfigurable Interconnection Network
Using Field Programmable Gate Arrays
  Toshinori Sueyoshi, Itsujiro Arita (Kyushu Institute of Technology)
  Kouhei Hano (Kyocera Inc.)
  E-mail: sueyoshi@ai.kyutech.ac.jp

We present a new reconfigurable interconnection network that exploits
the reconfigurability of the FPGA (Field Programmable Gate Array), a
kind of programmable-logic LSI.  The network is reconfigured into the
desired connections by programming configuration data into each FPGA,
so that it can directly implement, without simulation, both static
networks such as mesh and hypercube networks and dynamic networks such
as baseline and omega networks.  Consequently, the optimum connections
for the interprocess-communication or memory-reference patterns of
application programs running on the reconfigurable multiprocessor can
be configured adaptively by programming.

IPC: Integrated Parallelizing Compiler - Network Synthesizer
  Hiroki Akaboshi, Kazuaki Murakami, Akira Fukuda
    (Interdisciplinary Graduate School of Engineering Sciences,
     Kyushu University)
  Shinji Tomita (Kyoto University)
  E-mail: {akaboshi,murakami,fukuda}@is.kyushu-u.ac.jp

The authors have been developing the `Integrated Parallelizing
Compiler (IPC) System,' which has the following features: (i)
multilingualism, (ii) retargetability (or multitargetability), and
(iii) multilevel parallelism.  One of IPC's target machines is KRPP
(Kyushu University Reconfigurable Parallel Processor), an MIMD
parallel processor consisting of 128 PEs (Processing Elements)
connected by a reconfigurable 128*128 crossbar network.  The crossbar
network provides an operation mode, called preset mode, for
programming inter-PE communication topologies; such a program is a
sequence of inter-PE switching patterns.
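[A crossbar switching pattern connects each source PE to at most one
destination and vice versa, so a pattern is a matching in the
bipartite graph of required communications, and splitting the
communications into the fewest patterns is exactly the minimal
edge-coloring problem used in the Network Synthesizer's
pattern-generation step.  The first-fit sketch below illustrates the
idea only; unlike a minimal bipartite edge coloring, which needs
exactly as many patterns as the maximum PE degree, first-fit is not
guaranteed optimal, and it is not the paper's algorithm.]

```python
def pattern_generation(comms):
    """Greedily partition (src, dst) communications into switching
    patterns.  Within a pattern no source or destination repeats, so
    each pattern is a partial permutation directly realizable on the
    crossbar in one step.  First-fit illustration only."""
    patterns = []
    for src, dst in comms:
        for pat in patterns:                     # first pattern that fits
            if all(s != src and d != dst for s, d in pat):
                pat.append((src, dst))
                break
        else:                                    # conflicts everywhere
            patterns.append([(src, dst)])
    return patterns
```

[A cyclic shift among four PEs fits in a single pattern, while two
messages leaving the same PE necessarily need two patterns.]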
To map inter-task communication topologies onto a program for inter-PE
communications, network synthesis must be done at compile time, so IPC
includes a network synthesizer.  The Network Synthesizer comprises
four basic steps: (1) Pattern Generation: generate a set of switching
patterns that satisfy certain conditions, using an algorithm that
finds a minimal edge coloring of a bipartite graph.  (2) Sequence
Generation: generate a sequence of switching patterns by scheduling
the inter-PE communications onto the switching patterns; this
scheduling algorithm is based on list scheduling.  (3) Communication
Gantt Chart (CGC) Splitting: split the CGC to optimize individual
switching patterns.  (4) Back End: output the network control
information, stored as a program for inter-PE communications.  This
paper presents a network synthesis algorithm that generates a set of
switching patterns and a sequence of switching patterns for
reconfigurable, programmable network architectures.

KRPP: Kyushu University Reconfigurable Parallel Processor
  Naoya Tokunaga, Shinichiro Mori, Kazuaki Murakami, Akira Fukuda
    (Interdisciplinary Graduate School of Engineering Sciences,
     Kyushu University)
  Tomoo Ueno (Kyushu Nippon Electric Co.)
  Eiji Iwata (Sony Co.)
  Koji Kai (Matsushita Electric Ind. Co.)
  Shinji Tomita (Kyoto University)
  E-mail: {tokunaga,mori,murakami,fukuda}@is.kyushu-u.ac.jp

The Kyushu University Reconfigurable Parallel Processor (KRPP) has
been developed at Kyushu University.  KRPP is an MIMD-type
multiprocessor consisting of 128 Processing Elements (PEs) fully
connected by a 128*128 crossbar network.  Each PE consists of four
components: a Processor Unit (PU), a Memory Unit (MU), a Message
Communication Unit (MCU), and a Back-Plane Bus Interface (BPBIF).  To
construct a high-performance multiprocessor system and to offer an
experimental parallel-processing environment, KRPP employs the
following reconfigurable architectures.
(1) Reconfigurable network architecture: the network must be able to
accommodate arbitrary topologies for inter-PE (i.e., processor-memory
and/or processor-processor) paths.  KRPP adopts a crossbar network for
this purpose and, as the IBM GF11 does, equips it with control
memories for programming its topologies.

(2) Reconfigurable memory architecture: there are two paradigms of
multiprocessor configuration, the shared-memory TCMP (Tightly Coupled
MultiProcessor) and the message-passing LCMP (Loosely Coupled
MultiProcessor).  KRPP supports both paradigms effectively, although
it is built with a distributed-memory organization.

In addition to this reconfigurability of network and memory, KRPP
provides high-level hardware support for both caching and inter-PE
communication, as follows.

(3) Cache architecture: to reduce effective access time and network
traffic, KRPP provides each PE with a private virtual-address cache,
and allows a cache to hold copies of data from remote memory as well
as from local memory.  To ensure cache coherence, KRPP provides four
coherence schemes: (i) a cacheability-marking scheme, (ii) a fast
selective-invalidation scheme, (iii) a distributed limited-directory
scheme, and (iv) a dual-directory cache scheme.

(4) Communication architecture: to realize both TCMP and LCMP, two
different communication methods must be offered, shared-memory access
and message passing.  To implement both on the single crossbar
network, a hierarchical communication protocol is established; with
this protocol, the MCU offers both methods.

This paper is a status report on the real-machine implementation.  It
first presents an overview of KRPP, then describes KRPP's design
philosophy, and finally gives more detailed specifications of the PE's
hardware components.
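[Scheme (iii), the distributed limited-directory scheme, can be
sketched generically: the home node of each block keeps only k sharer
pointers, trading full sharing information for constant directory
size, and evicts a copy when the pointers overflow.  The model below
is a generic limited-directory sketch under invented names, not KRPP's
actual protocol.]

```python
class LimitedDirectory:
    """Toy limited-directory coherence model: at most k sharer
    pointers per block; on pointer overflow one existing copy is
    invalidated to free a pointer (an eviction-style policy)."""

    def __init__(self, k):
        self.k = k
        self.sharers = {}        # block -> list of PE ids holding a copy
        self.invalidations = 0   # invalidation messages sent

    def read(self, pe, block):
        """Record pe as a sharer, evicting one copy if k is exceeded."""
        s = self.sharers.setdefault(block, [])
        if pe in s:
            return
        if len(s) == self.k:     # pointer overflow: evict oldest sharer
            s.pop(0)
            self.invalidations += 1
        s.append(pe)

    def write(self, pe, block):
        """Invalidate all other copies, leaving pe as the sole owner."""
        s = self.sharers.setdefault(block, [])
        self.invalidations += sum(1 for p in s if p != pe)
        self.sharers[block] = [pe]
```

[With k = 2, a third reader forces one invalidation even though no
write occurred; that extra traffic is the price a limited directory
pays for its bounded size.]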
--
=========================== MODERATOR ==============================
Steve Stevenson                        {steve,fpst}@hubcap.clemson.edu
Department of Computer Science, comp.parallel
Clemson University, Clemson, SC 29634-1906        (803)656-5880.mabell