Path: utzoo!utgpu!!!uunet!!gatech!hubcap!fpst
From: (Rick Schlichting)
Newsgroups: comp.parallel
Subject: Kahaner Report: JSPP '91 (Addendum)
Message-ID: <>
Date: 19 Jun 91 11:55:09 GMT
Sender: (Steve Stevenson)
Organization: Clemson University
Lines: 173
Approved:

(installment 2)

DSNS Processor Prototype - Evaluation of the Architecture and the Effect of Static Code Schedule

Akira Noudomi, Morihiro Kuga, Kazuaki Murakami (Interdisciplinary Graduate School of Engineering Sciences, Kyushu University)
Tetsuya Hara (Mitsubishi Electric Co.)
Shinji Tomita (Kyoto University)
E-mail: {noudomi,kuga,murakami}

DSNS is a superscalar processor architecture with the following features.
(1) Hazards such as data dependencies and functional-unit conflicts are resolved dynamically.
(2) A superscalar processor needs code scheduling to obtain much parallelism. In the DSNS architecture, code scheduling is not supported dynamically, but statically by the compiler.
(3) DSNS has nonuniform functional units.

A prototype processor based on the DSNS architecture and an optimizing compiler, called the DSNS compiler, are being built at Kyushu University. The DSNS compiler performs static code scheduling, both global and local, to make the most of the DSNS processor's capacity for parallel execution. Local code scheduling is based on list scheduling that takes the functional units into account: their types, numbers, and execution latencies. For global code scheduling, the DSNS compiler employs percolation scheduling for global code motion, and loop restructuring techniques such as loop unrolling and software pipelining for loop-structured programs.

This paper outlines the DSNS architecture and the static code scheduling supported by the DSNS compiler, and reports simulation results on the performance of the architecture and the scheduling.
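The local scheduling described above (list scheduling that accounts for functional-unit types, numbers, and execution latencies) can be illustrated with a minimal Python sketch. This is not the DSNS compiler's algorithm. The function name, the instruction table, the unit counts, and the use of program order as the priority are all assumptions made for illustration. The sketch also assumes fully pipelined units (each unit accepts one new instruction per cycle), latencies of at least one cycle, and an acyclic dependence graph.

```python
def list_schedule(instrs, units):
    """Cycle-by-cycle list scheduling (illustrative sketch only).

    instrs: name -> (unit_type, latency, [predecessor names]), in priority order.
    units:  unit_type -> number of units of that type.
    Returns name -> issue cycle.
    """
    for name, (unit, _latency, _preds) in instrs.items():
        if units.get(unit, 0) < 1:
            raise ValueError(f"no functional unit of type {unit!r} for {name!r}")

    issue = {}    # name -> cycle in which the instruction is issued
    finish = {}   # name -> cycle in which its result becomes available
    remaining = list(instrs)
    cycle = 0
    while remaining:
        free = dict(units)  # units of each type still free in this cycle
        for name in list(remaining):
            unit, latency, preds = instrs[name]
            # Ready when every predecessor's result is available by this cycle.
            ready = all(p in finish and finish[p] <= cycle for p in preds)
            if ready and free[unit] > 0:
                free[unit] -= 1
                issue[name] = cycle
                finish[name] = cycle + latency
                remaining.remove(name)
        cycle += 1
    return issue


# Hypothetical instruction stream: two loads feeding ALU operations and a store.
instrs = {
    "load_a": ("mem", 2, []),
    "load_b": ("mem", 2, []),
    "add":    ("alu", 1, ["load_a", "load_b"]),
    "mul":    ("alu", 1, ["load_a"]),
    "store":  ("mem", 1, ["add"]),
}
units = {"mem": 1, "alu": 2}  # nonuniform mix: one memory unit, two ALUs
schedule = list_schedule(instrs, units)
```

With a single memory unit the two loads cannot issue together, so "mul" issues ahead of "add" even though it comes later in program order. A production scheduler would normally rank ready instructions by a priority such as critical-path length rather than by program order.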
From this simulation, the paper confirms two facts about the relation between the degree of superscalar (i.e., the degree of instruction supply) and the effect of static code scheduling. First, a high degree of superscalar is useless without advanced static code scheduling. Second, the effect of static code scheduling is clear at a high degree of superscalar. From these facts the paper confirms that the degree of superscalar of the prototype processor, which is four, is reasonable. The effectiveness of local code scheduling and of loop restructuring techniques such as loop unrolling and software pipelining is also confirmed.

Hyperscalar Processor Architecture - The Fifth Approach to Instruction-Level Parallel Processing

Kazuaki Murakami (Interdisciplinary Graduate School of Engineering Sciences, Kyushu University)
E-mail:

A new processor architecture, called hyperscalar processor architecture, is proposed and discussed. Hyperscalar processor architecture encompasses superscalar, VLIW, and vector architectures. It has the following major features:
(1) the instruction size and the instruction-fetch bandwidth are the same as those of superscalar processors,
(2) a VLIW instruction can be self-created by loading several short instructions into instruction registers, each associated with a functional unit,
(3) the self-created VLIW program can take the form of either vectorized loops or software-pipelined loops.
This paper presents the principles of operation and examples of vectorized loops and software-pipelined loops.

An Approach to Realizing a Reconfigurable Interconnection Network Using Field Programmable Gate Arrays

Toshinori Sueyoshi, Itsujiro Arita (Kyushu Institute of Technology)
Kouhei Hano (Kyocera Inc.)
E-mail:

We present a new reconfigurable interconnection network that exploits the reconfigurability of FPGAs (Field Programmable Gate Arrays), a kind of programmable logic LSI. The proposed network is reconfigured for the desired connections by programming configuration data into each FPGA, so that it can directly implement, without simulation, both static networks such as mesh and hypercube networks and dynamic networks such as baseline and omega networks. Consequently, the optimum connections for the interprocess communication or memory reference patterns of application programs running on the reconfigurable multiprocessor can be configured adaptively by programming.

IPC: Integrated Parallelizing Compiler - Network Synthesizer

Hiroki Akaboshi, Kazuaki Murakami, Akira Fukuda (Interdisciplinary Graduate School of Engineering Sciences, Kyushu University)
Shinji Tomita (Kyoto University)
E-mail: {akaboshi,murakami,fukuda}

The authors have been developing the `Integrated Parallelizing Compiler (IPC) System.' IPC has the following features: (i) multilingualism, (ii) retargetability (or multitargetability), and (iii) multilevel parallelism. One of the target machines of IPC is KRPP (Kyushu University Reconfigurable Parallel Processor). KRPP is an MIMD parallel processor that consists of 128 PEs (Processing Elements) connected by a reconfigurable 128*128 crossbar network. The crossbar network provides an operation mode, called preset mode, in which inter-PE communication topologies can be programmed into it. The program for inter-PE communications is a sequence of inter-PE switching patterns. To map inter-task communication topologies into a program for inter-PE communications, network synthesis must be done at compile time. Thus IPC has a network synthesizer.
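To make "a sequence of inter-PE switching patterns" concrete, the following minimal Python sketch partitions a set of inter-PE transfers into patterns that a crossbar could realize one at a time. It is not the IPC network synthesizer. The function name and the transfer list are hypothetical, and the sketch assumes that within one pattern each PE may send at most once and receive at most once. The first-fit strategy used here is a deliberately simple stand-in and does not guarantee the minimum number of patterns.

```python
def greedy_switching_patterns(transfers):
    """Partition (src, dst) transfers into crossbar switching patterns.

    Assumption: within one pattern each PE sends at most once and receives
    at most once, so a single crossbar setting can carry the whole pattern.
    First-fit greedy; the number of patterns is not guaranteed to be minimal.
    """
    patterns = []  # each pattern is a dict mapping src PE -> dst PE
    for src, dst in transfers:
        for pattern in patterns:
            # Reuse an existing pattern if both the sender and the receiver are free in it.
            if src not in pattern and dst not in pattern.values():
                pattern[src] = dst
                break
        else:
            # No existing pattern can carry this transfer; open a new one.
            patterns.append({src: dst})
    return patterns


# Hypothetical communication set among three PEs.
transfers = [(0, 1), (0, 2), (1, 2), (2, 0), (1, 0)]
program = greedy_switching_patterns(transfers)
```

Here the five transfers fit into two patterns, which the crossbar would step through in sequence. The order of the patterns is left as produced; scheduling them against task timing is a separate problem.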
The network synthesizer comprises four basic steps, as follows:
(1) Pattern Generation: Generate a set of switching patterns that satisfy certain conditions, using an algorithm that finds a minimal edge coloring of a bipartite graph.
(2) Sequence Generation: Generate a sequence of switching patterns by scheduling inter-PE communications onto the switching patterns. This scheduling algorithm is based on list scheduling.
(3) Split Communication Gantt Chart (CGC): Split the CGC to optimize individual switching patterns.
(4) Back End: Output the network control information, stored as a program for inter-PE communications.
This paper presents a network synthesis algorithm that generates a set of switching patterns and a sequence of switching patterns for reconfigurable and programmable network architectures.

KRPP: Kyushu University Reconfigurable Parallel Processor

Naoya Tokunaga, Shinichiro Mori, Kazuaki Murakami, Akira Fukuda (Interdisciplinary Graduate School of Engineering Sciences, Kyushu University)
Tomoo Ueno (Kyushu Nippon Electric Co.)
Eiji Iwata (Sony Co.)
Koji Kai (Matsushita Electric Ind. Co.)
Shinji Tomita (Kyoto University)
E-mail: {tokunaga,mori,murakami,fukuda}

The Kyushu University Reconfigurable Parallel Processor (KRPP) has been developed at Kyushu University. KRPP is an MIMD-type multiprocessor that consists of 128 Processing Elements (PEs) fully connected by a 128*128 crossbar network. Each PE consists of four components: Processor Unit (PU), Memory Unit (MU), Message Communication Unit (MCU), and Back-Plane Bus Interface (BPBIF). To construct a high-performance multiprocessor system and to offer an experimental parallel processing environment, KRPP employs the following reconfigurable architectures.

(1) Reconfigurable network architecture: The reconfigurable network should be able to accommodate arbitrary topologies for inter-PE (i.e., processor-memory and/or processor-processor) paths. KRPP adopts a crossbar network for this purpose.
In addition, KRPP equips the crossbar network with control memories to program its topologies, as the IBM GF11 does.

(2) Reconfigurable memory architecture: There are two paradigms of multiprocessor configuration, the shared-memory TCMP (Tightly Coupled MultiProcessor) and the message-passing LCMP (Loosely Coupled MultiProcessor). KRPP provides both of these paradigms effectively, although it is built as a distributed-memory organization.

In addition to this reconfigurability of network and memory, KRPP provides high-level hardware support for both caching and inter-PE communication, as follows:

(3) Cache architecture: To reduce effective access time and network traffic, KRPP provides each PE with a private virtual-address cache. KRPP allows a cache to hold copies of data from remote memory as well as from local memory. To ensure cache coherence, KRPP provides the following four cache coherence schemes: (i) cacheability marking scheme, (ii) fast selective invalidation scheme, (iii) distributed limited-directory scheme, and (iv) dual-directory cache scheme.

(4) Communication architecture: To realize both TCMP and LCMP, it is necessary to offer two different communication methods, shared-memory access and message passing. To implement these on the single crossbar network, a hierarchical communication protocol is established. With this protocol, the MCU offers both communication methods.

This paper is a status report on the implementation of the real machine. It first presents an overview of KRPP, then describes the design philosophy of KRPP, and finally gives more detailed specifications of the hardware components of a PE.

--
=========================== MODERATOR ==============================
Steve Stevenson {steve,fpst}
Department of Computer Science, comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell