Path: utzoo!mnetor!tmsoft!torsqnt!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!hplabs!hpda!hpcuhb!hpcuhe!linley
From: linley@hpcuhe.cup.hp.com (Linley Gwennap)
Newsgroups: comp.arch
Subject: Re: Snake
Message-ID: <32580006@hpcuhe.cup.hp.com>
Date: 26 Mar 91 22:35:14 GMT
References: <69465@brunix.UUCP>
Organization: PA-RISC Marketing Central
Lines: 104

Due to popular demand, here is an article comparing the new Snakes CPU to IBM's "America" chip (used in the RS/6000 series). I have deleted the section on America. I would be happy to post more info if this is useful.

--Linley Gwennap
Hewlett-Packard

HP SNAKES CPU

HP's high-performance chip set consists of the "Snakes" CPU chip and a floating point coprocessor ("FPC") jointly developed with Texas Instruments[1]. These are the first chips to implement the PA-RISC 1.1 architecture. They use a traditional RISC approach to achieve industry-leading performance of 72 SPECmarks with a 66 MHz clock.

PA-RISC 1.1, an extension to the original PA-RISC architecture, includes several new instructions, many of which accelerate graphics operations[2]. A multiply-and-add instruction (as in IBM's POWER) is included. In addition, the page size was doubled to 4 KB to reduce the TLB miss rate, and eight "shadow" registers were added to provide quick context switching for the TLB miss handler.

The CPU contains all integer instruction processing, cache control and memory management functions. All cache memory is included in external SRAMs connected directly to the CPU. Snakes has a 64-bit path to the D-cache, just like the R4000. Both the I- and D-caches can be accessed simultaneously, resulting in a total cache bandwidth of 792 MB per second (peak).

The FPC implements all floating point instructions. It receives instructions and data from the caches at the same time as the CPU, and duplicates parts of the CPU's instruction pipeline, eliminating the penalties often incurred by separate CPU and FPC chips.
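
The 792 MB per second peak figure above can be checked with a little arithmetic. The sketch below is only an illustration: the 64-bit D-cache path and the 66 MHz clock come from the article, but the 32-bit I-cache path is an assumption, inferred because it is the width that makes the total come out to 792. One transfer per cycle on each path and 1 MB = 10**6 bytes are also assumed.

```python
# Peak cache bandwidth at the 66 MHz clock quoted in the article.
CLOCK_HZ = 66_000_000      # 66 MHz, from the article
DCACHE_PATH_BITS = 64      # from the article
ICACHE_PATH_BITS = 32      # ASSUMPTION: inferred from the 792 MB/s total, not stated


def peak_mb_per_s(path_bits: int, clock_hz: int) -> float:
    """Peak bandwidth in MB/s (1 MB = 10**6 bytes), assuming one transfer per cycle."""
    bytes_per_transfer = path_bits // 8
    return bytes_per_transfer * clock_hz / 1_000_000


dcache = peak_mb_per_s(DCACHE_PATH_BITS, CLOCK_HZ)
icache = peak_mb_per_s(ICACHE_PATH_BITS, CLOCK_HZ)
total = dcache + icache

print(f"D-cache {dcache:.0f} MB/s + I-cache {icache:.0f} MB/s = {total:.0f} MB/s")
```

Under these assumptions the two paths contribute 528 and 264 MB per second, which together match the quoted peak.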
Snakes is designed to work with a variety of memory and I/O interfaces.

The CPU uses a five-stage pipeline to reduce cycle time. The penalties in this pipeline have been minimized. For example, conditional branches are executed with no delay if their outcome is predicted correctly, and with only a single cycle penalty otherwise. The branch prediction algorithm, more advanced than America's, predicts forward branches to be untaken and backward branches taken, thus optimizing for loops. The load penalty is a maximum of one cycle and the store penalty a maximum of two; these penalties can usually be avoided by the compiler. All other integer instructions (except a few rare system control functions) are always executed in a single cycle. This uncomplicated design is reflected in a simple, efficient compiler.

Although Snakes is not superscalar, PA-RISC instructions such as ADD AND BRANCH, MOVE AND BRANCH and COMPARE AND BRANCH allow a degree of parallelism similar to America's for integer-only applications; in fact, the ratio of Integer SPECmarks to MHz for Snakes (65/66) actually exceeds America's (35/42).

FPC is a full 64-bit implementation. It contains two parallel execution units: the ALU (addition, conversion) and the MPY unit (multiply, divide, square root). Each unit can start a new operation on every other cycle, so FPC can accept one floating point instruction per cycle provided that ALU and MPY instructions are alternated.

The external caches are direct mapped and are protected by parity, making them slightly less robust than America's ECC cache. Cache coherency flags are included to facilitate multiprocessor operation. A write-back protocol is used to reduce writes to main memory. Although Snakes does not implement America's complex "critical word first" algorithm on cache misses, it will begin processing as soon as the critical word is obtained, reducing the miss penalty by as much as seven cycles.
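
The static branch prediction rule described for the pipeline above (forward branches predicted untaken, backward branches predicted taken, one cycle lost on a misprediction) can be sketched as follows. This is an illustrative model, not HP's implementation: the function names and the example addresses are hypothetical, and "backward" is assumed to mean simply that the target address is lower than the branch address.

```python
def predict_taken(branch_pc: int, target_pc: int) -> bool:
    """Static rule from the article: backward branches taken, forward untaken.

    ASSUMPTION: "backward" means target_pc < branch_pc.
    """
    return target_pc < branch_pc


def branch_penalty_cycles(branch_pc: int, target_pc: int, actually_taken: bool) -> int:
    """0 cycles if the prediction was right, 1 cycle otherwise (per the article)."""
    return 0 if predict_taken(branch_pc, target_pc) == actually_taken else 1


def loop_penalty(iterations: int) -> int:
    """Total branch penalty for a loop closed by one backward conditional branch."""
    branch_pc, target_pc = 0x1040, 0x1000  # hypothetical addresses
    total = 0
    for i in range(iterations):
        # The loop branch is taken on every pass except the final exit.
        taken = i < iterations - 1
        total += branch_penalty_cycles(branch_pc, target_pc, taken)
    return total


print(loop_penalty(100))
```

In this model a loop pays the one-cycle penalty only once, on exit, which is the sense in which the rule optimizes for loops.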
Snakes supports a wide variety of off-the-shelf SRAMs and can be configured with anywhere from 8 KB to 3 MB of external cache. At its maximum operating frequency of 66 MHz, it requires 12 ns SRAMs.

The I- and D-TLBs are fully associative and contain 96 entries each. In addition, each TLB implements four variable size "block" entries capable of mapping up to 16 MB each, which can be used for large portions of the operating system and/or graphics frame buffers. The memory system supports 48 bits (256 terabytes) of virtual address space and 32 bits (4 gigabytes) of real address space. (This is a subset of the full 64-bit virtual space allowed by PA-RISC.) Two addressing modes support 1 GB or 4 GB data segments, significantly larger than America's segments.

A separate bus provides access to memory, I/O and, if desired, graphics. This bus is a synchronous, dedicated interface with a peak transfer rate of 264 MB per second, about one-half the speed of America's memory system. The bus bandwidth is limited by its width of 32 bits, but a wider bus would have required a larger, more expensive package. Snakes' cache miss penalty, measured in cycles, is much higher than America's, due to the shorter clock cycle time. Snakes compensates for these penalties by allowing for large external caches to reduce the miss rate; the performance numbers for Snakes assume a 128 KB instruction cache and 256 KB data cache.

The CPU is fabricated in HP's CMOS-26 process (a 1.0 micron, three metal layer process) and packaged in a 408-pin PGA. FPC is fabricated in TI's 0.8 micron CMOS process and placed in a 207-pin PGA. These PGAs were custom-designed to allow high frequency operation with wide CMOS buses. The CPU contains about 577,000 transistors, while FPC uses 640,000.

For lower-cost systems, the chip set is designed to run at frequencies below 66 MHz, allowing lower-speed SRAMs to be used. FPC can also be eliminated to further reduce costs.
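
The address space and bus figures quoted above follow from simple arithmetic, sketched here as a sanity check. The bit widths and the 66 MHz clock come from the article. That the bus makes one 32-bit transfer per 66 MHz cycle is an assumption, chosen because it reproduces the quoted 264 MB per second. Binary units are assumed for address space and 10**6 bytes per MB for bandwidth.

```python
TB = 1 << 40   # binary terabyte
GB = 1 << 30   # binary gigabyte


def address_space_bytes(bits: int) -> int:
    """Number of bytes addressable with the given number of address bits."""
    return 1 << bits


virtual_tb = address_space_bytes(48) // TB   # virtual space, in terabytes
real_gb = address_space_bytes(32) // GB      # real space, in gigabytes

BUS_WIDTH_BITS = 32           # from the article
CPU_CLOCK_HZ = 66_000_000     # 66 MHz, from the article
# ASSUMPTION: one bus transfer per CPU clock cycle.
bus_mb_per_s = (BUS_WIDTH_BITS // 8) * CPU_CLOCK_HZ // 1_000_000

# Clock period at 66 MHz, for comparison with the 12 ns SRAM requirement.
cycle_ns = 1e9 / CPU_CLOCK_HZ

print(f"virtual: {virtual_tb} TB, real: {real_gb} GB")
print(f"bus peak: {bus_mb_per_s} MB/s, clock period: {cycle_ns:.2f} ns")
```

The clock period of roughly 15 ns gives some context for the 12 ns SRAM requirement: a cache access has to complete within a single cycle with a few nanoseconds to spare.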

REFERENCES AND NOTES

[1] "CMOS PA-RISC Processor for a New Family of Workstations" by M. Forsyth, S. Mangelsdorf, E. DeLano, C. Gleason and J. Yetter, COMPCON Spring 91 Digest of Technical Papers, February 1991.

[2] "Architecture and Compiler Enhancements for PA-RISC Workstations" by D. Odnert, R. Hansen, M. Dadoo and M. Laventhal, COMPCON Spring 91 Digest of Technical Papers, February 1991.