Path: utzoo!utgpu!water!watmath!clyde!rutgers!cmcl2!nrl-cmf!ames!sdcsvax!ucsdhub!hp-sdd!hplabs!decwrl!sun!pitstop!sundc!seismo!uunet!steinmetz!davidsen
From: davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr)
Newsgroups: comp.arch
Subject: Re: Performance increase - a suggestion
Keywords: bandwidth datapath 128
Message-ID: <8844@steinmetz.steinmetz.UUCP>
Date: 15 Jan 88 17:19:18 GMT
References: <235@unicom.UUCP>
Reply-To: davidsen@crdos1.UUCP (bill davidsen)
Organization: General Electric CRD, Schenectady, NY
Lines: 45

In article <235@unicom.UUCP> physh@unicom.UUCP writes:
...
| It seems to me that one way to get a further increase in
| performance without increasing clock speeds and thus memory and other
| chip costs, would be to leave the processor at 32 bits and increase the
| external data path width to say 128 bits. Then it may be possible to
| get at least one instruction per fetch. Since most computers fetch
| (mostly) one instruction per operation this would seem to translate
| into a 3 times plus increase in performance. (I'm probably wrong here,
| but in which direction I have no idea.)

There are other ways to get a bandwidth increase. One is to go to a
Harvard architecture, with the data and code using separate busses. The
68030 has gone part way on this, with an internal Harvard architecture
on chip, separate memory caches for each internal bus, and a
multiplexed external bus. I believe that Intel will separate the code,
data, and i/o busses in the 80486, but that's based on rumor.

Sequent reduces bus usage by having a local cache and delayed write
through. When a value is modified in the cache, it is *not* written to
memory. Only when the value is flushed from cache, or when another
processor reads the value, is the modified value placed on the bus.
When another processor reads the value, the processor cache which
contains the most recently modified version of the data places it on
the bus, and the other processor *and the memory* are updated.

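To make the delayed write-through idea concrete, here is a toy model in
Python. It is only a sketch of the behavior described above, not Sequent's
actual protocol: the names Bus and Cache, the one-word cache lines, and the
transaction counter are all my own assumptions. It also makes no attempt to
arbitrate between several modified copies of the same address.

```python
class Bus:
    """Hypothetical shared bus; counts every transaction that reaches it."""

    def __init__(self):
        self.memory = {}        # address -> value in main memory
        self.caches = []        # every cache attached to this bus
        self.transactions = 0   # bus cycles actually used

    def read(self, requester, addr):
        self.transactions += 1
        # A cache holding a modified copy supplies the data,
        # and main memory is updated in the same transaction.
        for cache in self.caches:
            if cache is not requester and cache.dirty.get(addr):
                value = cache.lines[addr]
                self.memory[addr] = value
                cache.dirty[addr] = False
                return value
        return self.memory.get(addr, 0)

    def write_back(self, addr, value):
        self.transactions += 1
        self.memory[addr] = value


class Cache:
    """Hypothetical per-processor cache with one-word lines."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}   # address -> cached value
        self.dirty = {}   # address -> True if modified locally
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:          # miss: go to the bus
            self.lines[addr] = self.bus.read(self, addr)
            self.dirty[addr] = False
        return self.lines[addr]

    def write(self, addr, value):
        # The write stays in the cache; nothing is placed on the bus.
        self.lines[addr] = value
        self.dirty[addr] = True

    def flush(self, addr):
        # Only a modified value costs a bus transaction when flushed.
        if self.dirty.get(addr):
            self.bus.write_back(addr, self.lines[addr])
        self.lines.pop(addr, None)
        self.dirty.pop(addr, None)


if __name__ == "__main__":
    bus = Bus()
    a, b = Cache(bus), Cache(bus)
    bus.memory[0x10] = 1
    a.read(0x10)
    a.write(0x10, 2)
    a.write(0x10, 3)            # repeated writes cost no bus cycles
    print(b.read(0x10), bus.memory[0x10], bus.transactions)
```

In this model the two writes by processor A use no bus cycles at all; the
bus is touched only for A's initial read and for B's later read, at which
point memory is brought up to date as well.
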
Obviously there is more to this, to handle the case where (1) processor
A reads data from memory, (2) processor B reads data from memory, (3)
processor A updates the cached value, and (4) processor B updates the
cached value. Now when processor C reads the value, both A and B will
try to supply it. I assume that this works, since they have 30
processors running on the bus, but I don't know how.

The disadvantage of wider external busses is cost. With a wider bus you
have more support logic, and that adds to the cost. Since many data
accesses are made on bytes, the extra bus bandwidth doesn't buy you as
much as it could, although cache improves this. Since most CPUs already
have a pipeline, the wider bandwidth reduces the number of bus cycles,
but doesn't improve the rate at which the CPU can execute.
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me