Path: utzoo!utgpu!water!watmath!clyde!rutgers!cmcl2!nrl-cmf!ames!sdcsvax!ucsdhub!hp-sdd!hplabs!decwrl!sun!pitstop!sundc!seismo!uunet!steinmetz!davidsen
From: davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr)
Newsgroups: comp.arch
Subject: Re: Performance increase - a suggestion
Keywords: bandwidth datapath 128
Message-ID: <8844@steinmetz.steinmetz.UUCP>
Date: 15 Jan 88 17:19:18 GMT
References: <235@unicom.UUCP>
Reply-To: davidsen@crdos1.UUCP (bill davidsen)
Organization: General Electric CRD, Schenectady, NY
Lines: 45

In article <235@unicom.UUCP> physh@unicom.UUCP writes:
...
| 	It seems to me that one way to get a further increase in
| performance without increasing clock speeds and thus memory and other
| chip costs, would be to leave the processor at 32 bits and increase the
| external data path width to say 128 bits.  Then it may be possible to
| get at least one instruction per fetch.  Since most computers fetch
| (mostly) one instruction per operation this would seem to translate
| into a 3 times plus increase in performance. (I'm probably wrong here,
| but in which direction I have no idea.)

There are other ways to get a bandwidth increase. One is to go to a
Harvard architecture, with the data and code using separate busses. The
68030 has gone part way toward this, with an internal Harvard
architecture on chip, separate caches for each internal bus, and a
multiplexed external bus.
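A rough way to see why separate busses help is to count bus cycles for
the same instruction stream both ways. This is a simplified sketch, not
a model of the 68030: it assumes every instruction needs one fetch,
each listed data access takes one bus cycle, and the CPU itself is never
the bottleneck. The function names are made up for illustration.

```python
def bus_cycles_shared(data_accesses):
    """Single shared bus (von Neumann): each instruction fetch and each
    data access needs its own bus cycle, so they serialize."""
    return sum(1 + d for d in data_accesses)


def bus_cycles_split(data_accesses):
    """Separate code and data busses (Harvard): an instruction fetch can
    overlap a data access, so each instruction costs max(1, d) cycles."""
    return sum(max(1, d) for d in data_accesses)


# One entry per instruction: how many data accesses it makes.
trace = [0, 1, 0, 1]
print(bus_cycles_shared(trace), bus_cycles_split(trace))
```

With half the instructions touching data, the shared bus needs 6 cycles
and the split busses 4. Once the external bus is multiplexed again, as on
the 68030, cache misses from both sides compete for it as in the shared case.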

I believe that Intel will separate the code, data, and i/o busses in the
80486, but that's based on rumor.

Sequent reduces bus usage by giving each processor a local cache with
delayed write-through (usually called write-back or copy-back). When a
value is modified in the cache, it is *not* written to
memory. Only when the value is flushed from cache, or when another
processor reads the value, is the modified value placed on the bus. When
another processor reads the value, the processor cache which contains
the most recently modified version of the data places it on the bus, and
the other processor *and the memory* are updated.
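The saving from delaying writes shows up clearly when one processor
updates the same word many times before anyone else looks at it. The
sketch below is a hypothetical model, not Sequent's hardware: it assumes
one-word cache lines (so a write miss needs no fetch), a single word at
address 0, and counts one bus transaction per miss, per write-through
store, and for the other processor's final read.

```python
class SnoopingCache:
    """One processor's cache with one-word lines (simplifying assumption)."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.lines = {}        # addr -> [value, dirty]
        self.bus_cycles = 0    # bus transactions this cache initiated

    def read(self, memory, addr):
        if addr not in self.lines:
            self.bus_cycles += 1                  # miss: fetch over the bus
            self.lines[addr] = [memory[addr], False]
        return self.lines[addr][0]

    def write(self, memory, addr, value):
        if self.write_back:
            self.lines[addr] = [value, True]      # delayed: memory left stale
        else:
            self.lines[addr] = [value, False]
            memory[addr] = value
            self.bus_cycles += 1                  # every store uses the bus

    def snoop_read(self, memory, addr):
        """Another processor reads addr. A dirty copy is supplied from this
        cache and memory is updated at the same time."""
        line = self.lines.get(addr)
        if line is not None and line[1]:
            memory[addr] = line[0]
            line[1] = False
            return line[0]
        return None                               # memory supplies the data


def run(write_back, increments=100):
    memory = {0: 0}
    cpu_a = SnoopingCache(write_back)
    for _ in range(increments):
        cpu_a.write(memory, 0, cpu_a.read(memory, 0) + 1)
    # Processor B reads address 0: one more bus transaction either way.
    supplied = cpu_a.snoop_read(memory, 0)
    value_b_sees = supplied if supplied is not None else memory[0]
    return cpu_a.bus_cycles + 1, value_b_sees, memory[0]


print(run(write_back=False))
print(run(write_back=True))
```

Both runs leave B and memory holding 100, but write-through spends 102
bus transactions to get there while the delayed write spends 2.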

Obviously there is more to this, to handle the case where (1) processor
A reads data from memory, (2) processor B reads data from memory, (3)
processor A updates the cached value, and (4) processor B updates the
cached value. Now when processor C reads the value, both A and B will
try to supply it. I assume that this works, since they have 30
processors running on the bus, but I don't know how.
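One common way snooping caches avoid the A/B/C problem is an
invalidation protocol, where a cache must own a line before modifying
it. Whether Sequent does exactly this isn't stated here; the sketch
below is a generic MSI-style scheme with one-word lines, and all class
and function names are hypothetical. Under it, step (3) invalidates B's
copy, so step (4) becomes a miss that first picks up A's value, and only
one cache holds a modified copy when C reads.

```python
class Bus:
    """Shared bus connecting main memory and every processor's cache."""
    def __init__(self, memory):
        self.memory = memory
        self.caches = []


class Cache:
    """A line is 'M' (modified, sole owner), 'S' (shared, clean),
    or absent (invalid). One-word lines assumed."""

    def __init__(self, bus):
        self.bus = bus
        self.state = {}   # addr -> 'M' or 'S'
        self.data = {}    # addr -> cached value
        bus.caches.append(self)

    def _others(self):
        return [c for c in self.bus.caches if c is not self]

    def read(self, addr):
        if addr in self.state:
            return self.data[addr]                     # hit
        owners = [c for c in self._others() if c.state.get(addr) == 'M']
        assert len(owners) <= 1                        # never two modified copies
        if owners:
            owner = owners[0]
            self.bus.memory[addr] = owner.data[addr]   # owner supplies; memory updated
            owner.state[addr] = 'S'
        self.data[addr] = self.bus.memory[addr]
        self.state[addr] = 'S'
        return self.data[addr]

    def write(self, addr, value):
        if self.state.get(addr) != 'M':
            # Gain ownership: flush any modified copy, invalidate all others.
            for other in self._others():
                if other.state.get(addr) == 'M':
                    self.bus.memory[addr] = other.data[addr]
                other.state.pop(addr, None)
                other.data.pop(addr, None)
        self.state[addr] = 'M'
        self.data[addr] = value


def scenario():
    memory = {0: 0}
    bus = Bus(memory)
    a, b, c = Cache(bus), Cache(bus), Cache(bus)
    a.read(0)                       # (1) A reads
    b.read(0)                       # (2) B reads
    a.write(0, a.read(0) + 1)       # (3) A updates; B's copy is invalidated
    b.write(0, b.read(0) + 1)       # (4) B misses, gets A's value, updates
    owners = sum(1 for x in bus.caches if x.state.get(0) == 'M')
    return owners, c.read(0), memory[0]


print(scenario())
```

Only B owns the line before C's read, C sees 2, and memory is updated to
2 as B supplies it; neither update is lost.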

The disadvantage of wider external busses is cost. With a wider bus you
have more support logic, and that adds to the cost. Since many data
accesses are made on bytes, the extra bus bandwidth doesn't buy you as
much as it could, although cache improves this. Since most CPUs already
have a pipeline, the wider path reduces the number of bus cycles, but
doesn't by itself improve the rate at which the CPU can execute.
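The byte-access point can be put in numbers. A minimal sketch, assuming
aligned accesses, one bus transfer per cycle, and that every access is a
separate bus transaction; the workload mix is invented for illustration.

```python
def bus_cycles(nbytes, bus_width_bits):
    """Cycles to move nbytes of aligned data, one full-width transfer
    per cycle; a partial transfer still costs a whole cycle."""
    width = bus_width_bits // 8
    return -(-nbytes // width)          # ceiling division


def workload_cycles(accesses, bus_width_bits):
    return sum(bus_cycles(n, bus_width_bits) for n in accesses)


# Four 16-byte instruction/cache-line fetches plus twenty byte loads.
mix = [16] * 4 + [1] * 20
print(workload_cycles(mix, 32), workload_cycles(mix, 128))
```

The 16-byte fetches drop from 4 cycles to 1, but each byte load still
costs a cycle, so this mix goes from 36 to 24 bus cycles: a factor of
1.5 rather than 4, and that is bus cycles, not instructions executed.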
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me