Path: utzoo!utgpu!water!watmath!clyde!rutgers!cmcl2!husc6!yale!mfci!root
From: root@mfci.UUCP (SuperUser)
Newsgroups: comp.arch
Subject: Re: Performance increase - a suggestion
Message-ID: <222@m2.mfci.UUCP>
Date: 20 Jan 88 16:25:57 GMT
References: <235@unicom.UUCP> <28200088@ccvaxa>
Reply-To: colwell@m6.UUCP (Robert Colwell)
Organization: Multiflow Computer Inc., Branford, CT. 06405
Lines: 154

In article <28200088@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>>/* Written  2:43 pm  Jan 17, 1988 by tim@amdcad.AMD.COM in ccvaxa:comp.arch */
>>In article <221@imagine.PAWL.RPI.EDU> userfe0e@mts.rpi.edu (George Kyriazis) writes:
>>|   The only problem when doing that is jump instructions.  Assume that memory
>>| operates at its fastest possible speed.  If you meet a jump instruction
>>| in the middle of the 128-bit word, you'll have to (more or less) execute
>>| all the rest up till the end of the fetch.  Some RISC CPU's have done this
>>| but for only one instruction.  Can a compiler succesfully put 3 useful
>>| instructions after the jump??  Maybe it sounds too cheap: "After any jump
>>| the CPU executes at most three instructions after it". (Actually it turns
>>| out that the jump has to be in the first of the 4 instructions in the 128-bit
>>| word, so as the memory can get the right address.) 
>>
>>It is very hard to effectively schedule more than 1 instruction after a
>>delayed-branch, and even more prohibitive to force branches to occur
>>only at 4-instruction boundaries.  A solution to this problem is to
>>throw away the instructions coming from memory, sourcing them instead
>>from a cache while the instruction stream is restarted.  This is what
>>the Branch-Target Cache is for on the Am29000.
>>
>>	-- Tim Olson
>
>Another approach is to execute all of the instructions come what may.
>This is the VLIW, trace-scheduling, approach commercialized by Multiflow.
>
>
>
>Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
>    aglew@gould.com     	- preferred, if you have nameserver
>    aglew@gswd-vms.gould.com    - if you don't
>    aglew@gswd-vms.arpa 	- if you use DoD hosttable
>    aglew%mycroft@gswd-vms.arpa - domains are supposed to make things easier?
>   
>My opinions are my own, and are not the opinions of my employer, or any
>other organisation. I indicate my company only so that the reader may
>account for any possible bias I may have towards our products.


Right, and moreover, Multiflow's Trace can have as many as 4
conditional branches plus a fall-through in a given instruction
(prioritized on a per-instruction basis using a simple but clever
arbitration scheme).  While we do have a delayed branch (2 clock beats
per instruction, and one set of branches per instruction, so at least
one beat of real work always gets executed), the compiler's forte is
finding useful things to do and scheduling them wherever they fit
best.  We also use four 32-bit 65 ns buses at full tilt to fill the
icache (the same buses are used for memory transfers, or CPU
cluster-to-cluster transfers), another recent topic of interest here.
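
To make the multiway branch and the bus numbers concrete, here is a rough
sketch in Python.  It is illustrative only: the slot layout, the numeric
priority encoding, and the names next_pc and fill_bandwidth_mb_per_s are
assumptions made for this example, not a description of the actual Trace
arbitration logic.  The bandwidth figure further assumes each bus moves one
32-bit word every 65 ns.

```python
from typing import List, Tuple

# One branch slot: (priority, test_is_true, target).
# Lower number = higher priority.  This encoding is an assumption made
# for illustration; the real arbitration scheme is not described here.
Branch = Tuple[int, bool, str]

MAX_BRANCHES = 4  # "as many as 4 conditional branches" per instruction


def next_pc(branches: List[Branch], fall_through: str) -> str:
    """Pick the target of the highest-priority branch whose test is true,
    or the fall-through address when no branch is taken."""
    if len(branches) > MAX_BRANCHES:
        raise ValueError("at most 4 conditional branches per instruction")
    taken = [b for b in branches if b[1]]
    if not taken:
        return fall_through
    return min(taken, key=lambda b: b[0])[2]


def fill_bandwidth_mb_per_s(buses: int = 4,
                            width_bits: int = 32,
                            cycle_ns: float = 65.0) -> float:
    """Peak fill rate in MB/s (10**6 bytes), assuming every bus delivers
    one full-width word per cycle."""
    bytes_per_cycle = buses * width_bits // 8
    return bytes_per_cycle / (cycle_ns * 1e-9) / 1e6
```

With two tests true in the same instruction, next_pc returns the target of
the one with the better priority; with none true it returns the
fall-through.  Under the stated assumption the four buses together move
16 bytes per 65 ns, which is roughly 246 MB/s peak.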

Also, while I'm at it, we provide a way for programmers to indicate
the likelihood of a branch going one way or the other if they happen
to know (people who are porting their code from a Cray usually do).
The keyword isn't "Frequency" but the intent is the same.  On the other
hand, we have found that our compiler does a fine job of predicting
branch probabilities without user input and also without having to
profile the code.  We initially thought profiling would be very
important, but (at least for branch prediction) it hasn't turned out
that way.
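
As a rough illustration of how a programmer hint and a compile-time guess
can coexist, here is a small Python sketch.  It is not our compiler's
algorithm: the "backward branches are probably loop back-edges" rule is a
commonly used static heuristic chosen for the example, and the function
name predict_taken, the user_hint parameter, and the 0.9 / 0.5 figures are
all placeholders assumed here.

```python
from typing import Optional


def predict_taken(branch_addr: int,
                  target_addr: int,
                  user_hint: Optional[float] = None) -> float:
    """Estimate the probability that a branch is taken, with no profile.

    A programmer-supplied hint wins when present.  Otherwise fall back
    to a static guess: a branch to an earlier (or the same) address is
    treated as a loop back-edge and assumed to be usually taken.
    The 0.9 and 0.5 values are placeholders, not measured data.
    """
    if user_hint is not None:
        if not 0.0 <= user_hint <= 1.0:
            raise ValueError("hint must be a probability in [0, 1]")
        return user_hint
    if target_addr <= branch_addr:
        return 0.9  # backward branch: assume it closes a loop
    return 0.5      # forward branch: no static information either way
```

A scheduler could use the returned estimate to decide which path through
the code to treat as the most likely one when picking what to schedule
first.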

    -Bob Colwell
     Multiflow Computer
     175 N. Main St.
     Branford, CT 06405  203-488-6090

<Standard disclaimer:  I speak only for me>