Path: utzoo!utgpu!water!watmath!clyde!rutgers!cmcl2!husc6!yale!mfci!root
From: root@mfci.UUCP (SuperUser)
Newsgroups: comp.arch
Subject: Re: Performance increase - a suggestion
Message-ID: <222@m2.mfci.UUCP>
Date: 20 Jan 88 16:25:57 GMT
References: <235@unicom.UUCP> <28200088@ccvaxa>
Reply-To: colwell@m6.UUCP (Robert Colwell)
Organization: Multiflow Computer Inc., Branford, CT. 06405
Lines: 154

In article <28200088@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>>/* Written 2:43 pm Jan 17, 1988 by tim@amdcad.AMD.COM in ccvaxa:comp.arch */
>>In article <221@imagine.PAWL.RPI.EDU> userfe0e@mts.rpi.edu (George Kyriazis) writes:
>>| The only problem when doing that is jump instructions. Assume that memory
>>| operates at its fastest possible speed. If you meet a jump instruction
>>| in the middle of the 128-bit word, you'll have to (more or less) execute
>>| all the rest up till the end of the fetch. Some RISC CPU's have done this
>>| but for only one instruction. Can a compiler succesfully put 3 useful
>>| instructions after the jump?? Maybe it sounds too cheap: "After any jump
>>| the CPU executes at most three instructions after it". (Actually it turns
>>| out that the jump has to be in the first of the 4 instructions in the 128-bit
>>| word, so as the memory can get the right address.)
>>
>>It is very hard to effectively schedule more than 1 instruction after a
>>delayed-branch, and even more prohibitive to force branches to occur
>>only at 4-instruction boundaries. A solution to this problem is to
>>throw away the instructions coming from memory, sourcing them instead
>>from a cache while the instruction stream is restarted. This is what
>>the Branch-Target Cache is for on the Am29000.
>>
>> -- Tim Olson
>
>Another approach is to execute all of the instructions come what may.
>This is the VLIW, trace-scheduling, approach commercialized by Multiflow.
>
>
>
>Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801
>    aglew@gould.com - preferred, if you have nameserver
>    aglew@gswd-vms.gould.com - if you don't
>    aglew@gswd-vms.arpa - if you use DoD hosttable
>    aglew%mycroft@gswd-vms.arpa - domains are supposed to make things easier?
>
>My opinions are my own, and are not the opinions of my employer, or any
>other organisation. I indicate my company only so that the reader may
>account for any possible bias I may have towards our products.

Right, and moreover, Multiflow's Trace can have as many as 4 conditional
branches plus a fall-through in a given instruction (prioritized on a
per-instr basis using a simple but clever arbitration scheme.) While we
have a delayed-branch (2 clock beats per instruction, and one
set-of-branches per instruction, so at least one beat of real work always
gets executed) the compiler's forte is finding useful things to do and
scheduling them wherever they fit best.

We also use four 32-bit 65 nS buses at full tilt to fill the icache (same
buses are used for memory transfers, or cpu cluster-to-cluster transfers),
another recent topic of interest here.

Also, while I'm at it, we provide a way for programmers to indicate
likelihood of branching one way or another if they happen to know (people
who are porting their code from a Cray usually do). The keyword isn't
"Frequency" but the intent is the same. On the other hand, we have found
that our compiler does a fine job of predicting branch probabilities
without user input and also without having to profile the code. We
initially thought the profiling would be very important, but (at least for
branch prediction) it hasn't turned out that way.

-Bob Colwell
Multiflow Computer
175 N. Main St.
Branford, CT 06405 203-488-6090
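
For readers who want the multiway-branch idea in concrete form, here is a
rough sketch in Python. The post says only that an instruction can carry up
to 4 conditional branches plus a fall-through, prioritized per instruction;
it does not describe the arbitration scheme itself. Everything below
(the name next_pc, the (priority, taken, target) tuples, and the rule that
the lowest priority number wins) is an assumption made for illustration,
not a description of the Trace hardware.

```python
from typing import Sequence, Tuple

# Hypothetical model of one branch slot in a wide instruction:
# (priority, condition_is_true, target_address).
# Assumption: a lower priority number wins. The post does not say how
# the real arbitration works.
BranchSlot = Tuple[int, bool, int]

MAX_BRANCHES = 4  # "as many as 4 conditional branches" per instruction


def next_pc(fall_through: int, slots: Sequence[BranchSlot]) -> int:
    """Pick the next instruction address for one wide instruction."""
    if len(slots) > MAX_BRANCHES:
        raise ValueError("at most 4 conditional branches per instruction")
    # Keep only the branches whose condition evaluated true.
    taken = [slot for slot in slots if slot[1]]
    if not taken:
        # No condition held, so control falls through.
        return fall_through
    # Several conditions may hold at once; the highest-priority one wins.
    return min(taken, key=lambda slot: slot[0])[2]
```

With two true conditions at priorities 1 and 2, the call returns the target
of the priority-1 slot; with none true it returns the fall-through address.
The priority travels with each instruction rather than being fixed by slot
position, which is one way to read "prioritized on a per-instr basis".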