Path: utzoo!mnetor!uunet!mcvax!dik
From: dik@cwi.nl (Dik T. Winter)
Newsgroups: comp.arch
Subject: Re: Press Release: Intel announces 80960 architecture
Message-ID: <7529@boring.cwi.nl>
Date: 14 Apr 88 00:29:24 GMT
References: <3358@omepd> <49265@sun.uucp> <7543@apple.UUCP>
Organization: CWI, Amsterdam
Lines: 35

In article <7543@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
 > Actually, yes. Despite some fairly clever scoreboarding, many simple
 > instructions take two cycles. This appears to happen because they have a single
 > port register file. For example: A+B->C, D+E->F. The second addition will
 > take 2 cycles. But: A+B->C, C+E->F. The second addition will take 1 cycles.
 > This is because they forward the ALU result to the second addition, which
 > saves them a cycle. Ironic, since forwarding usually make instructions run
 > just as fast as they would if there were no data dependencies; here, data
 > dependencies make it run faster!
 >
In vector machines this is a well-known feature, called short-stop.
For Cray-1 and Cray XMP this is true for operations on vector registers.
For Cyber 205 and ETA 10 this is true for operations on scalar registers.
It requires careful scheduling of your instructions. E.g. on the Cray a
short stop occurs some 7 cycles after instruction start; if you miss it
you have to wait until the previous instruction terminates. This makes it
possible for programs tuned for the Cray-1 to run slower on the XMP.
Similar things hold for the 205. Here in fact, if I remember correctly,
the instruction that uses the result of a previous instruction must be
issued within a very small time frame after the previous instruction to
benefit from the short stop. It should not be issued too early. E.g.
	A+B->C;C+D->E
might run slower than
	A+B->C;NOP;NOP;C+D->E
(You could of course issue instructions other than NOP:
	A+B->C;P+Q->R;V+W->Z;C+D->E
since the instructions are pipelined.)
A compiler writer's nightmare, I believe.
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax
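
The scheduling effect described above can be illustrated with a toy timing model. This is only a sketch under invented assumptions, not a model of any real Cray or Cyber pipeline: the names `Instr`, `NOP` and `issue_cycles`, the short-stop window of 2 to 3 cycles after the producer issues, and the 6-cycle write-back latency are all hypothetical numbers chosen to make the effect visible. A consumer that tries to issue inside the window picks the result up directly; one that arrives too early or too late waits until the producer has written its register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str          # destination register ("" for a NOP)
    srcs: tuple = ()   # source registers


NOP = Instr(dest="")


def issue_cycles(program, window=(2, 3), writeback=6):
    """Return the issue cycle of each instruction (in-order, one per cycle).

    window    -- hypothetical short-stop window, in cycles after the
                 producer issues, during which a consumer may grab the result
    writeback -- hypothetical cycles until the result is in the register file
    """
    lo, hi = window
    producer_issue = {}   # register -> issue cycle of its latest producer
    cycles = []
    t = 0
    for ins in program:
        attempt = t
        producers = [producer_issue[s] for s in ins.srcs if s in producer_issue]
        # Every pending source must either hit the short-stop window exactly
        # at the attempted cycle or already be written back.
        ok = all(lo <= attempt - p <= hi or attempt >= p + writeback
                 for p in producers)
        if ok:
            issue = attempt
        else:
            # Missed (too early or too late): wait for full write-back.
            issue = max([attempt] + [p + writeback for p in producers])
        cycles.append(issue)
        if ins.dest:
            producer_issue[ins.dest] = issue
        t = issue + 1
    return cycles


direct = [Instr("C", ("A", "B")), Instr("E", ("C", "D"))]
padded = [Instr("C", ("A", "B")), NOP, NOP, Instr("E", ("C", "D"))]
filled = [Instr("C", ("A", "B")), Instr("R", ("P", "Q")),
          Instr("Z", ("V", "W")), Instr("E", ("C", "D"))]

for name, prog in (("direct", direct), ("padded", padded), ("filled", filled)):
    print(name, issue_cycles(prog))
```

With these made-up parameters the dependent add in `direct` issues at cycle 6, while in `padded` and `filled` it issues at cycle 3, so the longer instruction sequence finishes sooner. Changing `window` or `writeback` shifts where the break-even point falls, which is the tuning problem a compiler would face.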