Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!cs.utexas.edu!uunet!yale!mfci!rodman
From: rodman@mfci.UUCP (Paul Rodman)
Newsgroups: comp.arch
Subject: VLIW assembly
Message-ID: <939@m3.mfci.UUCP>
Date: 7 Jul 89 14:18:07 GMT
References: <1035@aber-cs.UUCP> <1370@l.cc.purdue.edu> <2274@wyse.wyse.com> <3243@alliant.Alliant.COM>
Sender: rodman@mfci.UUCP
Reply-To: rodman@mfci.UUCP (Paul Rodman)
Organization: Multiflow Computer Inc., Branford Ct. 06405
Lines: 54

In article <3243@alliant.Alliant.COM> lewitt@Alliant.COM (Martin E. Lewitt) writes:
>
>Maybe some VLIWs out there are more difficult because they are pushing
>the technology harder, trying to encode more in an instruction word or
>something.  They might sacrifice some of the generality that their
>bus structure diagram would lead you to believe was there.  I'm curious
>about these experiences with other VLIW architectures.

Ok, as the person who wrote the FFT package for the Trace 7, 14 and
28/300 machines, I can throw in my $.02 worth.

Not much assembly has been written for the Trace machines, as the
compiler gets you very close to peak performance for considerably
less effort! :-) However, with an undergraduate physics background
I personally wanted to max out FFTs by writing hand-code. [I also
wanted to counter all those I've heard say "1024 bit VLIWs can't
be hand-coded"!]

Writing assembly for a heavily pipelined machine with a 1024 bit
instruction word is not "easy", but what makes it hard is NOT the size
of the instruction per se, so I agree with Mr. Lewitt. What makes it
hard is that you, the programmer, have so much hardware at your
fingertips that you refuse to allow a single unneeded instr to creep
into the algorithm. Sometimes this means you might be juggling a few
more balls than you thought...:-)

In general though, the flexibility of the VLIW instr word, i.e. no
funny conflicts and interdependencies in the encodings, is a breath
of fresh air compared to the typical CISC microword.
And you ALWAYS have another ALU or constant when you need it! A static
resource checker keeps you from making dumb resource conflict errors.

Of course, it isn't worth hand-coding things very often. Not many
programs have such an unbalanced profile as to make it worthwhile.

At the end of my FFT endeavour I had a kernel of 76 instructions that
could be run through the M4 macro processor to generate the FFT library
for all three machines, and for single or double precision.

The performance, needless to say, is as good as the hardware can do:

28/300:

    1 dimension, complex, 32 bit fft, 1024 point =  520 microseconds.
    1 dimension, complex, 64 bit fft, 1024 point =  930 microseconds.

    1 dimension, complex, 32 bit fft, 1e6 point  =  901 milliseconds.
    1 dimension, complex, 64 bit fft, 1e6 point  = 1768 milliseconds.

    2 dimension, complex, 32 bit fft, 1k x 1k    =  970 milliseconds.
    2 dimension, complex, 64 bit fft, 1k x 1k    = 1800 milliseconds.

Gratifying results. Try the large cases on your workstation.

-pkr [A non-quiche eater. :-)]
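
For anyone wanting to compare the timings above against another machine,
here is a rough sketch that converts them to a nominal MFLOPS rate. It
rests on assumptions that are not in the post: the common 5*N*log2(N)
flop-count convention for an N-point complex FFT, "1e6 point" taken to
mean 2**20 points, and the 2-D case counted as 1024 row FFTs plus 1024
column FFTs. The names fft_flops, mflops, CASES and report are made up
for illustration, and the rates it prints are derived estimates, not
figures quoted by the author.

```python
import math


def fft_flops(n):
    """Nominal flop count for one n-point complex FFT.

    Uses the conventional 5*n*log2(n) estimate (an assumption, not a
    count of the instructions the Trace kernel actually executes).
    """
    return 5.0 * n * math.log2(n)


def mflops(flops, seconds):
    """Convert a flop count and an elapsed time into MFLOPS."""
    return flops / seconds / 1e6


N_1K = 1024
N_1M = 2 ** 20  # assumption: "1e6 point" means 2**20 = 1048576 points

# Assumption: a 1k x 1k 2-D FFT is 1024 row FFTs plus 1024 column FFTs.
FLOPS_2D = 2 * N_1K * fft_flops(N_1K)

# (label, nominal flops, elapsed seconds); times are those quoted in the post.
CASES = [
    ("1-D 1024 point, 32 bit", fft_flops(N_1K), 520e-6),
    ("1-D 1024 point, 64 bit", fft_flops(N_1K), 930e-6),
    ("1-D 2**20 point, 32 bit", fft_flops(N_1M), 901e-3),
    ("1-D 2**20 point, 64 bit", fft_flops(N_1M), 1768e-3),
    ("2-D 1k x 1k, 32 bit", FLOPS_2D, 970e-3),
    ("2-D 1k x 1k, 64 bit", FLOPS_2D, 1800e-3),
]


def report():
    """Return (label, nominal MFLOPS) for every quoted timing."""
    return [(label, mflops(flops, secs)) for label, flops, secs in CASES]


if __name__ == "__main__":
    for label, rate in report():
        print(f"{label:26s} {rate:7.1f} MFLOPS (nominal)")
```

Under that convention a 1024 point FFT is 51200 flops, so the 520
microsecond case works out to roughly 98 MFLOPS nominal. To compare a
workstation, substitute your own measured times in CASES.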