Xref: utzoo comp.arch:8690 comp.sys.intel:748
Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!decwrl!decvax!ima!haddock!suitti
From: suitti@haddock.ima.isc.com (Stephen Uitti)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: i860 overview (long)
Message-ID: <12000@haddock.ima.isc.com>
Date: 9 Mar 89 20:01:21 GMT
References: <807@microsoft.UUCP> <92634@sun.uucp> <13322@steinmetz.ge.com> <1133@auspex.UUCP>
Reply-To: suitti@haddock.ima.isc.com (Stephen Uitti)
Organization: Interactive Systems, Boston
Lines: 132

In article <1133@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
>>One problem with any chip which requires aligned data is that
>>performance suffers when addressing bytes, to the point that a program
>>may become impractical.
>
> [talk about instruction times being the same for byte/word/long
> accesses on SPARC, MIPS].

Byte accesses on the PDP-10 were slower - one had to set up a byte
pointer and do special load-byte or
load-byte-and-increment-the-pointer instructions.  Still, bytes were
any size from 1 bit to 36...

Also remember that even if an 8 bit byte access takes (about) the same
time as a 32 bit word access, it still moves less data.  I've had some
code do its work using larger quantities for just this reason.
Usually, the code is #ifdef'ed, so that the easier version can at
least be read if not used.  One can often do "vector bit" operations a
word at a time.  The whole "duff's device" bcopy & memcpy discussions
of a few months ago are at least partly based on this idea.

>BTW, there exist CISC machines that require alignment, as well; as I
>remember, all but the most recent AT&T WE32K chips require it.

The VAX doesn't require it - but don't do it.  A 32 bit word reference
to an odd address is real slow.  That's why the C compiler there does
so much word alignment.  Even so, one would see a program that worked
on a VAX that would die on a machine which would just plain forbid the
operation.
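
To make the word-at-a-time idea concrete, here is a minimal sketch, not a tuned routine.  It ORs one bit vector into another 32 bits at a time, and goes through memcpy so that it never makes an unaligned word reference, whatever addresses it is handed.  The names bitvec_or and bitvec_or_bytes are made up for this example, and it assumes a C99 compiler with <stdint.h>.  Whether the compiler turns each small memcpy into a single load or store is an assumption about the compiler, not something the language promises.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* OR the n bytes at src into dst, one 32-bit word at a time.
 * (Hypothetical helper name; not from the original post.) */
void bitvec_or(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;

    assert(n == 0 || (dst != NULL && src != NULL));

    /* Bulk: copy each word into a local with memcpy, so no unaligned
     * word reference is made even if dst or src is at an odd address. */
    for (; i + sizeof(uint32_t) <= n; i += sizeof(uint32_t)) {
        uint32_t d, s;
        memcpy(&d, dst + i, sizeof d);
        memcpy(&s, src + i, sizeof s);
        d |= s;
        memcpy(dst + i, &d, sizeof d);
    }

    /* Tail: whatever bytes are left over, one at a time. */
    for (; i < n; i++)
        dst[i] |= src[i];
}

/* The "easier version", one byte at a time, kept for comparison. */
void bitvec_or_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] |= src[i];
}
```

Casting dst + i to a uint32_t pointer and dereferencing it would be the shorter way to write the loop, but that is exactly the kind of reference that is merely slow on a VAX and fatal on a machine that forbids it.
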
Data became unaligned, typically by writing them to disk and then
reading them back in.  The VAX would be slow for the operation (nobody
cared), but other machines would yield bus errors.

It seems to me that if an architecture traps unaligned data
references, the kernel can look at the instruction that faulted and
make it appear to work via software.  uVAX IIs implement all sorts of
VAX instructions that just aren't in the hardware.  Both VMS & flavors
of UNIX do this (sometimes even correctly).  (Remember, DEC said these
things would work, even though there are billions of them & the uVAX
II CPU fits on a QBus board... and with a MB of RAM.)  Almost no one
uses these instructions, so who cares?

If the compilers try to make things aligned, and if the Operating
System fixes things when botched, and if the Operating System provides
a way for the user (programmer) to detect that it happened, and how
much, then everyone should be happy.  I'd be willing to have unaligned
data fetches work 100x slower if the overall architecture could be
otherwise, say, twice as fast (because there was enough chip space for
an I cache or FPU or something).

>>One of the people here checked his Sun-30 (68020) against his Sun-4
>>(SPARC). The three ran troff about 5x faster.
> [attempted explanations]
>This leaves 1) or 2); is there one I missed?

I had one VAX 780 outperform another due to the system binaries for
the program being different.  Recompilation & cross running showed
that the hardware was the same.  Of course, the Sun 3 and Sun 4 are
not binary compatible, and the original user probably doesn't have
sources...

I had one VAX 780 outperform another by 20% due to a ringing 9600 BAUD
tty line.  It had been that way for months - no one noticed...

I ran various "benchmarks" between uVAX IIs and Sun 4s.  The range was
about 2x to over 8x, averaging about 4x.  I never got the 10 (VAX)
MIPS figures that were commonly quoted.  VAX 780s really are a little
faster than uVAX IIs.

(aside:) In the olden days when 68000s were brand new, the EE dept at
Purdue was considering getting a bunch of 68000s, with troff in ROM &
some communication gear, and have troff run on the dedicated boxes.
The 68000 could run troff at something like 90% the speed of the 780,
which was likely to be much more CPU than a user could get out of the
780s there.  I remember wondering if the I/O would kill the 780s
making the whole exercise moot...  Remote execution (load sharing) on
the local ethernet was implemented and it did work pretty well,
technically (politically was another matter).  I had thought that
having a pre-built (buildcore) "troff -ms", etc., would save them
more.  I recall it taking troff something like 20 seconds to do the
initialization for the first .PP for the "-ms" macros.  Pretty gross
if you ask me (don't ask).

>I tried comparing "troff"s on a Sun-3/50 with 4MB memory, and a
>Sun-4/260 with 32MB memory, both running 4.0.  Here are the times:
>
>Sun-4/260:
>    auspex% time troff -t -man /usr/man/man1/csh.1 >/dev/null
>    24.4u 1.2s 0:34 75% 0+456k 26+38io 31pf+0w
>    auspex% time troff -t -man /usr/man/man1/csh.1 > /dev/null
>    24.4u 1.5s 0:36 71% 0+464k 1+35io 0pf+0w
>
>Sun-3/50:
>
>    bootme% time troff -t -man /usr/man/man1/csh.1 >/dev/null
>    118.9u 1.2s 2:08 93% 0+208k 14+33io 24pf+0w
>    bootme% time troff -t -man /usr/man/man1/csh.1 > /dev/null
>    120.2u 2.8s 2:31 81% 0+192k 5+32io 11pf+0w
>
>The 4/260 did 5x *better* than the 3/50, not 5x *worse*, on that
>example!  Could 1) be the correct explanation?

The VAX 780 here running 4.3 BSD had this to say:

    haddock% time troff -t -man /usr/man/man1/csh.1 >/dev/null
    troff: unrecognized -t option
    0.1u 0.0s...

This is much faster than the Suns.  It just optimized the operation a
bit, being an "experienced VAX" (as opposed to a "used VAX").  The
Compaq 386/25 sitting here was even faster, saying something like
"troff command not found".  I'm unfamiliar with the "-t" option.

    haddock% time troff -man /usr/man/man1/csh.1 >/dev/null
    90.8u 6.4s 36% 95+201k 59+15io 24pf+0w

I thought Sun 3's were lots faster than 780s.  Maybe more expensive
Sun 3s are faster...  Of course, my /usr/man/man1/csh.1 could be
different, though it is probably at least real similar.

Also, I think 'troff' is one of those applications that has odd
behaviour compared to just about anything else one would run.  It
should be pointed out (if it hasn't been already) that troff doesn't
do nearly the byte accesses that one would think it should do.  Still,
troff is a great benchmark for sites that do a lot of troff.

Stephen Uitti, suitti@ima.ima.isc.com (near harvard.harvard.edu)