Xref: utzoo comp.arch:8690 comp.sys.intel:748
Path: utzoo!utgpu!utstat!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!decwrl!decvax!ima!haddock!suitti
From: suitti@haddock.ima.isc.com (Stephen Uitti)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: i860 overview (long)
Message-ID: <12000@haddock.ima.isc.com>
Date: 9 Mar 89 20:01:21 GMT
References: <807@microsoft.UUCP> <92634@sun.uucp> <13322@steinmetz.ge.com> <1133@auspex.UUCP>
Reply-To: suitti@haddock.ima.isc.com (Stephen Uitti)
Organization: Interactive Systems, Boston
Lines: 132

In article <1133@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes:
>>One problem with any chip which requires aligned data is that
>>performance suffers when addressing bytes, to the point that a program
>>may become impractical.
>
> [talk about instruction times being the same for byte/word/long
> accesses on SPARC, MIPS].

Byte accesses on the PDP-10 were slower - one had to set up a byte
pointer and do special load-byte or load-byte-and-increment-the-pointer
instructions.  Still, bytes were any size from 1 bit to 36...

Also remember that even if an 8 bit byte access takes (about) the
same time as a 32 bit word access, it still moves less data.
I've had some code do its work using larger quantities for just
this reason.  Usually, the code is #ifdef'ed, so that the easier
version can at least be read if not used.  One can often do
"vector bit" operations a word at a time.  The whole "Duff's
device" bcopy & memcpy discussions of a few months ago are at
least partly based on this idea.
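
A minimal sketch of the word-at-a-time idea, assuming a bit vector
stored as an array of bytes.  The function names are made up for
illustration and are not from any particular library.  The byte
version is the easy one to read; the word version ORs 32 bits per
pass and uses memcpy for each word so that it stays legal on
machines that forbid unaligned references.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Easy-to-read reference: OR src into dst one byte at a time. */
void bitvec_or_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] |= src[i];
}

/* Same result, but 32 bits per loop pass.  memcpy moves each word in
 * and out of a local, so no pointer is ever dereferenced unaligned. */
void bitvec_or_words(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i = 0;

    for (; i + sizeof(uint32_t) <= n; i += sizeof(uint32_t)) {
        uint32_t a, b;

        memcpy(&a, dst + i, sizeof a);
        memcpy(&b, src + i, sizeof b);
        a |= b;
        memcpy(dst + i, &a, sizeof a);
    }
    /* Leftover bytes when n is not a multiple of the word size. */
    for (; i < n; i++)
        dst[i] |= src[i];
}
```

In real code the two versions would typically sit behind an #ifdef,
with the byte version kept as the readable reference.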

>BTW, there exist CISC machines that require alignment, as well; as I
>remember, all but the most recent AT&T WE32K chips require it.

The VAX doesn't require alignment - but don't count on that.  A 32
bit word reference to an odd address is real slow.  That's why the C
compiler there does so much word alignment.  Even so, one would
see a program that worked on a VAX that would die on a machine
which would just plain forbid the operation.  Data became
unaligned, typically by writing them to disk and then reading
them back in.  The VAX would be slow for the operation (nobody
cared), but other machines would yield bus errors.
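
A hedged sketch of how code can read a 32 bit value out of a disk
record without caring where it landed in the buffer.  The name
load_le32 is hypothetical, and the sketch assumes the value was
written least significant byte first (as a VAX would write it).
Only byte loads are issued, so nothing here can raise a bus error.

```c
#include <stdint.h>

/* Assemble a 32 bit little-endian value from four bytes at p.
 * p may be at any address, odd or even. */
uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

The usual failing alternative is casting the buffer address to a
long pointer and dereferencing it, which is the operation that works
(slowly) on a VAX and dies elsewhere.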

It seems to me that if an architecture traps unaligned data
references, the kernel can look at the instruction that faulted
and make it appear to work via software.  uVAX IIs implement all
sorts of VAX instructions that just aren't in the hardware.  Both
VMS & flavors of UNIX do this (sometimes even correctly).
(Remember, DEC said these things would work, even though there
are billions of them & the uVAX II CPU fits on a QBus board...
and with a MB of RAM.)  Almost no one uses these instructions, so
who cares?  If the compilers try to make things aligned, and if
the Operating System fixes things when botched, and if the
Operating System provides a way for the user (programmer) to
detect that it happened, and how much, then everyone should be
happy.  I'd be willing to have unaligned data fetches work 100x
slower if the overall architecture could be otherwise, say, twice
as fast (because there was enough chip space for an I cache or
FPU or something).
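
A user-level sketch of the fix-up idea, not real kernel code.  It
assumes the trap handler has already decoded the faulting
instruction down to "load 32 bits from this address".  The names
emulated_load32 and unaligned_fixup_count are invented for
illustration; the counter stands in for whatever interface the
Operating System might offer for seeing how often it happened.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical count of fix-ups, so the programmer can find out that
 * unaligned references happened, and how many. */
static unsigned long unaligned_fixups = 0;

/* Model of the handler's work for one 32 bit load: note the fix-up
 * if the address is misaligned, then fetch the bytes the slow way. */
uint32_t emulated_load32(const void *addr)
{
    uint32_t value;

    if ((uintptr_t)addr % sizeof(uint32_t) != 0)
        unaligned_fixups++;
    memcpy(&value, addr, sizeof value);
    return value;
}

/* What a program would query to see whether it is paying the penalty. */
unsigned long unaligned_fixup_count(void)
{
    return unaligned_fixups;
}
```

A program that sees the count climb knows it has data worth
realigning; one that sees zero pays nothing for the feature.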

>>One of the people here checked his Sun-30 (68020) against his Sun-4
>>(SPARC). The three ran troff about 5x faster.

> [attempted explanations]
>This leaves 1) or 2); is there one I missed?

I had one VAX 780 outperform another due to the system binaries
for the program being different.  Recompilation & cross running
showed that the hardware was the same.  Of course, the Sun 3
and Sun 4 are not binary compatible, and the original user
probably doesn't have sources...

I had one VAX 780 outperform another by 20% due to a ringing
9600 baud tty line.  It had been that way for months - no one
noticed...

I ran various "benchmarks" between uVAX IIs and Sun 4s.  The
range was about 2x to over 8x, averaging about 4x.  I never got
the 10 (VAX) MIPS figures that were commonly quoted.  VAX 780s
really are a little faster than uVAX IIs.

(aside:) In the olden days when 68000s were brand new, the EE
dept at Purdue was considering getting a bunch of 68000s, with
troff in ROM & some communication gear, and having troff run on the
dedicated boxes.  The 68000 could run troff at something like 90%
the speed of the 780, which was likely to be much more CPU than a
user could get out of the 780s there.  I remember wondering if
the I/O would kill the 780s making the whole exercise moot...
Remote execution (load sharing) on the local ethernet was
implemented and it did work pretty well, technically (politically
was another matter).  I had thought that having a pre-built
(buildcore) "troff -ms", etc., would save them more.  I recall it
taking troff something like 20 seconds to do the initialization
for the first .PP for the "-ms" macros.  Pretty gross if you ask
me (don't ask).

>I tried comparing "troff"s on a Sun-3/50 with 4MB memory, and a
>Sun-4/260 with 32MB memory, both running 4.0.  Here are the times:
>
>Sun-4/260:
>	auspex% time troff -t -man /usr/man/man1/csh.1 >/dev/null
>	24.4u 1.2s 0:34 75% 0+456k 26+38io 31pf+0w
>	auspex% time troff -t -man /usr/man/man1/csh.1 > /dev/null
>	24.4u 1.5s 0:36 71% 0+464k 1+35io 0pf+0w
>
>Sun-3/50:
>
>	bootme% time troff -t -man /usr/man/man1/csh.1 >/dev/null
>	118.9u 1.2s 2:08 93% 0+208k 14+33io 24pf+0w
>	bootme% time troff -t -man /usr/man/man1/csh.1 > /dev/null
>	120.2u 2.8s 2:31 81% 0+192k 5+32io 11pf+0w
>
>The 4/260 did 5x *better* than the 3/50, not 5x *worse*, on that
>example!  Could 1) be the correct explanation?

The VAX 780 here running 4.3 BSD had this to say:

	haddock% time troff -t -man /usr/man/man1/csh.1 >/dev/null
	troff: unrecognized -t option
	0.1u 0.0s...

This is much faster than the Suns.  It just optimized the
operation a bit, being an "experienced VAX" (as opposed to a
"used VAX").  The Compaq 386/25 sitting here was even faster,
saying something like "troff command not found".  I'm unfamiliar
with the "-t" option.

	haddock% time troff -man /usr/man/man1/csh.1 >/dev/null
	90.8u 6.4s 36% 95+201k 59+15io 24pf+0w

I thought Sun 3's were lots faster than 780s.  Maybe more
expensive Sun 3s are faster...  Of course, my /usr/man/man1/csh.1
could be different, though it is probably at least real similar.
Also, I think 'troff' is one of those applications that has odd
behaviour compared to just about anything else one would run.

It should be pointed out (if it hasn't been already) that troff
doesn't do nearly the byte accesses that one would think it
should do.  Still, troff is a great benchmark for sites that do
a lot of troff.

Stephen Uitti, suitti@ima.ima.isc.com (near harvard.harvard.edu)