Xref: utzoo comp.arch:8675 comp.sys.intel:742
Path: utzoo!yunexus!torsqnt!dptcdc!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!rutgers!apple!amdcad!sun!pitstop!sundc!seismo!uunet!auspex!guy
From: guy@auspex.UUCP (Guy Harris)
Newsgroups: comp.arch,comp.sys.intel
Subject: Re: i860 overview (long)
Message-ID: <1133@auspex.UUCP>
Date: 8 Mar 89 07:44:15 GMT
Article-I.D.: auspex.1133
References: <807@microsoft.UUCP> <92634@sun.uucp> <13322@steinmetz.ge.com>
Reply-To: guy@auspex.UUCP (Guy Harris)
Organization: Auspex Systems, Santa Clara
Lines: 62

>One problem with any chip which requires aligned data is that
>performance suffers when addressing bytes, to the point that a program
>may become impractical.

I don't think that's true.  My handy-dandy Cypress CY7C600 Family Users
Guide, for the Cypress SPARC implementation, says that LDSB (LoaD Signed
Byte), LDSH (LoaD Signed Halfword - 16 bits), LDUB (LoaD Unsigned Byte),
LDUH (obvious), and LD (LoaD word - 32 bits), all take 2 cycles.

My handy-dandy MIPS R2000 RISC Architecture manual, alas, has no timings
such as that - after all, it's an *architecture* manual, not a manual
for some particular *implementation* - but I'd be *very* surprised if
byte load/store operations were so much slower that "a program (such as
'troff') may become impractical".  (My expectation is that they're no
slower, just as on SPARC.)

Are you, perhaps, thinking of word-addressable machines, and under the
impression that not only do RISC machines tend to require, say, 4-byte
alignment of 4-byte quantities, but that they can't deal with quantities
shorter than 4 bytes?  That's simply not true of the RISC machines with
which I'm familiar.
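
To make the distinction concrete, here is a rough sketch (in Python,
purely as a model) of what software would have to do to fetch a byte if
the machine could only load aligned 32-bit words.  The function names
and the big-endian byte order are my assumptions for illustration; the
point is that on SPARC the LDUB/LDSB instructions do this selection in
hardware, in the same 2 cycles as LD, so no such code is needed.

```python
def load_unsigned_byte(words, byte_addr):
    """Model LDUB using only aligned 32-bit word fetches (hypothetical helper)."""
    word = words[byte_addr // 4]        # fetch the word containing the byte
    shift = (3 - byte_addr % 4) * 8     # assumes big-endian: byte 0 is most significant
    return (word >> shift) & 0xFF       # shift the byte down and mask it off


def load_signed_byte(words, byte_addr):
    """Model LDSB: same fetch, then sign-extend from bit 7."""
    b = load_unsigned_byte(words, byte_addr)
    return b - 0x100 if b & 0x80 else b


memory = [0x41F2437F]                   # one 32-bit word holding four bytes
print(load_unsigned_byte(memory, 1))    # the byte 0xF2, zero-extended
print(load_signed_byte(memory, 1))      # the same byte, sign-extended
```

A machine limited to word loads pays for the extra shift and mask on
every byte access; a byte-addressed machine that merely requires
alignment of larger quantities does not.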

BTW, there exist CISC machines that require alignment, as well; as I
remember, all but the most recent AT&T WE32K chips require it.

>One of the people here checked his Sun-3 (68020) against his Sun-4
>(SPARC). The three ran troff about 5x faster.

The only three explanations I can imagine for that, offhand, are:

	1) he's got the two figures backwards; the Sun-4 was ~5x faster
	   than the Sun-3;

	2) the figures are real time, not CPU time, and something else
	   is interfering;

	3) "troff" is floating-point intensive, and the Sun-4 in
	   question has no FPU (e.g., a 4/110 with no FPU).

Explanation 3) falls by the wayside rather quickly; I grepped for
"float" and "double" throughout the code and found neither.

This leaves 1) or 2); is there one I missed?

I tried comparing "troff"s on a Sun-3/50 with 4MB memory, and a
Sun-4/260 with 32MB memory, both running 4.0.  Here are the times:

Sun-4/260:
	auspex% time troff -t -man /usr/man/man1/csh.1 >/dev/null
	24.4u 1.2s 0:34 75% 0+456k 26+38io 31pf+0w
	auspex% time troff -t -man /usr/man/man1/csh.1 > /dev/null
	24.4u 1.5s 0:36 71% 0+464k 1+35io 0pf+0w

Sun-3/50:
	bootme% time troff -t -man /usr/man/man1/csh.1 >/dev/null
	118.9u 1.2s 2:08 93% 0+208k 14+33io 24pf+0w
	bootme% time troff -t -man /usr/man/man1/csh.1 > /dev/null
	120.2u 2.8s 2:31 81% 0+192k 5+32io 11pf+0w

The 4/260 did 5x *better* than the 3/50, not 5x *worse*, on that
example!  Could 1) be the correct explanation?
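
For anyone who wants to check the arithmetic, here is a small sketch
that pulls the user CPU times out of the "time" lines above and takes
the ratio.  The helper names are mine, and it assumes the csh "time"
layout shown above (user seconds first, system seconds second); it
compares user time only.

```python
def cpu_seconds(time_line):
    """Return (user, system) CPU seconds from one line of csh `time` output."""
    fields = time_line.split()
    return float(fields[0].rstrip("u")), float(fields[1].rstrip("s"))


def mean_user(lines):
    """Average user CPU seconds over several runs."""
    return sum(cpu_seconds(line)[0] for line in lines) / len(lines)


# The four measurements quoted in the post.
SUN4_260 = [
    "24.4u 1.2s 0:34 75% 0+456k 26+38io 31pf+0w",
    "24.4u 1.5s 0:36 71% 0+464k 1+35io 0pf+0w",
]
SUN3_50 = [
    "118.9u 1.2s 2:08 93% 0+208k 14+33io 24pf+0w",
    "120.2u 2.8s 2:31 81% 0+192k 5+32io 11pf+0w",
]

ratio = mean_user(SUN3_50) / mean_user(SUN4_260)
print("Sun-4/260 vs. Sun-3/50, user CPU time: %.1fx" % ratio)
```

The ratio comes out a little under 5, which is where the "5x" above
comes from.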