Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rutgers!ames!oliveb!intelca!intsc!tomk
From: tomk@intsc.UUCP (Tom Kohrs @fae)
Newsgroups: comp.sys.m68k,comp.sys.intel
Subject: 386 vs 020 and big benchmarks (sieve)
Message-ID: <933@intsc.UUCP>
Date: Thu, 16-Apr-87 20:56:01 EST
Article-I.D.: intsc.933
Posted: Thu Apr 16 20:56:01 1987
Date-Received: Sun, 19-Apr-87 10:41:32 EST
References: <930@intsc.UUCP> <513@omen.UUCP>
Distribution: comp
Organization: Intel Sales, Silicon Valley, Ca.
Lines: 96
Xref: mnetor comp.sys.m68k:363 comp.sys.intel:159

In article <513@omen.UUCP> caf@omen.UUCP (Chuck Forsberg) writes:
> In article <930@intsc.UUCP> tomk@intsc.UUCP (Tom Kohrs @fae) writes:
> 
> :Show me a benchmark that does not fit in 256 bytes thats even keeps up
                                            ^^^^^^^^^  (note for ref.)
> :with at 16MHz 386.  386's are now shipping at 20MHz for the speed freaks.
> :25MHz soon.
> 
> Well, here's one that takes 8k, somewhat larger than 256 bytes.  A 25 mHz
> 68020 board more than keeps up with a 18 mHz 386 box (let alone 16 mHz.)
> 
Code left in for reference.

> siev.c:
> #define S 8190
> char f[S+1];
> main()
> {
> /*	register long i,p,k,c,n;	For 32 bit entries for PC */
> 	register int i,p,k,c,n;
> 	for (n = 1; n <= 10; n++) {
> 		c = 0;
> 		for (i = 0; i <= S; i++) f[i] = 1;
___
| 		for (i = 0; i <= S; i++) {
| 			if (f[i]) {
| 				p = i + i + 3; k = i + p;
| 				while (k <= S) { f[k] = 0; k += p; }
| 				c++;
| 			}
| 		}
|__
> 	}
> 	printf("\n%d primes.\n", c);
> }
 
The following is the assembler output of the rcc compiler (no optimization)
under Unix for the inner loop of the sieve benchmark included above:

.L21:
	xorl	%edi,%edi
	jmp	.L26
.L27:
	cmpb	$0,f(%edi)
	je	.L28
	movl	%edi,%eax
	addl	%edi,%eax
	leal	3(%eax),%eax
	movl	%eax,%esi
	movl	%edi,%eax
	addl	%esi,%eax
	movl	%eax,%ebx
	jmp	.L30
.L31:
	movb	$0,f(%ebx)
	movl	%esi,%eax
	addl	%eax,%ebx
.L30:
	cmpl	$8190,%ebx
	jle	.L31
.L29:
	incl	-4(%ebp)
.L28:
	incl	%edi
.L26:
	cmpl	$8190,%edi
	jle	.L27

The compiler generated ~62 bytes of code (if I ever figure out sdb I will
know for sure).  Assuming the '020 compiler does not generate more than 4X
that amount of code, the whole loop will fit into the '020's 256 byte cache.
That is what I meant when I said that the benchmarks that show the '020 as
faster fit into 256 bytes.

If all you want to do is calculate sieves all day then use the '020.  But
if you want to do real crunching on large problems then the 386 will run
circles around the '020.  That is not to say the '020 with the 256 byte
cache does not have its niches.  There are a number of applications in the
embedded control area that have inner loops that fit nicely in 256 bytes.
Line drawing routines in graphics applications are one; that's why we build
H/W accelerators for them.  If performance is what you need on megabyte
size problems, the 386 will give you 50% - 75% more speed at the same clock
rate.

BTW:  The numbers for the 386 on the 18MHz CompDyn (.59sec) matched what
I got under Unix on my MB-I box (.59sec).  The biggest performance hit
on this benchmark for the systems tested is due to the wait states taken
for a write (3ws on the MB-I board).  This could easily be fixed on a
system with posted writes or a write-back cache.

> Compile-Link  Execute		Code
> Real	User	Real	User	Bytes	System
> 
> 7.4	.8	.34	.3416	124	Definicom SYS 68020 25mHz SiVlly 11/86
> 11.8	2.8	.56	.56	131	CompDyn (Intel MB) + 386 Toolkit 12/86
  3.0	.6	.59	.59	?	Intel 310/386 16MHz Unix V.3 rcc  4/16