Path: utzoo!utgpu!attcan!uunet!lll-winken!lll-tis!helios.ee.lbl.gov!pasteur!ucbvax!rusvx1.rus.uni-stuttgart.dbp.de!nittmann
From: nittmann@rusvx1.rus.uni-stuttgart.dbp.de ("Michael F.H. Nittmann ")
Newsgroups: comp.protocols.tcp-ip
Subject: some interim notes on the bsd network speedups
Message-ID: <155:nittmann@rusvx1.rus.uni-stuttgart.dbp.de>
Date: 9 Aug 88 03:31:36 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The Internet
Lines: 93

Bravo! This is what I would call REAL speedups;
 
as I understand the result, there is now a Sun 3/60 under SunOS4
with a 750kB/s transfer rate and 1.25MB/s forwarding capacity.
                                               
I in no way mean to discredit this work by offering some remarks on
the mail article, and I do not want anybody to quote me selectively
or to read negative intent into them. A good artist always merits some
criticism, not only mindless applause along with the crowd.

One remark on the 50% cpu load on forwarding: I guess this comes from
the ethernet being at its capacity limit, so that the interface
imposes cpu idle cycles on the transfer activities.

One remark on loop unrolling: a Sun is a pipelined processor machine,
BUT the compiler seems to provide neither memory prefetches nor
optimized loop structures. Only in this context can unrolling scalar
loops on a single processor machine yield results.
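
To make the unrolling remark concrete, here is a minimal sketch in C.
It is not the code from the notes under discussion: the 16-bit
ones-complement sum and the names fold, sum_simple and sum_unrolled are
assumptions made for illustration, chosen only because such a summing
loop is typical of packet handling. The plain version pays one counter
test and branch per word added, the unrolled version one per four words.
Whether that pays off depends on the compiler and the machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold the carries of a 32-bit accumulator back into 16 bits. */
static uint16_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Plain loop: one add plus one counter test and branch per word. */
uint16_t sum_simple(const uint16_t *w, size_t n)
{
    uint32_t sum = 0;
    assert(n <= 65536);          /* keeps the 32-bit accumulator safe */
    for (size_t i = 0; i < n; i++)
        sum += w[i];
    return fold(sum);
}

/* Unrolled by four: one counter test and branch per four adds. */
uint16_t sum_unrolled(const uint16_t *w, size_t n)
{
    uint32_t sum = 0;
    size_t i = 0;
    assert(n <= 65536);          /* keeps the 32-bit accumulator safe */
    for (; i + 4 <= n; i += 4) {
        sum += w[i];
        sum += w[i + 1];
        sum += w[i + 2];
        sum += w[i + 3];
    }
    for (; i < n; i++)           /* leftover words, at most three */
        sum += w[i];
    return fold(sum);
}
```

Both functions return the same value for any input; only the ratio of
loop bookkeeping to useful adds differs.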

One remark on "overhead": if you have a machine with slow scalar
performance and a simple compiler that does not know about advantageous
placement of memory fetch instructions, the loop overhead seems to be
small compared with the instruction completion times within the
loop. But this is not small loop overhead! It only looks small because
the loop body is so slow.

A remark on the broken Crays: it is absolutely true that on a fast
machine with sophisticated instruction buffering, with compilers
that take full advantage of memory "oddities", and with multiple
functional units working in parallel per processor, the overhead of
returning to the first instruction of a loop becomes prohibitive
if the next instruction lies outside sequential memory and outside the
instruction buffers and the like. But the overhead is only big relative
to the instruction time spent within the loop, which is already smaller
than the sum of the individual instruction times; that seems not to be
true in the case of the Sun. (I intentionally left vectorization aside
since this does not exist on the Sun.)

So Your school is perfectly right - for Your chosen combination of
software (compiler, virtual memory, processor architecture ... ) and
hardware.
And that's it!

The fact that just 1kB segments seem to be the optimum copy size
looks to me more like a Sun specific sort of resonance effect between
memory transfers, virtual memory management and interface readiness
ON THE LOCAL MACHINE.

As a physicist I learned to look very critically at experiments and their
setups, and I would like to point out that Your performance test between
two (I guess equally optimized) Suns certainly shows Your great competence
in the field and the very real improvements Your code changes brought to
SunOS4. But the experiment strictly holds only for Your configuration
of two machines that operate at their - equal !!! - optimum configuration
of packet munching parameters. This is a sort of Formula 1 race.

In a network You have to deal with what You call "broken" partners.
They also work at least close to their respective optimum parameter
configuration - or rather, they would like to. Isn't it a sort of racism
to declare that broken?

And in a network You have to deal with the network medium itself.
And no doubt: the bigger the chunks are, the better the network throughput
is. Yes, I know that this is true for the monochromatic case of all-equal
packet lengths, but a packet size distribution primarily causes
disadvantages for the individual transfer (a small packet waiting for a
big one to pass).
Now, if You operate a machine whose interface can deliver 100MB but the
fastest channel You have will only gulp 10MB or 50MB (Ether, NSC Hyper),
then from the point of view of writing networking code on such a fast
machine You have an interest in flooding the net with a big packet and
then continuing with the honest work of letting users use the CPU until
the 4MB workstation may eventually accept another packet. And even under
these circumstances the networking software has to be as fast as possible,
to waste as few resources and cpu cycles as possible, since Your CPU does
not earn money by servicing 100MB interfaces with a CPU of some GB
bandwidth, nor by pushing around small memory segments, but by giving CPU
cycles to users' number crunching codes.
(This is a dig at the people who think "oh, the link is only xxxkB/s,
why bother writing code that could serve yyMB/s".)
So in my opinion there are more viewpoints to consider than the
record breaking perspective in a clinical and artificial environment.
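
The chunk size argument can be put into rough numbers. The following C
sketch assumes a fixed per-packet cost of 78 bytes (Ethernet preamble,
header, frame check and interframe gap, 38 bytes, plus 40 bytes of IP
and TCP headers without options) on a 10 Mbit/s Ethernet. The function
names and the constant are illustrative assumptions, not figures from
the tests discussed above.

```c
#include <assert.h>

/* Assumed per-packet cost in bytes on the wire: Ethernet preamble (8),
 * header (14), frame check (4), interframe gap (12), IP (20), TCP (20). */
#define ASSUMED_OVERHEAD_BYTES 78u

/* Fraction of the wire time that carries payload. */
double wire_efficiency(unsigned payload_bytes, unsigned overhead_bytes)
{
    assert(payload_bytes + overhead_bytes > 0);
    return (double)payload_bytes / (double)(payload_bytes + overhead_bytes);
}

/* Payload bytes per second on a link running at link_bits_per_sec. */
double payload_rate(double link_bits_per_sec, unsigned payload_bytes,
                    unsigned overhead_bytes)
{
    return link_bits_per_sec / 8.0
           * wire_efficiency(payload_bytes, overhead_bytes);
}

/* Seconds a packet occupies the wire; a small packet queued behind it
 * has to wait at least this long. */
double wire_time(double link_bits_per_sec, unsigned bytes_on_wire)
{
    assert(link_bits_per_sec > 0.0);
    return (double)bytes_on_wire * 8.0 / link_bits_per_sec;
}
```

Under these assumptions wire_efficiency(1024, 78) comes to about 0.93 and
wire_efficiency(64, 78) to about 0.45, while a 1102-byte packet holds a
10 Mbit/s wire for roughly 0.88 ms. Bigger chunks use the medium better,
and the price is the waiting time they impose on whatever is queued
behind them.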


disclaimers as usual (sometimes I have a bad conscience because somebody
pays the network costs for my private opinions)
                                                      

Michael.