Path: utzoo!utgpu!attcan!uunet!lll-winken!lll-tis!helios.ee.lbl.gov!pasteur!ucbvax!rusvx1.rus.uni-stuttgart.dbp.de!nittmann
From: nittmann@rusvx1.rus.uni-stuttgart.dbp.de ("Michael F.H. Nittmann ")
Newsgroups: comp.protocols.tcp-ip
Subject: some interim notes on the bsd network speedups
Message-ID: <155:nittmann@rusvx1.rus.uni-stuttgart.dbp.de>
Date: 9 Aug 88 03:31:36 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The Internet
Lines: 93

Bravo! This is what I would call REAL speedups; as I understand the
result, there is now a Sun 3/60 under SunOS4 with a 750kB/s transfer
rate and 1.25MB/s forwarding capacity.

I do not in any way want to discredit this work by adding some remarks
on the mail article, and I do not want anybody to quote me trickily or
to imply negative content. A good artist always merits some criticism,
not only handclapping along with the crowd.

One remark on the 50% CPU load when forwarding: I guess this comes from
the fact that the Ethernet is at its capacity limit, so the interface
imposes CPU idle cycles on the transfer activities.

One remark on loop unrolling: a Sun is a machine with a pipelined
processor, BUT the compiler seems to provide neither memory prefetches
nor optimized loop structures. Only in this context can unrolling
scalar loops on a single-processor machine have results.

One remark on "overhead": if you have a machine with slow scalar
performance and a simple compiler that does not know about advantageous
placement of memory fetch instructions, the loop overhead seems small
with respect to the instruction completion times within the loop. But
this is not small loop overhead!

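To make the loop overhead argument concrete, here is a minimal sketch
in C of the same work done by a plain loop and by a loop unrolled four
times. The function names, the 16-bit word sum and the unrolling factor
of 4 are assumptions chosen for illustration; this is not the code from
the BSD speedups.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain loop: one add, then one counter update and branch, per word. */
uint32_t sum_words_plain(const uint16_t *w, size_t n)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += w[i];
    return sum;
}

/* Unrolled by 4 (hypothetical factor): four adds per counter update
 * and branch, so the loop overhead is paid a quarter as often. */
uint32_t sum_words_unrolled4(const uint16_t *w, size_t n)
{
    uint32_t sum = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        sum += w[i];
        sum += w[i + 1];
        sum += w[i + 2];
        sum += w[i + 3];
    }
    for (; i < n; i++)      /* leftover 0..3 words */
        sum += w[i];
    return sum;
}
```

Both functions return the same value for any input. Whether the
unrolled one is actually faster depends on the compiler and on the
machine, which is exactly the point of the remarks here.
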
A remark on the broken Crays: it is absolutely true that on a fast
machine with sophisticated instruction buffering, with compilers that
take full advantage of memory "oddities", and with multiple functional
units working in parallel per processor, the overhead of the return to
the first instruction of a loop becomes prohibitive if the next
instruction lies outside the memory sequences, outside the instruction
buffers and the like. But the overhead is only big relative to the
instruction time spent within the loop, which is already smaller than
the sum of the times of the executed instructions, and that seems not
to be true in the case of the Sun. (I intentionally left vectorization
aside since it does not exist on the Sun.)

So your school is perfectly right - for your chosen combination of
software (compiler, virtual memory, processor architecture ...) and
hardware. And that's it!

The fact that just 1kB segments seem to be the optimum copy size seems
to me more like a Sun-specific sort of resonance effect between memory
transfers, virtual memory management and interface readiness ON THE
LOCAL MACHINE.

As a physicist I learned to look very critically at experiments and
their setups, and I would like to point out that your performance test
between two (I guess) equally optimized Suns certainly shows your great
competence in the field and the very real improvements your code
changes brought to SunOS4. But strictly, the experiment only holds for
your configuration of two machines that operate at their - equal !!! -
optimum configuration of packet munching parameters. This is sort of a
Formula 1 race.

In a network you have to deal with what you call "broken" partners.
They also work at least close to their respective optimum parameter
configuration - or rather, they would like to. Isn't it a sort of
racism to declare that broken?

And in a network you have to deal with the network medium itself. And
no doubt: the bigger the chunks are, the better the network throughput
is.

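The claim about chunk size can be sketched as simple arithmetic in C:
every packet pays a fixed number of bytes on the wire, so larger
payloads amortize that cost better. The function name and the 78-byte
figure are assumptions (standard Ethernet framing plus IP and TCP
headers without options, none of it taken from the measurements
discussed here), and the model ignores minimum-frame padding,
acknowledgements and host processing time.

```c
#include <stdint.h>

/* Assumed fixed cost per packet on Ethernet, in bytes:
 * 8 preamble + 14 header + 4 CRC + 12 interframe gap = 38,
 * plus 20 IP + 20 TCP (no options) = 78. */
#define PER_PACKET_OVERHEAD 78u

/* User-data bytes per second on a link carrying link_rate bytes per
 * second when every packet holds `payload` bytes of user data. */
uint64_t goodput(uint64_t link_rate, uint64_t payload)
{
    return link_rate * payload / (payload + PER_PACKET_OVERHEAD);
}
```

On a 10 Mbit/s link (1250000 bytes/s), goodput(1250000, 64) gives
563380 bytes/s, while goodput(1250000, 1460) gives 1186605 bytes/s.
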
Yes, I know that this is true for the monochromatic case of all equal
packet lengths, but a packet size distribution primarily causes
disadvantages for the individual transfer (a small packet waiting for a
big one to pass).

Now, if you operate a machine whose interface can deliver 100MB but the
fastest channel you have will only gulp 10MB or 50MB (Ether, NSC
Hyper), then from the point of view of writing networking code on such
a fast machine you have an interest in flooding the net with one big
packet and then continuing with the honest work of letting users use
the CPU until the 4MB workstation eventually may accept another packet.
And even under these circumstances the networking software has to be as
fast as possible, to waste as few resources and CPU cycles as possible,
since your CPU does not earn money by servicing 100MB interfaces with a
CPU of some GB bandwidth, nor by pushing around small memory segments,
but by giving CPU cycles to users' number crunching codes. (This is a
hit at the people who think "oh, the link is only xxxkB/s, why bother
writing code that could serve yyMB/s".)

So in my opinion there are more viewpoints to observe than the
record-breaking perspective in a clinical and artificial environment.

disclaimers as usual (sometimes I have a bad conscience because
somebody pays the NW costs for my private opinions)

Michael.