Path: utzoo!mnetor!tmsoft!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!mit-eddie!uw-beaver!cornell!ken
From: ken@gvax.cs.cornell.edu (Ken Birman)
Newsgroups: comp.sys.isis
Subject: Why is ISIS slower than RPC? (Very long, even for me)
Message-ID: <52120@cornell.UUCP>
Date: 16 Feb 91 01:41:59 GMT
Sender: nobody@cornell.UUCP
Reply-To: ken@cs.cornell.edu (Ken Birman)
Distribution: comp
Organization: Cornell Univ. CS Dept, Ithaca NY
Lines: 312

We get asked about performance a lot, and today two different groups asked me to explain why ISIS is slower than SUN RPC, both in round-trip latency and in CPU costs. This is a very long message intended for people who, in some cases, are investing a lot of their own time in ISIS... it will not give the whole picture, but it should give you a sense of what ISIS currently does and what it costs... and why we can do much better when we rebuild the system over a microkernel.

If you measure ISIS performance, you will find that figures vary widely depending on exactly how you measure it. We have some nice graphs in the revised "fast causal multicast" paper, but let me summarize the picture here.

First, ISIS runs in bypass or non-bypass mode, and in the latter it is very slow indeed. This is discussed in the manual. Basically, messages get sent via a protocols server and each hop costs something.

In bypass mode, within a single group, the picture is more subtle. If you measure the system, you should see an ISIS RPC time of something like 7.6ms round-trip for SUN4-SUN4 on an ethernet, with the sort of degradation one would expect for packet size changes or processor changes. ISIS streaming will be around 650 null messages per second or about 600kbytes/second, which is a little better than TCP. This is for small packets.

Actually, some users may see closer to 9ms, because the very first copies of ISIS V2.1 had a problem whereby extra acks went out in some situations.
These cost a fortune and we eventually found a way to suppress most of them, but the fix was a little too messy to post. It is in V3.0, of course, and has been in V2.1 since the October distribution.

ISIS will also be especially poor in local (within one machine) performance, and somewhat closer to SUN RPC in inter-machine figures. ISIS will do better than SUN RPC for byte swapped communications, i.e. SUN to a MIPS system, and for large messages. You may also find that ISIS is spending most of the extra time in system calls and in its memory management and tasking layer.

Much worse performance than this indicates a bug in ISIS or your tests. For example, the Los Alamos thing with acks was slowing a test of theirs down from what should have been about 50 multicasts per second to 1 per second. Obviously, this stood out, but if you were to hit this just on one multicast now and then....

Basically, the picture is this. If you don't need what ISIS is doing, the technology is a clear performance loser relative to SUN RPC, but if you do need it, it is much FASTER than an equivalent layer built over SUN RPC. (Discussion below.)

The key question on ISIS today, as opposed to two years out, is whether your primary need is for fault-tolerance, consistency, and distributed control -- or if it is basically for a fast RPC. Let me summarize the main points I usually make on this.

First, performance is a relative thing. The rebuild of ISIS on Mach will probably be much faster than SUN RPC can possibly be over UNIX, because ISIS protocols add only about a 2% overhead to "transport" costs, and UNIX severely penalizes application-level transport protocols. By moving ISIS into the Mach communications kernel and hence close to the wire, we can get down to a 1.5-2ms RPC between machines like the SUN4, where we currently are sitting over a 2.5ms UNIX system call stack just to get data on the wire or into our application.

SUN-RPC also has some "unfair" advantages.
One is that it is optimized for the case of a single channel between two processes and for the case of processes on the same machine. And, it doesn't have any reliability guarantees when the endpoints fail, which ISIS does. ISIS is not optimized within a machine -- it uses UDP, which pings to the wire and back due to a poor UNIX implementation. ISIS, on the other hand, IS optimized for the case of multiple destination multicasts, where it begins to do slightly better than RPC. Thus, if you do 3 or 4 RPC's sequentially and compare with an ISIS multicast using a cheap protocol (cbcast or fbcast in BYPASS mode), we come pretty close. We still lose, but not by much.

Second, you need to keep in mind that ISIS is sending big packets due to its flexible message format and its protocol structure. This will be reduced in the reimplementation, but figure on a 440 byte overhead for each packet just because it was sent via ISIS and not directly. For small packets this makes a huge difference; thus, a null SUN RPC versus a null ISIS RPC compares something like 32 bytes with something like 480! Naturally, we suffer in this comparison.

Third, SUN RPC has some advantages in the way it deals with memory management and system calls. ISIS tends to need to do a lot of system calls. With 2 channels open (one to protos, one to the interclient layer), we need a select at least once per message. We also need to call gettimeofday because our protocols are timer based, and we need to call recvfrom/sendto to move the data. UNIX system calls are expensive and this adds up. Further, ISIS is task based, and needs to create the tasks (warm started, but still expensive). SUN RPC over SUN LWP is actually slower than ISIS if you do all these system calls too, but perhaps that is a cheat -- since you normally wouldn't use SUN RPC with two channels, checking the time constantly, and over LWP.

Fourth (probably should be zeroth) ISIS is normally used asynchronously.
So, you don't often see RPC performance as a limiting factor. RPC measures the latency between nodes, but with asynchronous use you get a producer-consumer pipeline effect, and if both sides stay busy, the round-trip time is much less of an issue than in RPC, which is not normally used this way. Of course, if you use ISIS as an RPC system, this point doesn't hold -- but if you use it as recommended, say for managing replicated data -- you almost never wait for replies from remote processes.

This last point is the one that matters in, say, a brokerage trading floor. If you are distributing quotes, a latency of a few milliseconds between the sender and the destinations is a small cost to pay, and fault-tolerance may be a very big win. The mental picture is like a UNIX pipe... Many applications act this way, and round-trip latency just isn't the major factor for them.

So, the picture is that one really shouldn't think of ISIS V3.0 as a blazingly fast data transport system, although it does fairly well considering the powerful group abstraction it supports. Rather, we recommend that people think of ISIS as a "control" mechanism that can be combined with other facilities (even SUN RPC, as in Deceit!) where less semantics are needed and performance is the key thing. The people who are happiest with ISIS generally run applications that communicate infrequently, perhaps a few times per second, and for this, performance is hardly the issue.

Now, to back my first point (if you build over RPC it would cost more than you expect), say that you need to send a message reliably to n destinations, i.e. with all or nothing behavior. If you plan to garbage collect the data, it isn't hard to see that you need a 3-phase protocol:

          sender                      typical dest
Phase 1:  1. send m
                                      2. receive and deliver m, ack
          3. collect acks
Phase 2:  4. send "done"
                                      5. note "done", ack
          6. collect acks
Phase 3:  7. send "finished"
                                      8. forget interaction, ack
          9. forget interaction

The reason for this is that you need to be able to handle failures. In this protocol, if the sender dies, a typical destination can take over the protocol from the stage the sender was in when the destination last heard from it. I.e., if a destination got a message "m", it can take over as the sender from phase 1 if the sender dies before reaching phase 2. Of course, this could lead to a destination getting duplicates, so destinations need to remember the interaction until it is certain that all have the packet. This is why you need phases 2 and 3...

ANY fault-tolerant protocol has a structure like this. The ISIS protocol does too -- but the second and third phases are overlaid on subsequent first-phase communication and so you just don't see them!

In fact, to do what ISIS is doing, you really need to do something like 3*n RPC's to send one message to n destinations, unordered. Obviously, ISIS is performing better than this. And, when I say need, I mean that if your application wants to be fault-tolerant, it MUST do the 3*n RPC's! This is because you need enough information after a failure to still finish the protocol and suppress duplicates, and you need to eventually forget the interaction or the state piles up.

This raises a real question of whether one should actually compare ISIS performance to one SUN RPC... or to 3 sequential ones.... For one destination, this will seem to be a crazy statement, but for 4 destinations, the answer is that 1 ISIS cbcast does work comparable to what you would need something like 12 RPC's to do!

I have some data on ISIS performance, and Pat Stephenson has more. You will see that little of the time is spent in anything having to do with protocols or cbcast or anything (well, abcast is a little worse). The performance figures are dominated by the tasking subsystem and message library (about 8% of total costs) and by the UNIX system calls and transport of the data (about 85%).
The remainder (about 2%) is spent in our protocol layer. Some nice histograms and graphs of multicast performance as a function of number of destinations, message size, etc. will be in the revised "fast causal multicast" TR when we re-release it (~ next week). We can provide that data if people would like to see it. It will also be in the V3.0 manual.

All this argues that our rebuild of ISIS should be able to do much better, but also that there isn't much hope for a much faster version of ISIS over UNIX. I can't reduce the number of system calls, and ISIS tasks are two to three times faster than SUN LWP for most operations. This leaves the message library, which probably isn't as fast as it should be, and the ISIS windowing communication protocol, which is fast enough but does not use a very efficient message representation and hence is putting these big headers on things (actually, 220 bytes per message, but it always puts the user's message inside a transport message, for 440 bytes minimum). But, fixing this would only cut that 7.6ms figure to about 7.4ms...

SO... I would tell people not to plan on building directly over ISIS anything that needs to run super-fast and isn't actually in need of ISIS semantics -- control it with ISIS, perhaps, but use something cheaper for the speed critical paths.

But, I would also point out that the bet changes if you look at ISIS in the future, over Mach or Chorus, because we should actually be able to eliminate almost all the idiocies of the current UNIX platform in the context of the Mach network message server or Chorus kernel. Pure RPC may still be faster, but not by much.

What follows are our last performance measurements on the SUN4/bypass code. They do reflect a bug fix you might not have, because it was made at the last minute. So, the earliest V2.1 copies might be a little slower and will be seen to be sending too many ack packets if you look at a client dump. V3.0 should be comparable.
ISIS V2.1 cost figures (but with extra-ack problem fixed).

These figures are for BYPASS communication only. Non-BYPASS is about 3-5 times slower for most things, 10 times slower for some especially unfortunate cases.

Operation                                  Cost
Task create-destroy (null proc.)           170us     (= t_fork_urgent, null proc.)
Task switch                                103us     (= t_wait + t_sig + task_swtch)
Null message create/destroy                24us
Same but use msg_gen("")                   29us      (i.e. 5us to scan null format)
Same but use msg_gen("%d")                 +14us
Same but use msg_gen("%C", &x, 0)          +12.1us
Same but use msg_gen("%C", &x, 8k)         +1130us
Same but use msg_gen("%*C", &x, 8k)        +28us

RPC to self:
cbcast for request, reply (both null)      1430us    (1.571ms before "bug fix")
cbcast but inhibit "rcv"                   406us     (hacked ISIS for this)
... est. for rcv, cbcast reply             1024us
... est. for rcv a cbcast, no reply        309us     (i.e. 1430 = 2*406 + 2*309)
fbcast for both                            1195us
fbcast but discard message on "rcv"        320us
... est. for rcv, fbcast reply             795us
... est. for rcv a fbcast, no reply        239us     (i.e. 1195 = 2*320 + 2*239)

From these figures it will be unclear why sending a cbcast is so costly, since one would think we are just generating a message, putting some fields in it, and firing it off... actually, this is true, but the message ends up having a lot of fields in it (at 25-50us each) and we do a task switch (at 103us) and we need to run through the "BCAST" routine.
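
As a quick arithmetic check on the RPC-to-self estimates, the sketch below recomputes each decomposition from the posted figures. The model (one round trip = two sends plus two receives) is the one implied by the "i.e." notes in the table; the names are illustrative and not part of ISIS.

```python
# Posted figures in microseconds (ISIS V2.1, BYPASS, RPC to self).
figures = {
    "cbcast": {"round_trip": 1430, "send": 406, "rcv": 309},
    "fbcast": {"round_trip": 1195, "send": 320, "rcv": 239},
}

def modelled_round_trip(send_us, rcv_us):
    """Request plus reply: two sends and two receives."""
    return 2 * send_us + 2 * rcv_us

for name, f in figures.items():
    model = modelled_round_trip(f["send"], f["rcv"])
    diff = f["round_trip"] - model
    print(f"{name}: measured {f['round_trip']}us, model {model}us, "
          f"difference {diff}us")
```

The cbcast line reproduces exactly. The fbcast figures as posted give 1118us rather than 1195us, so one of the fbcast numbers may have been mistyped or rounded; the post does not say which.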
A profile shows that the time is spread fairly uniformly:

Top ten routines              What they do
bypass_send        95us       Puts VT stuff in message
isis_runtasks      88us       Picks task to switch to
task_swtch         65us       Pure switch
BCAST              61us       generates msg, calls bypass_send
invoke             57us       Part of new-task create
mallocate          48us       Memory allocation
qu_add             46us       queue node allocate and append
msg_deallocate     43us       action routine for msg_delete()
do_bcast           35us       parses cbcast_l options, calls BCAST()
msg_insertfield    35us       action routine for msg_put()
Total for 10       612us

I used the same approach to measure the costs associated with the ISIS layer that does reliable interclient communication ("TCP over UDP", more or less). This is the layer BELOW the bypass send and ABOVE the various UNIX system calls. It looks like this:

Sender                                 Recv
gen packet
net_send
build intercl_packet
insert header
gettimeofday
UDP xmit (sendto) ---------------->    select
                                       gettimeofday
                                       recvfrom
                                       msg_reconstruct
                                       net_rcv
                                       unpack messages
                                       deliver

For this test the sender and destination were on the same Sparc 1.

Size (Xmitted)    Cost       Overhead: gen packet    Overhead: UNIX system calls
0   (+200)        2884 us    148 us                  1264 us
512 (+200)        3245 us    148 us                  1363 us
1k  (+200)        3445 us    148 us                  1590 us
2k  (+200)        4111 us    148 us                  2055 us
4k  (+200)        5276 us    148 us                  2984 us

... conclusions: unclear, but we could fill in the "loop" slide a bit more accurately from this.

Here is a picture of a typical 1-way CBCAST from a sender to the destination, split (estimated) to show where time is spent.

1K IPC (two processes, same host)

                  send                   rcv
cbcast "gen"      191us = 35+61+95       24us
task subsys       206us = 103*2          500us = 170+103*2+misc.
-----------------------------------------------------------------
intercl code      1855 us
-----------------------------------------------------------------
UNIX sys calls    1590 us = 3*gettimeofday + sendto + select + recvfrom
-----------------------------------------------------------------
Actual IO         kernel bcopy, hidden within "UNIX sys costs"

I only have one machine at home so I didn't run this for a remote destination.
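
The "intercl code" figure in the 1K picture appears to be the total cost minus the UNIX system call overhead for the 1k row of the size table (3445 - 1590 = 1855). Here is a minimal sketch that assumes the same subtraction applies to every row; whether the 148us "gen packet" overhead should also be subtracted is not stated in the post, so it is left in. The names are illustrative only.

```python
# (payload bytes, total cost us, UNIX system call overhead us), from the table
rows = [
    (0,    2884, 1264),
    (512,  3245, 1363),
    (1024, 3445, 1590),
    (2048, 4111, 2055),
    (4096, 5276, 2984),
]

def non_syscall_cost(total_us, syscall_us):
    """Time left after subtracting the UNIX system call overhead."""
    return total_us - syscall_us

for size, total, sys_us in rows:
    rest = non_syscall_cost(total, sys_us)
    share = 100.0 * sys_us / total
    print(f"{size:5d} bytes: non-syscall {rest}us, "
          f"syscalls {share:.0f}% of total")
```

Under this assumption the system call share grows with message size, from a bit under half of the cost for a null message to a bit over half at 4k, while the remainder grows much more slowly.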
The various test programs are in my gvax account:

itest.c     -- intercl test
udp.c       -- system call costs
looptest.c  -- gets most of the "table" of numbers
profile.c   -- for profiled test

... all the above is for a Sparc 1 with isis compiled -O3 -DBYPASS (not a Sparc1+).
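
Returning to the 3-phase argument made earlier in this post: the 3*n count can be sketched by walking the phases and counting one RPC (a message plus its ack) per destination per phase. This is only an illustration of the counting argument, assuming no failures and no piggybacking; the function name is hypothetical and nothing here is ISIS code.

```python
def three_phase_rpcs(n_dests):
    """Count RPCs (message + ack pairs) for one reliable send to n_dests."""
    phases = ["m", "done", "finished"]
    rpcs = 0
    for _phase in phases:
        for _dest in range(n_dests):
            # The sender sends this phase's message to one destination
            # and collects its ack: one RPC.
            rpcs += 1
    return rpcs

for n in (1, 2, 4, 8):
    print(f"{n} destinations: {three_phase_rpcs(n)} RPCs")
```

For 4 destinations this gives 12, which matches the "something like 12 RPC's" comparison for a single cbcast.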