Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!tut.cis.ohio-state.edu!unmvax!ncar!tank!mimsy!haven!vrdxhq!daitc!daitc.daitc.mil
From: jkrueger@daitc.daitc.mil (Jonathan Krueger)
Newsgroups: comp.databases
Subject: Re: Single Server Bottlenecks (was RE:ORACLE REL 6, INGRES REL 6)
Message-ID: <523@daitc.daitc.mil>
Date: 17 May 89 00:19:45 GMT
References: <3241@tank.uchicago.edu>
Sender: jkrueger@daitc.daitc.mil
Reply-To: jkrueger@daitc.daitc.mil (Jonathan Krueger)
Distribution: usa
Organization: DTIC Special Projects Office (DTIC-SPO), Alexandria VA
Lines: 203
In-reply-to: cs_bob@gsbacd.uchicago.edu

In article <3241@tank.uchicago.edu>, cs_bob@gsbacd (R. Kohout?) writes:
>My favorite of Mr. Krueger's suggestions:
>
>> If you want to support multiple fully runnable (COMputable)
>> jobs without interference, you need a multiprocessor. Buy one.
>> Configure your servers as you find optimal for best throughput
>> and fair for INGRES versus non-INGRES applications.
>>
>Got that? If you want your Ingres 6.1 performance to equal your Ingres 5.0
>performance, Jonathan Krueger recommends that you buy yourself a
>multiprocessor.

I think your criticisms would have something worthwhile to contribute
if you read what I wrote.  Material quoted above doesn't lead me to
believe that you have.  At no time, in the part you cite or at any
other point, have I said anything about INGRES 6.x performance,
relative to 5.x performance, some absolute standard, or other vendor
product.  In particular, having no data on relative performance, I
have no opinions on it, have not expressed any, and in fact, am not
overly concerned with the topic.  I take it that you are, and you're
unhappy with 6.x performance.  That's fine.  INGRES 6.x performance
issues are appropriate to this group.  But they're distinct from the
multiprocessor issues.
Let me be boringly clear: at no time did I say that you need, want, or
should get a multiprocessor to run 6.x, nor do I believe this, for
performance reasons or anything else.

>I think that you're missing the point. My posting was an attempt to point
>out that single server bottlenecks can and do exist. I'm not attacking Ingres,
>but the fact remains that under VMS, with its retarded scheduling strategy,
>the Ingres 6.1 server can become a true bottleneck.

Of course they can, of course it can.  Again, if you want to be
helpful, now that you've explored the nature of the bottleneck, could
you give us estimates of its extent?  How often, how bad, how
pathological?

>If any 5.0 users want to determine whether or not this could happen to
>them, I suggest the following.
>
>a) when you suspect the database activity is heavy, run MONITOR PROCESS/TOPCPU
>and look for the ING_BACK_* processes. You should see the busiest backends,
>and if you sum the percentage of CPU they're getting, you should get a
>rough estimate of the total CPU being provided the backends.
>
>b) run MONITOR STATE/AVE for a typical working day to get an idea
>of the typical CPU load (i.e. the average number of processes in the COM
>state throughout the day.)
>
>If we call the result of a) CPU_USED and remember that it is a percentage
>strictly between 0 and 1, and if we call the result of b) CPU_LOAD then
>IF CPU_USED > 1 / CPU_LOAD YOU WILL PROBABLY EXPERIENCE A BOTTLENECK
>RUNNING A SINGLE INGRES 6.1 SERVER.

Well, your units aren't comparable, and that provokes some doubt about
the validity of conclusions drawn from the metric.  Specifically, what
you call CPU_USED, or INGRES processor share, has units

        sum of processor time used by INGRES processes
        ----------------------------------------------
                         clock time

over an undefined sampling period (and recall that MONITOR returns
snapshots, not a sum over the time window: the INTERVAL parameter
merely sets sampling resolution, not reporting or updating resolution.
Over short time windows this can add significant error when the
underlying unit is continuous, as in time).  The time units cancel,
yielding a ratio; although not a pure unit, it's a time share, as in
timesharing.

To be boringly clear, this isn't what you said, it's what I assume you
meant.  To "sum the percentages of cpu they're getting" is to arrive
at a meaningless number; I assume you meant instead to average them.
This can be measured and expressed in useful ways, such as deriving it
from cpu time over clock time as shown.

Now, what you call CPU_LOAD has units

        sum of number of processes in COM state
        ---------------------------------------
                   number of samples

over a sampling period, in your example a day.  This is long enough
that the snapshots collected by MONITOR should approach load averages.
Thus we can neglect quantization effects: the underlying unit is
discrete and we collect many samples.  The numbers cancel, yielding a
ratio which is a pure unit, the average number of COM processes over
the time measured.

So the two numbers CPU_USED and CPU_LOAD aren't just different
measurements, they're not in comparable units.  Neither one can be
expressed in terms of the other.  Worse, experience shows that real
measurements of the two are not well related; that is, neither is very
predictive of the other.  Either varies inversely with the other, but
not in a well behaved manner.  They somewhat complement each other for
performance analysis.
But not in the way you suggest:

        CPU_USED > 1 / CPU_LOAD

which may be re-expressed as

        CPU_USED * CPU_LOAD > 1

which, following my notation, means

        INGRES processor share * load average > 1

Substituting in pure units, and specifying a common time window for
data collection, this reads

        processor time used by INGRES   number of COM processes
        ----------------------------- * ----------------------- > 1
              total clock time             number of samples

Take again the simple case of one fully computable INGRES process and
one fully computable non-INGRES process, at equal priorities in a
round-robin scheduler.  By this metric, "you will probably experience
a bottleneck running a single INGRES 6.1 server."  In fact, this makes
sense, you probably will, although "probably" and "bottleneck" have
yet to be expressed quantitatively: how probably and how bad.

Now take the case of the two processes staying about half compute
bound and half i/o bound.  If their i/o and computation overlap well,
they'll never see each other; if they don't, they'll interfere with
each other exactly as much as in the previous case.  But the metric
doesn't distinguish between these two sets of conditions.  Thus it's
easy to show beta error: the metric falsely predicts no problem.

Alpha error is even easier to show: consider the case of ten INGRES
users and two non-INGRES.  Number of COM processes can average 2 or
more with 30% idle time.  With good overlapping this again means that
the next pending INGRES job (process that goes from LEF to COM) will
have processor available.  VMS priority promotion favors that job over
the more recently served and higher computable ones.  Thus no problem,
but anytime INGRES processes get more than half the used processor
time, the metric falsely predicts a problem.

Alpha error points out a deeper problem with this analysis: the metric
complains about bottlenecks hurting INGRES when INGRES processes are
getting the best of things!
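
A minimal sketch of the rule of thumb as restated above, for anyone who
wants to try it on their own numbers.  The function names and the
sample figures are hypothetical, not measurements from any system, and
collecting the inputs from MONITOR is left out entirely.

```python
from statistics import mean


def cpu_used(ingres_cpu_seconds, clock_seconds):
    """INGRES processor share: CPU time used by INGRES processes
    divided by elapsed clock time over the same window."""
    return sum(ingres_cpu_seconds) / clock_seconds


def cpu_load(com_counts):
    """Load average: mean number of COM (computable) processes
    over the snapshots taken in the window."""
    return mean(com_counts)


def metric_flags_bottleneck(used, load):
    """The quoted rule: CPU_USED > 1 / CPU_LOAD,
    rewritten as CPU_USED * CPU_LOAD > 1."""
    return used * load > 1


# Hypothetical inputs: two backends' CPU seconds in a one-hour window,
# and five snapshots of the COM queue length.
used = cpu_used([1200.0, 900.0], 3600.0)
load = cpu_load([2, 3, 2, 1, 2])
print(used, load, metric_flags_bottleneck(used, load))
```

Note that the check sees only two aggregates.  Two workloads with the
same processor share and the same load average get the same verdict,
whether or not their computation and i/o overlap well.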
For an extreme example, consider what happens if we modify priorities
so that INGRES processes always pre-empt the non-INGRES.  Now let
INGRES loads increase to approach 100% use of the processor.  The
other jobs remain COMputable; in fact they'll wait for processor share
indefinitely.  It's trivial to put 10 or more other jobs into the run
queue: they just pile up waiting for processor because they're not
waiting for memory or i/o.  The metric now says

        processor time used by INGRES   number of COM processes
        ----------------------------- * -----------------------
              total clock time             number of samples

              n seconds          ~11 * (interval / seconds)
        = ----------------- * --------------------------
          n + delta seconds      (interval / seconds)

which, for small delta (as INGRES loads increase to 100%),

        = 11 >> 1

The metric predicts bottlenecks, imposed by "single server
architecture", where non-INGRES jobs get an unfair boost over INGRES.
In point of fact it's exactly the other way around: we've set it up so
that the INGRES jobs are beating the stuffing out of the other jobs,
but the metric doesn't know this.

Of course, all metrics have contexts, and we could say this is outside
of the context of usefulness of this metric.  If we went into this
further, we'd probably agree that part of the context is that the
other jobs have to proceed too.  That makes us wonder if the context
isn't getting and giving fair share of shared use of uniprocessors.

So my point is, yes you have a bottleneck, it's just not one of INGRES
competing at a disadvantage with other jobs, as you suggest.  If
you're compute bound on a single shared processor, your bottleneck is
that processor.  There are different methods of attaining higher
performance, but multiple servers isn't one of them.  Greg Pavlov
points out one solution: move INGRES to dedicated resources.  Another
solution is to increase the shared resources, such as more or faster
processors.  Software can't work miracles and create more resources.
All it can do is ration existing resources more fairly, flexibly,
optimally, or closely in accordance with local policies and needs.

>most of Mr. Krueger's posting is a smokescreen. He asks for hard data, then
>himself proceeds with a thouroughly general, theorectical treatment of
>scheduling problems. Most of what he says applies equally to all DBMS
>systems, Ingres 5.0 as well as 6.1.

Thanks, I couldn't ask for a better review.  It was my intention to
discuss a more general set of issues.  Since I'm not making any
specific claims, hard data from specific systems would not be
appropriate.  You, on the other hand, are: do you have any?

>My point is that in this respect,
>Ingres 6.1 can provide inferior performance to Ingres 5.0.

And your point is well taken.  Okay, it can.  Does it?  How often?
How inferior?  If you want to be helpful, please tell us not only the
what and where of bottlenecks but also when and how much.

-- Jon
--