Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!hp-pcd!hpcvlx!lonnie
From: lonnie@hpcvlx.cv.hp.com (Lonnie Mandigo)
Newsgroups: comp.benchmarks
Subject: Re: X benchmarks
Message-ID: <119970002@hpcvlx.cv.hp.com>
Date: 3 Jun 91 19:43:36 GMT
References:
Organization: Hewlett-Packard Co., Corvallis, OR, USA
Lines: 142

> / hpcvlx:comp.benchmarks / jason@cs.utexas.edu (Jason Martin Levitt) / 3:17 pm Jun 2, 1991 /
> In article <1991May31.151431.9127@Informatik.TU-Muenchen.DE>, roell@informatik.tu-muenchen.de (Thomas Roell) writes:
> Jason writes...
> I'll let someone else fight x11perfcompDR vs. xbench. IMHO, neither
> provides very useful X performance numbers, but neither is "COMPLETLY [sic]
> USELESS" either. There simply is nothing else available in the public
> domain yet except equally mediocre tests and personal opinions.

I agree with Jason: there really isn't anything very good out there for
measuring X performance. Those of us who are in the business of
publishing numbers in this area are unfortunately forced to work with
what we've got. But, rather than cry on your collective shoulders, I
offer the following comments for your dining pleasure. Take them for
what they're worth...

[This is moderately long, so it's possibly a good time to move on to the
next note :-)]

Reference diagram...

                Single Operation Tests (SOT)
                            |
                            |
              /------------/ \-------\              Frequency
              |                      |              data from real
   Multi-operation Tests . . . . .> Summary of SOT  use (via xscope?)
              |                      |                    |
   Pseudo Application/       Weighted Summary of <--------/
   Environment               SOT
              |
   Real Application/
   Environment w/script
              |
           Real Use

The above (nearly impossible to read) diagram describes my method for
categorizing X performance tests. It's probably not a heck-of-a-lot
different than what might be used for any other kind of performance
testing.

The raw data produced by most of the tests in the x11perf and xbench
suites falls into the Single Operation Test category.
In other words, they pick a particular X operation, execute it many
times in a particular X environment, and then calculate how long that
operation takes to execute on average. As has been pointed out earlier,
this is really great for tuning up an X server, but tells an end user
almost nothing about how his application will perform.

A few other factors that are important at this level are the techniques
used by the benchmark suite to ensure the quality of the data. These
include strategies for knowing when an operation has actually completed
(i.e. did that line really get drawn, or was it sitting in a queue
somewhere waiting to get drawn when my Xlib call returned), and the
thoroughness in specifying the test environment (is the screen saver
turned off, etc.). X11perf is very good here. I have been told by other
investigators (and have some experience) that xbench is not as thorough
here. This influenced our decision to focus on x11perf.

Both x11perfcompDR and xbench follow the right-hand path in the above
diagram. X11perfcompDR stops at the summary level. Xbench provides a
weighted summary.

X11perfcompDR is modelled after the technique used by Digital Review
Magazine for evaluating X performance. It makes some effort to inject
reality into its summary (it eliminates all 1-pixel and 500-pixel
tests). Our experience in using x11perfcompDR is that you can generally
trust the sign of the difference when making a comparison (if it says
that one system is faster than another, it probably is for most
applications). To a lesser degree you can trust the magnitude of the
difference (if it says that one system is A LOT faster than another
system, it probably is for most applications). NEVER use the difference
as a multiplier for your particular application; it will ALWAYS be wrong
(but you can't always be sure which way it will be wrong).

Xbench attempts to make the reliability of a comparison somewhat better
by weighting the results of the individual tests.
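
[To make the difference between an unweighted and a weighted summary
concrete, here is a minimal sketch in Python. It is NOT the formula that
xbench or x11perfcompDR actually use. The function name `summarize`, the
choice of a weighted geometric mean of per-test ratios, the rates, and
the weights are all assumptions invented for illustration.]

```python
import math

def summarize(rates_a, rates_b, weights=None):
    """Weighted geometric mean of per-test speed ratios (A relative to B).

    rates_a / rates_b map a test name to operations per second.
    weights maps a test name to a relative importance; None means
    every test counts equally. This is a hypothetical summary method,
    not the one used by any real X benchmark suite.
    """
    tests = sorted(rates_a.keys() & rates_b.keys())
    if not tests:
        raise ValueError("no tests in common")
    if weights is None:
        weights = {t: 1.0 for t in tests}
    total = sum(weights[t] for t in tests)
    log_sum = sum(weights[t] * math.log(rates_a[t] / rates_b[t])
                  for t in tests)
    return math.exp(log_sum / total)

# Made-up single operation test results (operations per second).
system_a = {"text": 200.0, "lines": 100.0, "fills": 100.0}
system_b = {"text": 100.0, "lines": 100.0, "fills": 200.0}

# Made-up weights biased towards text.
text_heavy = {"text": 8.0, "lines": 1.0, "fills": 1.0}

unweighted = summarize(system_a, system_b)
weighted = summarize(system_a, system_b, text_heavy)
print(f"unweighted: {unweighted:.2f}  text-weighted: {weighted:.2f}")
```

[With these invented numbers the unweighted summary calls the two
systems a tie, while the text-biased weights put system A ahead by a
factor of roughly 1.6. An application that mostly does fills would see
the opposite.]
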
Sometimes this can help, but it can also make the problem worse. Xbench
uses (intuitively derived) weights that are biased towards text. If your
application doesn't happen to be text intensive (e.g. some CAE
application) or doesn't happen to use X's text facilities (e.g. some
document generation applications), then the numbers provided by xbench
may lead you astray. (This doesn't imply that the "unweighted"
x11perfcompDR is better. It is implicitly weighted by the distribution
of different types of tests.) In general, the same things can be said
about xbench as were said about x11perfcompDR. Most of the time it's
meaningful. Sometimes it's not.

A better solution for "right path" performance characterizations would
be to use something like xscope to find out what real applications
really do in a real environment. From this information you could
(hopefully) identify various classes of applications. Once the classes
were identified, you could weight the measurements appropriately and
possibly come up with something that is more likely to be meaningful
than what we have now.

The "left path" offers some advantages over the "right path". A
multi-operation test contains a short (but realistic) sequence of X
operations which are executed many times to determine how long it takes
to execute that sequence. This is necessary because the state of the
display server left by a previous operation can affect the performance
of the next operation to be executed. Xbench contains one test which
addresses this (complex1). I wish x11perf had some tests like this, but
I don't have time to write them. These kinds of tests can be summarized
in a fashion similar to single operation tests (Xbench does this).

A Pseudo Application/Environment test is some public domain piece of
code that attempts to simulate at least the X portion of a particular
kind of real application.
These pseudo applications may also include other factors which may
impact an application's performance, such as disk I/O, intensive
computation, or interaction with other simultaneously executing
processes (e.g. a window manager). I'm not aware of any X-specific tests
that fall into this category. The GPC benchmarks for measuring graphics
performance might be in this category. (The graphics may be done through
X calls, but not necessarily.)

A Real Application/Environment with a fixed script is even better than
a Pseudo Application when only the numbers that are generated are
considered. Unfortunately, since the code is not public domain, other
problems creep up. "Does this application run on the platforms that I'm
interested in comparing?" or "If I want this to be an officially
sanctioned standard, am I going to have to pay royalties or require
purchase?" or "Which real application performance numbers should be
published in everybody's data sheet?", etc.

Real Use is, of course, the ultimate benchmark. A real user gets to use
a real application in a real environment for a reasonable amount of time
so that he can either say "Hey this is great! We really should buy 1000
of these!" or "This sucks! Get it out of here."

----------------------------------
Lonnie Mandigo
Hewlett-Packard Co.
Interface Technology Operation
Corvallis, OR.
lonnie@cv.hp.com