Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!hp-pcd!hpcvlx!lonnie
From: lonnie@hpcvlx.cv.hp.com (Lonnie Mandigo)
Newsgroups: comp.benchmarks
Subject: Re: X benchmarks
Message-ID: <119970002@hpcvlx.cv.hp.com>
Date: 3 Jun 91 19:43:36 GMT
References: <rwtucker.675641231@starbase>
Organization: Hewlett-Packard Co., Corvallis, OR, USA
Lines: 142

> / hpcvlx:comp.benchmarks / jason@cs.utexas.edu (Jason Martin Levitt) /  3:17 pm  Jun  2, 1991 /
> In article <1991May31.151431.9127@Informatik.TU-Muenchen.DE>, roell@informatik.tu-muenchen.de (Thomas Roell) writes:
>

Jason writes...

>    I'll let someone else fight x11perfcompDR vs. xbench. IMHO, neither 
> provides very useful X performance numbers, but neither is "COMPLETLY [sic]
> USELESS" either. There simply is nothing else available in the public 
> domain yet except equally mediocre tests and personal opinions.

I agree with Jason: there really isn't anything very good out there
for measuring X performance.  Those of us who are in the business
of publishing numbers in this area are unfortunately forced to work
with what we've got.

But, rather than cry on your collective shoulders, I offer the 
following comments for your dining pleasure.  Take them for what
they're worth... [This is moderately long, so it's possibly a good 
time to move on to the next note :-)]

Reference diagram...


                    Single Operation Tests (SOT)
                        |           |
           /------------/           \-------\          Frequency
           |                                |          data from real
   Multi-operation Tests . . . . .> Summary of SOT     use (via xscope?)
           |                                |               |
   Pseudo Application/              Weighted Summary of <---/
        Environment                        SOT
           |
   Real Application/
     Environment w/script
           |
       Real Use


The above (nearly impossible to read) diagram describes my method for
categorizing X performance tests.  It's probably not a heck-of-a-lot
different than what might be used for any other kind of performance
testing. 

The raw data produced by most of the tests in the x11perf and xbench
suites falls into the Single Operation Test category.  In other words,
they pick a particular X operation and execute it many times in a
particular X environment and then calculate how long that operation
takes to execute on average.  As has been pointed out earlier, this
is really great for tuning up an X server, but tells an end user
almost nothing about how his application will perform.
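
For anyone who hasn't looked inside one of these suites, the shape of a
Single Operation Test is roughly the sketch below.  It is Python and
purely illustrative: fake_draw_line is a hypothetical stand-in for one
X request, and nothing here is taken from x11perf or xbench.

```python
import time

def single_operation_test(operation, repetitions=10000):
    """Run one operation many times; return the mean seconds per call."""
    start = time.perf_counter()
    for _ in range(repetitions):
        operation()
    elapsed = time.perf_counter() - start
    return elapsed / repetitions

def fake_draw_line():
    # Hypothetical stand-in for a single X request (an assumption, not real X).
    sum(range(50))

mean_seconds = single_operation_test(fake_draw_line, repetitions=2000)
rate = 1.0 / mean_seconds  # operations per second, the usual headline number
```

The result is one number for one operation in one environment, which
is why it says so little about a whole application.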

A few other factors that are important at this level are the
techniques used by the benchmark suite to ensure the quality of the
data.  These include strategies for knowing when an operation has
actually completed (i.e.  did that line really get drawn or was it
sitting in a queue somewhere waiting to get drawn when my Xlib call
returned), and the thoroughness in specifying the test environment (is
the screen saver turned off, etc.).  X11perf is very good here.  I
have been told by other investigators (and have some experience) that
xbench is not as thorough here.  This influenced our decision to focus
on x11perf.
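
The completion problem is easy to show with a toy model.  In the
sketch below FakeServer is a hypothetical stand-in for a display
server with a request queue (my assumption, not real X code); in Xlib
terms the corresponding step would be a call such as XSync after the
batch of requests.  If the clock stops before the sync, you have timed
the queueing and not the drawing.

```python
import time

class FakeServer:
    """Hypothetical display server that queues requests (not real X)."""
    def __init__(self):
        self.queue = []
        self.completed = 0

    def draw_line(self):
        # Returns at once; the request only sits in the queue.
        self.queue.append("line")

    def sync(self):
        # Block until every queued request has actually been carried out.
        while self.queue:
            self.queue.pop()
            self.completed += 1

def timed_run(server, repetitions, wait_for_completion):
    """Time a batch of requests, optionally waiting for them to finish."""
    start = time.perf_counter()
    for _ in range(repetitions):
        server.draw_line()
    if wait_for_completion:
        server.sync()
    return time.perf_counter() - start

unsynced = FakeServer()
timed_run(unsynced, 1000, wait_for_completion=False)

synced = FakeServer()
timed_run(synced, 1000, wait_for_completion=True)
```

After the first run nothing has been drawn at all, yet a time was
still reported; only the second run measured completed work.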

Both x11perfcompDR and xbench follow the right hand path in the above
diagram.  X11perfcompDR stops at the summary level.  Xbench provides a
weighted summary.

X11perfcompDR is modelled after the technique used by Digital Review
Magazine for evaluating X performance.  It makes some effort to inject
reality into its summary (it eliminates all 1-pixel and 500-pixel
tests).  Our experience in using x11perfcompDR is that you can
generally trust the sign of the difference when making a comparison
(if it says that one system is faster than another, it probably is for
most applications).  To a lesser degree you can trust the magnitude of
the difference (if it says that one system is A LOT faster than
another system, it probably is for most applications).  NEVER use the
difference as a multiplier for your particular application; it will 
ALWAYS be wrong (but you can't always be sure which way it will be 
wrong).

Xbench attempts to make the reliability of a comparison somewhat
better by weighting the results of the individual tests.  Sometimes
this can help, but it can also make the problem worse.  Xbench uses
(intuitively derived) weights that are biased towards text.  If your
application doesn't happen to be text intensive (e.g. some CAE
application) or doesn't happen to use X's text facilities (e.g. some
document generation applications) then the numbers provided by xbench
may lead you astray.  (This doesn't imply that the "unweighted"
x11perfcompDR is better.  It is implicitly weighted by the
distribution of different types of tests.)  In general, the same
things can be said about xbench as were said about x11perfcompDR.
Most of the time it's meaningful.  Sometimes it's not.
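
Here is a small sketch of how much the weights matter.  The numbers
are invented, and the weighted geometric mean is my choice for the
illustration; I am not claiming it is the formula either xbench or
x11perfcompDR uses.

```python
import math

def weighted_summary(ratios, weights):
    """Weighted geometric mean of per-test speed ratios (system A / system B)."""
    total = sum(weights[name] for name in ratios)
    log_sum = sum(weights[name] * math.log(ratio)
                  for name, ratio in ratios.items())
    return math.exp(log_sum / total)

# Invented ratios: A is twice as fast on text, half as fast on the rest.
ratios = {"text": 2.0, "lines": 0.5, "fills": 0.5}

# "Unweighted" is really equal weights, set by which tests are in the suite.
equal_weights = {"text": 1.0, "lines": 1.0, "fills": 1.0}
# Hypothetical text-biased weights.
text_heavy_weights = {"text": 0.8, "lines": 0.1, "fills": 0.1}

unweighted = weighted_summary(ratios, equal_weights)
text_heavy = weighted_summary(ratios, text_heavy_weights)
```

With equal weights the summary comes out below 1 (B looks faster);
with the text-biased weights it comes out above 1 (A looks faster).
Same raw data, opposite verdicts.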

A better solution for "right path" performance characterizations would
be to use something like xscope to find out what real applications
really do in a real environment.  From this information you could
(hopefully) identify various classes of applications.  Once the
classes were identified then you could weight the measurements
appropriately and possibly come up with something that is more likely
to be meaningful than what we have now.
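
As a sketch of that idea: suppose a trace of a real session has
already been reduced to a list of request names.  The trace below is
invented and the reduction step is assumed; this is not xscope's
output format.  The weights are then just relative frequencies.

```python
from collections import Counter

def weights_from_trace(request_names):
    """Turn a list of observed request names into relative-frequency weights."""
    counts = Counter(request_names)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

# Invented trace for a hypothetical text-heavy application.
trace = ["PolyText8"] * 6 + ["PolyLine"] * 3 + ["CopyArea"] * 1
weights = weights_from_trace(trace)
```

Counting requests is only a first cut, since it ignores how much work
each request carries, but it at least replaces intuition with data.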

The "left path" offers some advantages over the "right path".  A
multi-operation test contains a short (but realistic) sequence of X
operations which are executed many times to determine how long it takes
to execute that sequence.  This is necessary because the state of the
display server left by a previous operation can affect the performance
of the next operation to be executed.  Xbench contains one test which
addresses this (complex1).  I wish x11perf had some tests like this,
but I don't have time to write them.  These kinds of tests can be 
summarized in a fashion similar to single operation tests (Xbench 
does this).
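
A toy model of the state effect, under a made-up cost rule (1 unit
per request, plus 5 whenever the server has to switch from one kind
of operation to another).  FakeServer and the costs are hypothetical;
the point is only that the sum of two single operation results can
badly underestimate an interleaved sequence.

```python
class FakeServer:
    """Hypothetical server whose cost depends on the state left behind."""
    def __init__(self):
        self.current_mode = None
        self.cost = 0

    def execute(self, operation):
        # Assumed cost rule: 5 extra units when the operation type changes.
        if operation != self.current_mode:
            self.cost += 5
            self.current_mode = operation
        self.cost += 1

def cost_of(sequence, repetitions):
    """Total cost of running a sequence of operations repeatedly."""
    server = FakeServer()
    for _ in range(repetitions):
        for operation in sequence:
            server.execute(operation)
    return server.cost

# Two single operation tests, 100 requests each, added together.
isolated = cost_of(["text"], 100) + cost_of(["line"], 100)
# One multi-operation test issuing the same 200 requests interleaved.
interleaved = cost_of(["text", "line"], 100)
```

Under this rule the isolated tests cost 210 units and the interleaved
sequence costs 1200 for exactly the same requests.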

A Pseudo Application/Environment test is some public domain piece of
code that attempts to simulate at least the X portion of a particular 
kind of real application.  These pseudo applications may also include 
other factors which may impact an application's performance, such as 
disk I/O, intensive computation, or interaction with other simultaneously
executing processes (e.g. a window manager).  I'm not aware of any X
specific tests that fall into this category.  The GPC benchmarks for 
measuring graphics performance might be in this category. (The graphics
may be done through X calls but not necessarily).

A Real Application/Environment with a fixed script is even better than
a Pseudo Application when only the numbers that are generated are
considered.  Unfortunately, since the code is not public domain other
problems creep up.  "Does this application run on the platforms that
I'm interested in comparing?" or "If I want this to be an officially
sanctioned standard am I going to have to pay royalties or require
purchase?" or "Which real application performance numbers should be
published in everybody's data sheet?", etc.

Real Use is, of course, the ultimate benchmark.  A real user gets to
use a real application in a real environment for a reasonable amount
of time so that he can either say "Hey this is great!  We really should
buy 1000 of these!" or "This sucks! Get it out of here."

----------------------------------
Lonnie Mandigo
Hewlett-Packard Co.
Interface Technology Operation
Corvallis, OR.
lonnie@cv.hp.com