Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!samsung!uunet!convex!newsadm
From: patrick@convex.COM (Patrick F. McGehearty)
Newsgroups: comp.benchmarks
Subject: Re: Which benchmarks are useless?
Keywords: benchmarks date statistical correlation
Message-ID: <1991Apr29.225104.26828@convex.com>
Date: 29 Apr 91 22:51:04 GMT
References: <1749@marlin.NOSC.MIL> <2717@spim.mips.COM> <1751@marlin.NOSC.MIL>
Sender: newsadm@convex.com (news access account)
Reply-To: patrick@convex.COM (Patrick F. McGehearty)
Distribution: comp.benchmarks
Organization: Convex Computer Corporation, Richardson, Tx.
Lines: 53
Nntp-Posting-Host: mozart.convex.com

In article <1751@marlin.NOSC.MIL> aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:
>
>As indicated in the table Dhrystone 1.1 ratio results are greater than the
>Integer SPEC ratio results by 14% to 24% with an average of 21% greater.
...

Of the systems listed, I believe all but the DEC VAX 11/780 came after
Dhrystone started to be a widely quoted benchmark. Because it is a fairly
small code, it is easy to use as a tool for focusing compiler tuning efforts.
[After all, if you only have limited resources for tuning, you might as well
use them where they can be shown to make a difference] I suspect that if DEC
were to spend a few months tuning their C compiler (or maybe just rerun their
latest release on an 11/780), they also could get another 20% out of the
Dhrystone benchmarks.

I also would not be surprised to see the SPEC numbers improve in the next
few years for existing hardware with better compilers. What gets measured
gets worked on.

The advantage of benchmark suites like SPEC (and for you number crunchers,
the Perfect Club and Slalom benchmarks) is that there is such a variety of
coding styles and usages that the improvements are likely to benefit many
real codes.
In some cases, special tricks will be found that only benefit those codes,
but for the most part, improvements will be made that help many programs
run faster.

Back to the issue that started this discussion:
When you propose a new benchmark, consider how vendors will respond if
people start using it for serious competitive evaluation. [If you don't
want people to use it, why are you proposing it??] Will it encourage the
vendors to improve the things you want improved? If not, can it be changed
to do so? Show it to a few people (with DRAFT, DO NOT DUPLICATE marked all
over it), and get their feedback. Then ask yourself again if it is useful.

The 'date' benchmark has a number of serious flaws. A key one is that if
the date operation were added to the command shell, it would go many times
faster. Since the intent is to measure process spawning time, ... well, you
get the point.

A similar thing happened to the getpid system call. Some people at Berkeley
wanted to know how fast a trivial system call was, so they could tune the
syscall interface. They wrote a loop to call getpid() many times. This test
was appropriate for their purposes. Later, this test (and many others) was
made generally available. Some vendors chose to speed up this test by
caching the process id in user space on the first getpid(), and avoiding
the system call overhead for subsequent getpid() calls. There is nothing
wrong with that optimization, except that it does nothing for real user
programs.

The main reason I don't like the date benchmark is that it encourages me
(as a vendor) to fix the wrong things. In addition, as a user, the benchmark
does me little good, because I have little confidence that it will measure
the same things I care about (at least it won't after the vendors start
working on it, if they take it seriously). The same reasoning applies to the
'bc' benchmark which ran through this news stream a while ago.
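
To make the getpid story concrete, here is a rough sketch of that kind of loop test, written in Python rather than whatever the Berkeley people actually used. The names CountingGetpid, CachingGetpid and run_benchmark are made up for illustration, and the caching class is only my guess at the shape of the vendor trick, not anyone's real library code. It assumes os.getpid() reaches the kernel on every call, which depends on your C library.

```python
import os
import time


class CountingGetpid:
    """Uncached stand-in: every call goes through to os.getpid()."""

    def __init__(self):
        self.real_calls = 0

    def __call__(self):
        self.real_calls += 1
        return os.getpid()


class CachingGetpid:
    """Hypothetical vendor-style stub: only the first call reaches os.getpid()."""

    def __init__(self):
        self.real_calls = 0
        self._pid = None

    def __call__(self):
        if self._pid is None:
            self.real_calls += 1
            self._pid = os.getpid()
        return self._pid


def run_benchmark(getpid, iterations=100_000):
    """Call getpid() `iterations` times; return (elapsed seconds, last pid)."""
    start = time.perf_counter()
    pid = None
    for _ in range(iterations):
        pid = getpid()
    return time.perf_counter() - start, pid


plain, cached = CountingGetpid(), CachingGetpid()
plain_secs, plain_pid = run_benchmark(plain)
cached_secs, cached_pid = run_benchmark(cached)

print(f"uncached: {plain.real_calls} real calls in {plain_secs:.4f}s")
print(f"cached:   {cached.real_calls} real call(s) in {cached_secs:.4f}s")
```

Both versions return the same process id, but the cached one reaches os.getpid() once in 100,000 iterations, so whatever its loop time is, it no longer says anything about system call overhead. The timings themselves will vary by machine, and in Python the interpreter overhead may well hide most of the difference.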