Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!ucsd!rutgers!maverick.ksu.ksu.edu!ux1.cso.uiuc.edu!ux1.cso.uiuc.edu!aglew
From: aglew@dwarfs.crhc.uiuc.edu (Andy Glew)
Newsgroups: comp.arch
Subject: Re: Computer time measurements (Was Re: 64 bits for times....)
Message-ID: <AGLEW.90Aug22125018@dwarfs.crhc.uiuc.edu>
Date: 22 Aug 90 16:50:18 GMT
References: <26012@bellcore.bellcore.com> <11187@alice.UUCP>
	<AGLEW.90Aug21220304@dwarfs.crhc.uiuc.edu>
	<1990Aug22.044826.18572@portia.Stanford.EDU>
	<MOSS.90Aug22084234@ibis.cs.umass.edu>
Sender: usenet@ux1.cso.uiuc.edu (News)
Organization: University of Illinois, Computer Systems Group
Lines: 93
In-Reply-To: moss@cs.umass.edu's message of 22 Aug 90 12:42:33 GMT


    I do software performance measurement and would *like* resolution down to the
    clock rate of the machine. Personally, I generally want to include the time
    taken by pipeline stalls, cache misses, etc., since that is relevant to the
    user. I guess what I would really like is elapsed (wall clock) time, cpu time
    for the process (split into user and system time), and possibly counters of
    other things (instructions executed, memory cycles (maybe split into
    reads/writes), cache hits/misses, page translation hits/misses, etc.). I don't
    think any of this is necessarily *hard* to do, but it does take chip real
    estate. The counters should be readable with ordinary instructions, but maybe
    settable only with special ones (though if kept on a per process basis, a
    process can only screw up itself).


Here's a fairly coherent schema for timers, merged from the best features
of several machines:
    Provide one accurate real-time timer.
    Provide an offset register settable by the OS.
    Provide an instruction that atomically reads the real-time timer.
    Provide an instruction that atomically reads the sum of the
        real-time timer and the offset.
        This gives you virtual CPU time.
Ideally, you would be able to read both real and virtual time
atomically (and mechanisms like the i860 LOCK operation let you do
that), but you can finesse it as follows:
    Read Real
    Read Virtual
If you get interrupted between the real and virtual readings, your OS
will account for it.
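
A minimal sketch of the timer-plus-offset idea, in C. This models the hardware as a struct; the names (timer_hw, read_real, read_virtual, os_switch_in, os_switch_out) are hypothetical, and the context-switch policy shown is only one assumed way an OS might maintain the offset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the hardware: one free-running counter plus an
   offset register that only the OS writes. */
typedef struct {
    uint64_t real;    /* free-running real-time counter */
    uint64_t offset;  /* OS-settable offset, used modulo 2^64 */
} timer_hw;

/* The two read operations from the schema above. */
static uint64_t read_real(const timer_hw *t)    { return t->real; }
static uint64_t read_virtual(const timer_hw *t) { return t->real + t->offset; }

/* Assumed OS policy: when a process is switched out, remember its
   virtual time so it can be resumed later. */
static uint64_t os_switch_out(const timer_hw *t) {
    return read_virtual(t);
}

/* Assumed OS policy: when a process is switched in, choose the offset so
   that real + offset continues from the saved virtual time. */
static void os_switch_in(timer_hw *t, uint64_t saved_virtual) {
    t->offset = saved_virtual - t->real;   /* wraps modulo 2^64 */
    assert(read_virtual(t) == saved_virtual);
}
```

With this policy the virtual reading does not advance while the process is switched out, which is why a "Read Real" followed by a "Read Virtual" stays consistent from the process's point of view.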

The Gould NP1 provided a machine cycle resolution timer with an offset
register, but only one read operation.  So, the first version of the
OS had an offset, to read virtual time. The second version of the OS
always had the offset as 0, to read real time.  I think that eventually
it was configured on a per process basis. But everyone wanted both.

A note: the covert channel security guys will want you to provide a
mask to remove the low order bits. They want to prevent high precision
timings being made.
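
A sketch of such a mask in C, assuming the value returned to the user is the raw tick count with some number of low-order bits cleared. The function name coarsen and the drop_bits parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Clear the low drop_bits bits of a tick count, so that the caller sees
   time only at a granularity of 2^drop_bits ticks. */
static uint64_t coarsen(uint64_t ticks, unsigned drop_bits) {
    assert(drop_bits < 64);   /* shifting by 64 or more is undefined in C */
    return ticks & ~((UINT64_C(1) << drop_bits) - 1);
}
```

For instance, coarsen(ticks, 8) reports time in units of 256 ticks; a drop_bits of 0 leaves the value untouched.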


Hardware costs:
    Yes, but there are a lot of things that you can do in software to
reduce the hardware costs.
    As I was trying to say in my first post, I don't want all the hardware
that would be necessary to give me an accurate ns timer - i.e. I don't
want the portable interface in hardware.  Give it to me raw.
    The carry chain for fast ticking can be simplified - let me read
the timer in carry-save format! (If you can do wide reads this is fine.
If you cannot read twice the width of the timer, as carry-save format
requires, then tricks like those below can be applied.)
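
To illustrate what reading a timer in carry-save format could look like, here is a small software model in C. It is a sketch under assumptions: the counter is kept as two 32-bit vectors whose sum is the count, each tick uses only per-bit logic (no carry propagation), and software does the final add after reading both words. The names cs_timer, cs_tick and cs_resolve are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Carry-save counter: the true count is sum + carry (mod 2^32). */
typedef struct {
    uint32_t sum;
    uint32_t carry;
} cs_timer;

/* One tick: a row of full adders combines sum, carry and the constant 1.
   Every bit is computed independently, so there is no carry chain. */
static void cs_tick(cs_timer *t) {
    uint32_t s = t->sum, c = t->carry, x = 1u;
    t->sum   = s ^ c ^ x;                            /* per-bit sum      */
    t->carry = ((s & c) | (s & x) | (c & x)) << 1;   /* per-bit majority */
}

/* Software does the carry-propagating add that the hardware skipped. */
static uint32_t cs_resolve(const cs_timer *t) {
    assert((t->carry & 1u) == 0);   /* carry vector is always shifted left */
    return t->sum + t->carry;
}
```

The cost moves from the hardware to the reader: two words must be read (ideally atomically) and added, which is the "twice the width of the timer" mentioned above.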


    The most important items for general use are elapsed and cpu time, with
    resolution down to the machine clock cycle time. Except on machines that
    stretch clocks (as opposed to inserting "wait states"), this is not
    technologically difficult, though the number of bits required may necessitate
    an atomic operation to read the counter being sampled into a special read out
    register, that can then be examined at leisure (and similarly for setting).

Ideally, on a 64 bit machine, we will be able to read 64 bit timers
atomically.  (And damn the board designer who puts a timer across an 8
bit interface, so that you have to stop it to read it).

Although, on systems that cannot atomically read the entire timer, I've had
good luck with timestamps formed as follows:

    Read HIGH-PART -> timestamp.high1
    Read LOW-PART -> timestamp.low
    Read HIGH-PART -> timestamp.high2

If you can guarantee that there is no process interrupt between these
operations (in the kernel that's easy), then postprocessing can
compare high1 and high2.  If same, no problem.  If different, they
usually only differ by one, and with assumptions about how quickly
rollover can occur you can figure out what the true time is.  Of
course, the more of this stuff you have to do, the more LSBs you have
to throw out.
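
The postprocessing step might be sketched in C as follows. The assumptions here are mine: the timer is split into two 32-bit halves, the three reads complete in far less time than the low half takes to roll over, and so when the high parts differ the low word is either just below rollover or just past it. The names raw_stamp and resolve_stamp are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The three words recorded at measurement time. */
typedef struct {
    uint32_t high1;
    uint32_t low;
    uint32_t high2;
} raw_stamp;

/* Reconstruct a 64-bit time from a high1/low/high2 triple. */
static uint64_t resolve_stamp(raw_stamp s) {
    uint32_t high = s.high1;
    if (s.high1 != s.high2) {
        /* Assumed: at most one rollover happened between the reads. */
        assert((uint32_t)(s.high2 - s.high1) == 1u);
        /* A small low value means it was read after the rollover, so it
           belongs with high2; a large one was read before it. */
        if (s.low < UINT32_C(0x80000000))
            high = s.high2;
    }
    return ((uint64_t)high << 32) | s.low;
}
```

Since the decision is made after the fact, the measurement path itself stays at three plain reads.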
    

    I should add that all of this is useful to me for measuring the speed of
    execution of short blocks of code. I use the numbers to decide on different
    ways of implementing things for advanced programming languages. Repeating
    operations over and over tends to lead to distorted measurements, since
    repeated loops tend to become cache resident more than they might in an actual
    program, etc.

"Measurement of repetition is not repetition of measurement" (Eugene Miya?)
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]