Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!ucsd!rutgers!maverick.ksu.ksu.edu!ux1.cso.uiuc.edu!ux1.cso.uiuc.edu!aglew
From: aglew@dwarfs.crhc.uiuc.edu (Andy Glew)
Newsgroups: comp.arch
Subject: Re: Computer time measurements (Was Re: 64 bits for times....)
Message-ID:
Date: 22 Aug 90 16:50:18 GMT
References: <26012@bellcore.bellcore.com> <11187@alice.UUCP> <1990Aug22.044826.18572@portia.Stanford.EDU>
Sender: usenet@ux1.cso.uiuc.edu (News)
Organization: University of Illinois, Computer Systems Group
Lines: 93
In-Reply-To: moss@cs.umass.edu's message of 22 Aug 90 12:42:33 GMT

I do software performance measurement and would *like* resolution down
to the clock rate of the machine. Personally, I generally want to
include the time taken by pipeline stalls, cache misses, etc., since
that is relevant to the user.

I guess what I would really like is elapsed (wall clock) time, cpu time
for the process (split into user and system time), and possibly
counters of other things (instructions executed, memory cycles (maybe
split into reads/writes), cache hits/misses, page translation
hits/misses, etc.). I don't think any of this is necessarily *hard* to
do, but it does take chip real estate. The counters should be readable
with ordinary instructions, but maybe settable only with special ones
(though if kept on a per process basis, a process can only screw up
itself).

Here's a fairly coherent schema for timers, merged from the best
features of several machines:

    Provide one, accurate, real time timer.
    Provide an offset register settable by the OS.
    Provide an instruction that atomically reads the real-time timer.
    Provide an instruction that atomically reads the sum of the
    real-time timer and the offset.

This gives you virtual CPU time.
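
A minimal sketch in C of how the timer-plus-offset scheme could yield
virtual CPU time. It assumes the OS saves the virtual reading when a
process is switched out and recomputes the offset when it is switched
back in; that policy, and every name below (timer_hw, switch_in,
switch_out, demo_virtual_time), is an illustrative assumption rather
than a description of any particular machine.

```c
#include <stdint.h>

/* Hypothetical model of the timer hardware; not any real machine's registers. */
typedef struct {
    uint64_t real;    /* free-running real-time counter, in ticks */
    uint64_t offset;  /* offset register, settable only by the OS */
} timer_hw;

/* The two read instructions, each assumed to be atomic. */
static uint64_t read_real(const timer_hw *t)    { return t->real; }
static uint64_t read_virtual(const timer_hw *t) { return t->real + t->offset; }

/* Assumed OS policy: save the virtual time on switch-out, and on
 * switch-in choose the offset so the virtual reading resumes there.
 * Unsigned arithmetic wraps modulo 2^64, so a "negative" offset works. */
typedef struct { uint64_t vtime; } process;

static void switch_out(const timer_hw *t, process *p) { p->vtime = read_virtual(t); }
static void switch_in(timer_hw *t, const process *p)  { t->offset = p->vtime - read_real(t); }

/* Simulated schedule: A runs 50 ticks, B runs 200, A runs 30 more.
 * Returns A's virtual time at the end. */
static uint64_t demo_virtual_time(void)
{
    timer_hw hw = { 1000, 0 };
    process a = { 0 }, b = { 0 };

    switch_in(&hw, &a);  hw.real += 50;   switch_out(&hw, &a);
    switch_in(&hw, &b);  hw.real += 200;  switch_out(&hw, &b);
    switch_in(&hw, &a);  hw.real += 30;
    return read_virtual(&hw);
}
```

In this simulation process A accumulates 50 + 30 = 80 ticks of virtual
time while the real timer advances by 280, so the 200 ticks spent in B
never show up in A's reading.
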

Ideally, you would be able to read both real and virtual time
atomically (and systems like the i860 LOCK operation let you do that),
but you can finesse it as follows:

    Read Real
    Read Virtual

If you get interrupted between the real and virtual readings, your OS
will account for it.

The Gould NP1 provided a machine cycle resolution timer with an offset
register, but only one read operation. So, the first version of the OS
had an offset, to read virtual time. The second version of the OS
always had the offset as 0, to read real time. I think that eventually
it was configured on a per process basis. But everyone wanted both.

A note: the covert channel security guys will want you to provide a
mask to remove the low order bits. They want to prevent high precision
timings being made.

Hardware costs: Yes, but there are a lot of things that you can do in
software to reduce the hardware costs. As I was trying to say in my
first post, I don't want all the hardware that would be necessary to
give me an accurate ns timer - i.e. I don't want the portable interface
in hardware. Give it to me raw. The carry chain for fast ticking can be
simplified - let me read the timer in carry-save format! (If you can do
big reads this is fine. If you cannot read twice the width of the timer
(carry-save) then tricks like those below can be applied.)

The most important items for general use are elapsed and cpu time, with
resolution down to the machine clock cycle time. Except on machines
that stretch clocks (as opposed to inserting "wait states"), this is
not technologically difficult, though the number of bits required may
necessitate an atomic operation to read the counter being sampled into
a special read out register, that can then be examined at leisure (and
similarly for setting).

Ideally, on a 64 bit machine, we will be able to read 64 bit timers
atomically. (And damn the board designer who puts a timer across an 8
bit interface, so that you have to stop it to read it.)

Although, on systems that cannot atomically read the entire timer, I've
had good luck with timestamps formed as follows:

    Read HIGH-PART -> timestamp.high1
    Read LOW-PART  -> timestamp.low
    Read HIGH-PART -> timestamp.high2

If you can guarantee that there is no process interrupt between these
operations (in the kernel that's easy), then postprocessing can compare
high1 and high2. If same, no problem. If different, they usually only
differ by one, and with assumptions about how quickly rollover can
occur you can figure out what the true time is.

Of course, the more of this stuff you have to do, the more LSBs you
have to throw out.

I should add that all of this is useful to me for measuring the speed
of execution of short blocks of code. I use the numbers to decide on
different ways of implementing things for advanced programming
languages. Repeating operations over and over tends to lead to
distorted measurements, since repeated loops tend to become cache
resident more than they might in an actual program, etc.

"Measurement of repetition is not repetition of measurement"
(Eugene Miya?)

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
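
The high1/low/high2 timestamp described above could be post-processed
along the following lines. This is only a sketch under stated
assumptions: the timer is taken to be 64 bits wide and read as two
32-bit halves, the low half carries into the high half, the three reads
are not interrupted, and together they take far less than half the
rollover period of the low half. The sim_timer type stands in for real
hardware registers, and all names are hypothetical.

```c
#include <stdint.h>

/* The three raw reads, stored for later post-processing. */
typedef struct {
    uint32_t high1;
    uint32_t low;
    uint32_t high2;
} timestamp;

/* Simulated 64-bit counter behind a 32-bit interface; every read costs
 * `step` ticks. A stand-in for real hardware, for illustration only. */
typedef struct { uint64_t count; uint64_t step; } sim_timer;

static uint32_t read_high(sim_timer *t)
{
    uint32_t v = (uint32_t)(t->count >> 32);
    t->count += t->step;
    return v;
}

static uint32_t read_low(sim_timer *t)
{
    uint32_t v = (uint32_t)t->count;
    t->count += t->step;
    return v;
}

/* Read HIGH, LOW, HIGH with no interruption in between (assumed). */
static timestamp timestamp_capture(sim_timer *t)
{
    timestamp ts;
    ts.high1 = read_high(t);
    ts.low   = read_low(t);
    ts.high2 = read_high(t);
    return ts;
}

/* Reconstruct the time at which LOW was read. If the high reads differ,
 * LOW rolled over somewhere between them: a small LOW means it was read
 * after the rollover (pair it with high2), a large LOW means it was read
 * before (pair it with high1). */
static uint64_t timestamp_resolve(timestamp ts)
{
    uint32_t high = ts.high1;
    if (ts.high1 != ts.high2 && ts.low < 0x80000000u)
        high = ts.high2;
    return ((uint64_t)high << 32) | ts.low;
}
```

The half-range threshold is one possible way to encode "assumptions
about how quickly rollover can occur"; a slower or narrower interface
would need a different rule, and would cost low-order bits as noted.
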