Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rutgers!ames!ucbcad!ucbvax!AC.UK!SYSMGR%UK.AC.KCL.PH.IPG
From: SYSMGR%UK.AC.KCL.PH.IPG@AC.UK
Newsgroups: mod.computers.vax
Subject: Re: VAX instruction timing
Message-ID: <8701100344.AA11320@ucbvax.Berkeley.EDU>
Date: Fri, 9-Jan-87 22:44:49 EST
Article-I.D.: ucbvax.8701100344.AA11320
Posted: Fri Jan 9 22:44:49 1987
Date-Received: Sat, 10-Jan-87 03:38:35 EST
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 97
Approved: info-vax@sri-kl.arpa

Re: How does one time VAX instructions, why don't results make much sense, how is one expected to write really fast code without these figures...

The following is a personal opinion that tends to arouse controversy, but here goes... If you don't want to read 5 or so screenfuls, hit delete now...

It is a waste of time to attempt to time individual instructions on a VAX cpu, or indeed to attempt any code tweaking which does not improve the algorithm of which the code is a possible realisation.

Why? Well, to start with, the only instructions which are likely to have a consistent execution time are register-to-register operations. You can time these, quite easily, by setting up a trivial subroutine containing a loop such as:

        .entry  test, ^M<r2>    ; entry mask saves r2
        movl    #100000, r2
10$:    xxx
        sobgtr  r2,10$
        ret
        .blkb   100             ; space to patch into with DEBUG
        .end

(xxx is the instruction(s) to time). You run it first with xxx deleted to get a time for the loop overhead, and then with the instruction(s) to be timed. Note that DEBUG can be used to deposit xxx, rather than running MACRO and LINK each time.

Timing is accomplished by bracketing this routine with a pair of calls to SYS$GETJPI returning CPUTIM (if you believe the results) or SYS$GETTIM to measure realtime, which is reliable if you have elevated your process priority above the swapper's (i.e. realtime priority) and provided the environment is quiet enough that few if any interrupts are being handled.
Note that this latter approach will annoy other users (if any) and requires privilege (ALTPRI). It is these considerations which lead to my opinion.

A more complex piece of code, such as a memory-to-memory copy of a buffer, has a time which depends both on virtual memory management (unless you lock the buffers into your working set, which is usually unrealistic) and on other system hardware activity, which cannot be avoided.

VM management is tuned by a large number of SYSGEN parameters, 'correct' setting of which is a subject of much debate amongst system managers. In fact there are no right values; the 'best' values depend on your system configuration, type of workload, management objectives, etc. The CPU time taken by the buffer copy must include page fault overheads to be realistic, but these depend both on your SYSGEN parameters and on how busy the system is when you run your test. If there are no other users, pages of memory lost from your process working set simply sit around in semiconductor memory until they are next needed (unless you exceed the physical memory size of your VAX), and are retrieved with a 'soft' page fault which involves no disc IO. In contrast, if the system is busy, a page lost by your process will probably be written back to disc and grabbed by some other process, so you incur a second disc IO when you next fault it in. 'Hard' page faults take a lot longer (in both CPU and real time) than 'soft' ones.

As for hardware activity, only one device can be a bus master at once. If a DMA transfer to memory is in progress, the CPU may have to wait for one or more bus cycles until it can become bus master and access memory. Whether this is significant depends on the amount of DMA activity and on the bandwidth of the main bus (SBI, BI, CMI) and its interaction with the device bus (UNIBUS, QBUS); I have heard that it is particularly significant with a UNIBUS adapter on a BI-bus VAX.
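
For readers without a VAX to hand, the bracketing method described earlier (time the empty loop, time the loop with the code under test, subtract) can be sketched in Python. This is only an illustration under stated assumptions, not the MACRO routine itself: time.process_time stands in loosely for the CPUTIM figure and time.perf_counter for the real-time clock, and the names time_loop and net_cost are hypothetical.

```python
import time


def time_loop(body, iterations):
    """Run body() `iterations` times; return (cpu_seconds, wall_seconds)."""
    cpu_start = time.process_time()    # CPU time charged to this process
    wall_start = time.perf_counter()   # elapsed real time
    for _ in range(iterations):
        body()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu, wall


def net_cost(body, iterations=100_000):
    """Per-iteration (cpu, wall) cost of body() minus the empty-loop overhead."""
    def empty():
        pass

    base_cpu, base_wall = time_loop(empty, iterations)   # loop overhead only
    cpu, wall = time_loop(body, iterations)              # overhead + body
    return (cpu - base_cpu) / iterations, (wall - base_wall) / iterations


if __name__ == "__main__":
    # Hypothetical workload: a memory-to-memory copy of a 64 KB buffer.
    src = bytearray(64 * 1024)
    dst = bytearray(64 * 1024)

    def copy_buffer():
        dst[:] = src

    for run in range(3):
        cpu, wall = net_cost(copy_buffer, 10_000)
        print(f"run {run}: cpu {cpu * 1e6:.2f} us/copy, wall {wall * 1e6:.2f} us/copy")
```

Repeating the measurement, as the three runs here do, is the quickest way to see how far the figures move with system load; on a noisy machine the subtracted result can even come out negative for a very cheap body.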

If this sounds rather theoretical, I know that on our VAX some jobs can take twice as much CPU on a busy system as on a lightly-loaded one. When users complain, I point out that this effect is common in the real world; most companies offer discounts to customers for off-peak resources or to shift less popular products. I also noticed that when we doubled our physical memory, many jobs got a lot faster, and when we went to VMS V4, many jobs slowed down, both of which make rather a nonsense of CPU MIP ratings.

Note that my opinion only applies to instruction tweaking that does not improve the algorithm. A while ago (on VMS 1.5 I think!) I transferred a two-dimensional FFT routine from a CDC 7600 to our VAX. On the 7600, hand-coded assembler in the inner loops reduced runtime by 60%. On the VAX, I achieved 10%, and it wasn't because the compiler was optimising registers well enough already - I took a look at the compiler code and shuddered! In contrast, an algorithmic improvement that reduced the number of complex multiplies involved in the computation saved much the same percentage on the 7600 and on the VAX, and another that accessed memory in a more sequential manner (thereby reducing pagefaults) paid handsomely.

In summary, VAXes are high-level machines and a low-level approach to optimisation can be left to DEC's compiler writers, who are now doing a good job. Where programmers can score is by improving their algorithms, and by understanding the basic principles of VM management so as to work with VMS rather than against it. System programmers likewise usually do better to study the VMS internals manual than to worry about whether two BBS instructions are better or worse than a BITL and a BEQL. Incidentally, this is probably equally true of any other virtual memory system.

MIPs are a waste of time for everybody except salesmen ... "Figures can't lie, but liars sure can figure".

Nigel Arnot (Dept. Physics, Kings college, Univ.
of London; U.K)
Bitnet/NetNorth/Earn: sysmgr@ipg.ph.kcl.ac.uk (or) sysmgr%kcl.ph.vaxa@ac.uk
Arpa : sysmgr%ipg.ph.kcl.ac.uk@ucl-cs.arpa