Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!rutgers!ames!ucbcad!ucbvax!AC.UK!SYSMGR%UK.AC.KCL.PH.IPG
From: SYSMGR%UK.AC.KCL.PH.IPG@AC.UK
Newsgroups: mod.computers.vax
Subject: Re: VAX instruction timing
Message-ID: <8701100344.AA11320@ucbvax.Berkeley.EDU>
Date: Fri, 9-Jan-87 22:44:49 EST
Article-I.D.: ucbvax.8701100344.AA11320
Posted: Fri Jan  9 22:44:49 1987
Date-Received: Sat, 10-Jan-87 03:38:35 EST
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The ARPA Internet
Lines: 97
Approved: info-vax@sri-kl.arpa

Re: How does one time VAX instructions, why don't results make much sense,
    how is one expected to write really fast code without these figures...

The following is a personal opinion that tends to arouse controversy, but here
goes... If you don't want to read 5 or so screenfuls, hit delete now...

It is a waste of time to attempt to time individual instructions on
a VAX cpu, or indeed to attempt any code tweaking which does not improve the
algorithm of which the code is a possible realisation.

Why? Well, to start with the only instructions which are likely to have a
consistent execution time are register-to-register operations. You can time
these, quite easily, by setting up a trivial subroutine containing a loop
such as:
		.entry test, #M<r2>
                movl #100000, r2
10$:		xxx
                sobgtr r2,10$
		ret
		.blkb 100        ; space to patch into with DEBUG
		.end

(xxx is the instruction(s) to time). You run it first with xxx deleted to get
a time for the loop overhead, and then with the instruction(s) to be timed;
the difference between the two runs is the time attributable to xxx.
Note that DEBUG can be used to deposit xxx, rather than MACRO and LINK each
time. Timing is accomplished by bracketing this routine with a pair of calls
to SYS$GETJPI returning CPUTIM (if you believe the results) or SYS$GETTIM to
measure realtime, which is reliable if you have elevated your process priority
above the swapper's (i.e. realtime priority) and provided the environment is
quiet enough that few if any interrupts are being handled. Note that this latter
approach will annoy other users (if any) and requires privilege (ALTPRI).
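
For readers without a VAX to hand, the subtract-the-overhead method can be
sketched in Python. This is only an analogue under stated assumptions:
time.process_time stands in for the CPUTIM item from SYS$GETJPI and
time.perf_counter for SYS$GETTIM, and the names time_loop and net_cost are
invented for this illustration. The figures it gives are subject to exactly
the kind of noise discussed in this article.

```python
import time

def time_loop(body, iterations=100_000, clock=time.process_time):
    """Seconds taken to call body() `iterations` times, loop overhead included."""
    start = clock()
    for _ in range(iterations):
        body()
    return clock() - start

def net_cost(body, iterations=100_000, clock=time.process_time):
    """Estimate the per-call cost of body() by subtracting an empty-loop run."""
    def empty():
        pass                      # plays the part of the loop with xxx deleted
    overhead = time_loop(empty, iterations, clock)
    total = time_loop(body, iterations, clock)
    return (total - overhead) / iterations

if __name__ == "__main__":
    work = lambda: sum(range(50))             # hypothetical code under test
    print("CPU time per call :", net_cost(work))
    print("real time per call:", net_cost(work, clock=time.perf_counter))
```

Run it several times on a quiet and on a busy machine; how far the two clocks
disagree, and how much the result moves between runs, is the point.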

It is these considerations which lead to my opinion. A more complex piece of
code, such as a memory-to-memory copy of a buffer, has a time which depends both
on virtual memory management (unless you lock the buffers into your working
set, which is usually unrealistic) and on other system hardware activity, which
cannot be avoided.

VM management is tuned by a large number of SYSGEN parameters, 'correct' setting
of which is a subject of much debate amongst system managers. In fact there are
no right values; the 'best' values depend on your system configuration,
type of workload, management objectives, etc. The CPU time taken by the buffer
copy must include page fault overheads to be realistic, but these depend both on
your SYSGEN, and also on how busy the system is when you run your test. If there
are no other users, pages of memory lost from your process working set simply
sit around in semiconductor memory until they are next needed (unless you exceed
the physical memory size of your VAX), and are retrieved with a 'soft' page
fault which involves no disc IO. In contrast, if the system is busy, a page
lost by your process will probably be written back to disc and grabbed by
some other process, so you incur a second disc IO when you next fault it in.
'Hard' page faults take a lot longer (in both CPU and real time) than 'soft'
ones.
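
The soft/hard distinction can be illustrated with a toy model in Python. It is
a sketch under loud assumptions, not a description of the VMS pager: the
working set is replaced first-in first-out, an evicted page goes to a free
list with its contents intact, and the single parameter free_list_size
(invented here) stands for how long a page survives there before some other
process grabs it. The function name count_faults is likewise hypothetical.

```python
from collections import OrderedDict, deque

def count_faults(references, working_set_size, free_list_size):
    """Return (soft, hard) fault counts for a sequence of page numbers."""
    working_set = deque()        # resident pages, oldest first (assumed FIFO)
    free_list = OrderedDict()    # evicted pages still in memory, oldest first
    soft = hard = 0
    for page in references:
        if page in working_set:
            continue             # already resident: no fault
        if page in free_list:
            soft += 1            # recovered from memory, no disc IO
            del free_list[page]
        else:
            hard += 1            # contents gone: must be read from disc
        if len(working_set) >= working_set_size:
            evicted = working_set.popleft()
            free_list[evicted] = True
            while len(free_list) > free_list_size:
                free_list.popitem(last=False)   # page reused by someone else
        working_set.append(page)
    return soft, hard

# Sweep a 20-page buffer five times with a 10-page working set.
refs = list(range(20)) * 5
quiet = count_faults(refs, working_set_size=10, free_list_size=100)
busy = count_faults(refs, working_set_size=10, free_list_size=2)
print("quiet system (soft, hard):", quiet)
print("busy system  (soft, hard):", busy)
```

In this toy run both cases take 100 faults in total; on the 'quiet' system 80
of them are soft, on the 'busy' one all 100 are hard. Same program, same
data, very different cost.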

As for hardware activity, only one device can be a bus master at once. If
a DMA transfer to memory is in progress, the CPU may have to wait for one
or more bus cycles until it can become bus master and access memory. Whether
this is significant depends on the amount of DMA activity and on the
bandwidth of the main bus (SBI, BI, CMI) and its interaction with the device
bus (UNIBUS, QBUS); I have heard that it is particularly significant with
a UNIBUS adapter on a BI-bus VAX.

If this sounds rather theoretical, I know that on our VAX some jobs can take
twice as much CPU on a busy system as on a lightly-loaded one. When users
complain, I point out that this effect is common in the real world; most
companies offer discounts to customers for off-peak resources or to shift
less popular products. I also noticed that when we doubled our physical
memory, many jobs got lots faster, and when we went to VMS V4, many jobs
slowed down, both of which make rather a nonsense of CPU MIP ratings.

Note that my opinion only applies to instruction tweaking that does not
improve the algorithm. A while ago (on VMS 1.5 I think!) I transferred a
two-dimensional FFT routine from a CDC 7600 to our VAX. On the 7600,
hand-coded assembler in the inner loops reduced runtime by 60%. On the
VAX, I achieved 10%, and it wasn't because the compiler was optimising
registers well enough already - I took a look at the compiler code and
shuddered! In contrast, an algorithmic improvement that reduced the
number of complex multiplies involved in the computation saved much the
same percentage on the 7600 and on the VAX, and another that accessed
memory in a more sequential manner (thereby reducing pagefaults) paid
handsomely.
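
Why access order matters so much can be seen from a small Python simulation.
This is an illustrative sketch, not the FFT code: it assumes a 128 x 128 array
of 8-byte elements (one single-precision complex number each) stored row by
row, the VAX's 512-byte page, and a process allowed 64 resident pages replaced
first-in first-out. The names page_faults and trace are invented for the
example.

```python
from collections import deque

PAGE_BYTES = 512       # VAX page size
ELEMENT_BYTES = 8      # assumed: one single-precision complex number

def page_faults(addresses, resident_pages):
    """Count faults for a byte-address trace with a FIFO set of resident pages."""
    resident = deque()
    faults = 0
    for addr in addresses:
        page = addr // PAGE_BYTES
        if page not in resident:
            faults += 1
            if len(resident) >= resident_pages:
                resident.popleft()          # evict the oldest page
            resident.append(page)
    return faults

def trace(n, by_rows):
    """Byte addresses touched when sweeping an n x n array stored row by row."""
    for i in range(n):
        for j in range(n):
            row, col = (i, j) if by_rows else (j, i)
            yield (row * n + col) * ELEMENT_BYTES

N = 128
sequential = page_faults(trace(N, by_rows=True), resident_pages=64)
strided = page_faults(trace(N, by_rows=False), resident_pages=64)
print("row-by-row sweep      :", sequential, "faults")
print("column-by-column sweep:", strided, "faults")
```

Under these assumptions the sequential sweep faults once per page (256 times)
while the strided sweep faults on every single access (16384 times), for
exactly the same arithmetic.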

In summary, VAXes are high-level machines and a low-level approach to
optimisation can be left to DEC's compiler writers, who are now doing a good
job. Where programmers can score is by improving their algorithms, and
by understanding the basic principles of VM management so as to work with
VMS rather than against it. System programmers likewise usually do better to
study the VMS internals manual than to worry about whether two
BBS instructions are better or worse than a BITL and a BEQL.

Incidentally this is probably equally true of any other virtual memory system.
MIPs are a waste of time for everybody except salesmen ... "Figures can't
lie, but liars sure can figure".

Nigel Arnot (Dept. Physics, King's College, Univ. of London;  U.K.)

Bitnet/NetNorth/Earn:   sysmgr@ipg.ph.kcl.ac.uk (or) sysmgr%kcl.ph.vaxa@ac.uk
       Arpa         :   sysmgr%ipg.ph.kcl.ac.uk@ucl-cs.arpa