Path: utzoo!attcan!uunet!zaphod.mps.ohio-state.edu!maverick.ksu.ksu.edu!rutgers!mcdchg!motmpl!ron
From: ron@motmpl.UUCP (Ron Widell)
Newsgroups: comp.sys.m68k
Subject: Re: 68020 cache and loops
Message-ID: <1869@motmpl.UUCP>
Date: 13 Nov 90 22:31:26 GMT
References: <1990Nov7.204723.4072@Matrix.COM>
Reply-To: ron@motmpl.UUCP (Ron Widell)
Organization: Motorola Semiconductor, Minneapolis, MN.
Lines: 81

Steve Morris (srm@matrx.matrix.com) writes:
> 
> Some folks have claimed that
> 
>             Loop:
>                 ...                             ; whatever
>                 subq    #1,<count_reg>          ; decrement counter reg
>                 bne     Loop                    ; branch if not zero
> 
> 
> is faster than
> 
>             Loop:
>                 ...                             ; whatever
>                 dbra    <count_reg>,Loop        ; branch if not -1
> 
Let's examine a certain pathological case to see if this can be true.
If, in a paged/virtual memory system (which necessarily presumes the use of a
68851 PMMU or other hardware to trap page boundary violations) we have:
Logical Page N, physically resident----------------------------------
|             Loop:
|                 ...                             ; whatever
|                 dbra    <count_reg>,Loop        ; branch if not -1
---------------------------------------------------------------------
Logical Page N+1, not physically resident----------------------------
|                 instruction logically following dbra (via PC increment)
---------------------------------------------------------------------
When the prefetch following the dbra reaches into page N+1, a Bus Error will
be generated by the PMMU (or other hardware). However, a Bus Error exception
*WILL NOT* be taken at this time; rather, the valid bit in the tag field of the
cache line will be cleared. So we can ignore exception processing as a source
of overhead.
Also, since the execution unit and the bus controller are pretty well decoupled,
and neither of the branch instructions will generate any bus traffic, the
branch will complete and the sequencer will issue the next instruction (the
top of the loop) while the prefetch to non-resident memory is taking place.
This holds IFF the loop spans < 256 bytes, i.e. it fits in the on-chip
instruction cache (we'll discuss the other case later). Thus, the prefetch does
not stall the pipe, so (counting cycles) we find from the manual that the
relevant instruction times, in clocks, are:
                       Best Case    Cache Case    Worst Case
Case #1:
  subq.w #1,Dn             0             2             3
  bne Loop (taken)         3             6             9
vs.
Case #2:
  dbra Dn,Loop             3             6             9

It is much more likely that the dbra instruction can take advantage of overlap
due to bus activity from a previous instruction, since any overlap in the
first case would really show up during the subq instruction (and the I-pipe
is only 3 words deep). Thus for cases where the loop fits entirely in cache,
I would expect case #2 to be at least as fast as #1, perhaps faster.
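
To make the arithmetic explicit, here is a rough sketch (Python used purely as
a calculator) that sums the loop-closing overhead per iteration from the table
above. The names TIMES and loop_overhead are mine, for illustration only, and
simply adding the columns ignores exactly the overlap effects just discussed,
so treat the totals as rough bounds rather than measurements.

```python
# Clock counts (best, cache, worst) copied from the table in this post.
TIMES = {
    "subq.w #1,Dn":     (0, 2, 3),
    "bne Loop (taken)": (3, 6, 9),
    "dbra Dn,Loop":     (3, 6, 9),
}

CASES = ("best", "cache", "worst")


def loop_overhead(instructions):
    """Sum the per-iteration clocks of the given instructions, per case.

    This is a naive column sum (an assumption of this sketch); it does not
    model overlap between the execution unit and the bus controller.
    """
    totals = [0, 0, 0]
    for name in instructions:
        for i, clocks in enumerate(TIMES[name]):
            totals[i] += clocks
    return dict(zip(CASES, totals))


case1 = loop_overhead(["subq.w #1,Dn", "bne Loop (taken)"])
case2 = loop_overhead(["dbra Dn,Loop"])

for label, totals in (("#1 subq/bne", case1), ("#2 dbra", case2)):
    print(label, totals)
```

On these figures the two cases tie in the best case (3 clocks each), and dbra
is ahead by 2 clocks in the cache case and 3 in the worst case.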

In those cases where the loop *DOES NOT* fit entirely in cache, we will have
additional latency for both cases because we will wait for the prefetch cycle
to complete (via the *BERR signal) prior to initiating the fetch for the
instruction at the top of the loop. Note that in this example we will get a
page fault inside the loop for case #1, rather than after the loop as in case
#2. Here I would really expect #1 to be slower, as we *WILL* do Bus Error
exception processing. Also note that *BOTH* branch instructions prefetch, not
just dbra.
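
Whether a given loop "fits entirely in cache" can be checked with a sketch
like the following. It assumes the 68020's on-chip instruction cache
organization (64 direct-mapped long-word entries, selected by address bits
A7-A2); the function names are hypothetical, and the check deliberately
ignores the prefetch past the end of the loop.

```python
CACHE_ENTRIES = 64   # 68020 on-chip instruction cache: 64 entries
ENTRY_BYTES = 4      # one long word per entry, 256 bytes in total


def cache_index(addr):
    """Cache entry selected by address bits A7-A2 (direct mapped)."""
    return (addr // ENTRY_BYTES) % CACHE_ENTRIES


def loop_fits_in_cache(start, end):
    """True if bytes start..end-1 touch at most 64 long words.

    With at most 64 consecutive long words, no two of them map to the
    same cache entry, so the loop body cannot evict itself.
    """
    first = start // ENTRY_BYTES
    last = (end - 1) // ENTRY_BYTES
    return (last - first + 1) <= CACHE_ENTRIES


# A 256-byte loop fits only if it starts on a long-word boundary.
print(loop_fits_in_cache(0x1000, 0x1100))
print(loop_fits_in_cache(0x1002, 0x1102))
```

Note that a loop of exactly 256 bytes that starts on an odd word touches 65
long words, so its first and last long words compete for the same entry.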
> 
> because the next instruction following the 'dbra' instruction
> is always prefetched by the 68020 (and never cached).
> 
Prefetching is always occurring on the 68020 (assuming memory bandwidth is
available), except when you use the 'sync' instruction (officially known as
NOP). And if a valid access to memory occurs, the instruction is cached,
provided the cache is enabled both by hardware and software.

An additional case where #1 may be faster is where the branch displacement is
small enough that bne can use a byte (seven bits plus sign) displacement value,
saving the extension word; dbra always carries a 16-bit displacement. But
this was not suggested by your example.
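
As a rough check of when the byte form applies, the sketch below computes the
displacement a short bne would need. It assumes the usual Bcc rule that the
displacement is relative to the branch address plus 2, and that on the 68020
the byte values $00 and $FF are reserved to select the 16- and 32-bit forms.
The function name is hypothetical.

```python
def bcc_byte_displacement(branch_addr, target_addr):
    """Return the 8-bit displacement for a short Bcc, or None if it won't fit.

    The displacement is taken relative to the branch address + 2.
    """
    disp = target_addr - (branch_addr + 2)
    if disp in (0, -1):
        # $00 and $FF in the byte field select the word and long forms.
        return None
    if -128 <= disp <= 127:
        return disp
    return None


# bne at $1010 branching back to Loop at $1000: fits in a byte.
print(bcc_byte_displacement(0x1010, 0x1000))
# bne at $1100 branching back to $1000: needs the 16-bit form.
print(bcc_byte_displacement(0x1100, 0x1000))
```

So for a backward branch, the top of the loop can be at most 126 bytes before
the bne itself for the byte form to be usable.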

> Can someone shed some light on this?
>
Hopefully, this helped.

Regards,
-- 
Ron Widell, Field Applications Eng.	|UUCP: {...}mcdchg!motmpl!ron
Motorola Semiconductor Products, Inc.,	|Voice:(612)941-6800
9600 W. 76th St., Suite G		| I'm from Silicon Tundra,
Eden Prairie, Mn. 55344 -3718		| what could I know?