Path: utzoo!attcan!uunet!zaphod.mps.ohio-state.edu!maverick.ksu.ksu.edu!rutgers!mcdchg!motmpl!ron
From: ron@motmpl.UUCP (Ron Widell)
Newsgroups: comp.sys.m68k
Subject: Re: 68020 cache and loops
Message-ID: <1869@motmpl.UUCP>
Date: 13 Nov 90 22:31:26 GMT
References: <1990Nov7.204723.4072@Matrix.COM>
Reply-To: ron@motmpl.UUCP (Ron Widell)
Organization: Motorola Semiconductor, Minneapolis, MN.
Lines: 81

Steve Morris (srm@matrx.matrix.com) writes:
>
> Some folks have claimed that
>
> Loop:
>       ...             ; whatever
>       subq #1,        ; decrement counter reg
>       bne Loop        ; branch if not zero
>
>
> is faster than
>
> Loop:
>       ...             ; whatever
>       dbra ,Loop      ; branch if not -1
>

Let's examine a certain pathological case to see if this can be true.
If, in a paged/virtual memory system (which necessarily presumes the use
of a 68851 PMMU or other hardware to trap page boundary violations)
we have:

Logical Page N, physically resident----------------------------------
| Loop:
|       ...             ; whatever
|       dbra ,Loop      ; branch if not -1
---------------------------------------------------------------------
Logical Page N+1, not physically resident----------------------------
| instruction logically following dbra (via PC increment)
---------------------------------------------------------------------

We discover that a Bus Error will be generated by the PMMU (or other
hardware) when the prefetch runs into Page N+1. However, a Bus Error
exception *WILL NOT* be taken at this time; rather, the valid bit in the
tag field of the cache line will be cleared. So we can ignore exception
processing as a source of overhead. Also, since the execution unit and
the bus controller are pretty well decoupled, and neither of the branch
instructions will generate any bus traffic, the branch will complete and
the sequencer will issue the next instruction (the top of the loop) while
the prefetch to non-resident memory is taking place; IFF the branch is
< 256 bytes (we'll discuss the other case later).
Thus, the prefetch does not stall the pipe, so (counting cycles) we find
from the manual that the relevant instruction times are:

                        Best Case   Cache Case   Worst Case
subq.w #1,Dn                0           2            3
bne    Loop (taken)         3           6            9
        vs.
dbra   Dn,Loop              3           6            9

It is much more likely that the dbra instruction can take advantage of
overlap due to bus activity from a previous instruction, since any
overlap in the first case would really show up during the subq
instruction (and the I-pipe is only 3 words deep). Thus for cases where
the loop fits entirely in cache, I would expect case #2 to be at least as
fast as #1, perhaps faster.

In those cases where the loop *DOES NOT* fit entirely in cache, we will
have additional latency for both cases because we will wait for the
prefetch cycle to complete (via the *BERR signal) prior to initiating the
fetch for the instruction at the top of the loop.

Note that in this example we will get a page fault inside the loop for
case #1, rather than after the loop as in case #2. Here I would really
expect #1 to be slower, as we *WILL* do Bus Error exception processing.
Also note that *BOTH* instructions have prefetch, not just dbra.

>
> because the next instruction following the 'dbra' instruction
> is always prefetched by the 68020 (and never cached).
>

Prefetching is always occurring on the 68020 (assuming memory bandwidth
is available), except when you use the 'sync' instruction (officially
known as NOP). And if a valid access to memory occurs, the instruction is
cached, provided the cache is enabled both by hardware and software.

An additional case where #1 may be faster is where the branch
displacement is such that we can use a byte (seven bits plus sign)
displacement value. But this was not suggested by your example.

> Can someone shed some light on this?
>

Hopefully, this helped.

Regards,
--
Ron Widell, Field Applications Eng.    |UUCP: {...}mcdchg!motmpl!ron
Motorola Semiconductor Products, Inc., |Voice:(612)941-6800
9600 W. 76th St., Suite G              | I'm from Silicon Tundra,
Eden Prairie, Mn. 55344 -3718          | what could I know?
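
For anyone who wants the arithmetic behind the cycle table spelled out, here is a minimal sketch that sums the quoted counts for each loop-closing sequence. The numbers are exactly those given in the post; the names (SUBQ, BNE_TAKEN, DBRA, loop_overhead) are made up for illustration, and the plain sum is an assumption that ignores the instruction overlap the post describes, so treat the result as a rough upper bound rather than a measured timing.

```python
# Cycle counts as quoted in the post: (best case, cache case, worst case).
SUBQ = (0, 2, 3)        # subq.w #1,Dn
BNE_TAKEN = (3, 6, 9)   # bne Loop (taken)
DBRA = (3, 6, 9)        # dbra Dn,Loop


def loop_overhead(*instructions):
    """Sum the per-case cycle counts of the given instructions.

    Assumes no overlap between instructions, which the 68020 does not
    guarantee; this is only the naive column-wise total.
    """
    return tuple(sum(column) for column in zip(*instructions))


subq_bne = loop_overhead(SUBQ, BNE_TAKEN)
dbra = loop_overhead(DBRA)

for label, total in (("subq + bne", subq_bne), ("dbra", dbra)):
    print(f"{label:<12} best={total[0]} cache={total[1]} worst={total[2]}")
```

Under that no-overlap assumption the subq/bne pair only matches dbra in the best case, which is consistent with the expectation that case #2 is at least as fast as #1 when the loop fits in cache.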