Path: utzoo!attcan!uunet!husc6!think!ames!pasteur!ucbvax!decwrl!decvax!tektronix!orca!tekecs!frip!andrew
From: andrew@frip.gwd.tek.com (Andrew Klossner)
Newsgroups: comp.arch
Subject: Re: Details on Moto 88200 CMMU
Message-ID: <10067@tekecs.TEK.COM>
Date: 12 Jun 88 00:35:41 GMT
References: <6396@cup.portal.com>
Sender: andrew@tekecs.TEK.COM
Organization: Tektronix, Wilsonville, Oregon
Lines: 72

[]

	"The cache is 4-way set-associative, so most thrashing problems
	are eliminated. When multiple CMMU chips are used in parallel,
	they effectively increase the associativity of the cache; two
	chips in parallel act as an 8-way set-associative cache."

No, it's still 4-way set-associative.  Since exactly one CMMU will see
any one page reference, a miss in a full set will still evict one of
that set's 4 existing elements.
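
To make the point concrete, here is a toy model in Python.  It is a
sketch under stated assumptions, not a description of the silicon: the
geometry (256 sets of 4 lines, 16 bytes per line) is inferred from a
16 KB 4-way cache with 16-byte lines, LRU replacement is assumed, and
the chip select on the address bits just above A11 follows the scheme
described later in this post.  The names CmmuBank and misses_for are
made up for illustration.

```python
from collections import OrderedDict

LINE_BYTES = 16   # bytes per cache line
SETS = 256        # sets per CMMU (assumed: 16 KB / 4 ways / 16 bytes)
WAYS = 4          # lines per set

class CmmuBank:
    """Toy model of n_chips CMMUs selected by the address bits above A11."""

    def __init__(self, n_chips):
        self.n_chips = n_chips
        self.sets = {}      # (chip, set_index) -> OrderedDict of resident tags
        self.misses = 0

    def access(self, addr):
        chip = (addr >> 12) % self.n_chips       # e.g. A13:A12 for four chips
        set_index = (addr // LINE_BYTES) % SETS  # A11:A4
        tag = addr >> 12
        lines = self.sets.setdefault((chip, set_index), OrderedDict())
        if tag in lines:
            lines.move_to_end(tag)               # hit: mark most recently used
            return True
        self.misses += 1
        if len(lines) == WAYS:
            lines.popitem(last=False)            # evict least recently used (assumed LRU)
        lines[tag] = True
        return False

def misses_for(n_chips, n_lines, rounds=3):
    """Cycle through n_lines addresses that all land on one chip and one set."""
    bank = CmmuBank(n_chips)
    stride = 4096 * n_chips  # keeps the chip-select and set-index bits fixed
    for _ in range(rounds):
        for i in range(n_lines):
            bank.access(i * stride)
    return bank.misses
```

In this model, four conflicting lines fit in a set and miss only on
first touch, while five conflicting lines miss on every access whether
the bank has one, two, or four chips.  Extra chips add sets, not ways.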

	"Even when copy-back is selected, the first write to a cached
	location is written through. This updates the main memory and
	invalidates any other copies of the data that may be present in
	other caches, to ensure that no more than one cache contains a
	modified version of the data."

It writes back the entire 16-byte cache line, regardless of how much of
it was modified.  And it does so even if the page is marked "not
global," meaning that software guarantees that the cache line is not
present in any other cache.  Ouch!

	"Because the snooping logic requires access to the cache tags
	whenever an M-bus access to a global location occurs, normal
	operation of the cache is affected. If a P-bus request occurs
	while the snooping logic is checking the tags, a wait state is
	generated. This causes some drop in performance in
	multiprocessor systems... "

Every snooped access causes all CMMUs to stop servicing the CPU(s)
while they do the tag check.  Snooping is *very* expensive.  You only
use it for small pieces of memory shared among CPUs.

	"Note that adding Cmmus increases the size of the PATC and BATC
	as well as the data/instruction cache."

Adding CMMUs creates some interesting problems.  With more than one
CMMU on a memory port (instruction or data), you pretty much have to
use part of the memory address as a chip selector.  (We use A12 and A13
to select one of four CMMUs.)  But this memory address is *virtual*, so
suddenly we have a (somewhat) virtual cache with aliasing problems.
For example, if physical page 12 is mapped to virtual page 16, it is
serviced by CMMU0; if the kernel remaps it to virtual page 17, it is
serviced by CMMU1.  The aliasing problems can be solved by snooping all
of memory, but this is prohibitively expensive.  The kernel can flush a
page from cache when freeing it, but a cache page flush takes a minimum
of 256 cycles.  Two solutions that we're looking at are:

  1) Make the "software page cluster size" be 4 pages; that is, always
     allocate 4 contiguous physical pages together.  This turns it back
     into a physical cache.  On the downside, this makes for higher
     internal fragmentation and wastes three-fourths of the PATC.

  2) Maintain four separate lists of free pages, one for each of the
     four values of <A13:A12>, and allocate physical pages so that
     a page is always serviced by the same CMMU.  When there are no
     free pages in the right list to back a new virtual page, allocate
     a page from some other list, and flush it from the old CMMU.
     Credit for this idea goes to the Motorola Unix kernel group in
     Tempe, Arizona.
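
The chip select and the second scheme can be sketched in Python as
follows.  This is one reading of the idea, not the Motorola or
Tektronix kernel code: it assumes 4 KB pages (so <A13:A12> are the low
two bits of the virtual page number), it files each free page under
the CMMU that last serviced it, and it records flushes in a list
instead of touching hardware.  ColoredFreeLists, cmmu_for_page, and
the flushes list are hypothetical names.

```python
N_CMMUS = 4   # four CMMUs on the port, selected by <A13:A12>

def cmmu_for_page(vpage):
    """CMMU that services a virtual page (assumes 4 KB pages)."""
    return vpage % N_CMMUS

class ColoredFreeLists:
    """One free list per CMMU, indexed by the CMMU that last saw the page."""

    def __init__(self):
        self.lists = [[] for _ in range(N_CMMUS)]
        self.flushes = []   # (physical page, old CMMU) pairs, for illustration

    def free(self, ppage, last_vpage):
        """Return ppage to the list of the CMMU that was servicing it."""
        self.lists[cmmu_for_page(last_vpage)].append(ppage)

    def allocate(self, vpage):
        """Pick a physical page to back vpage, flushing only when forced to."""
        want = cmmu_for_page(vpage)
        if self.lists[want]:
            return self.lists[want].pop()          # same CMMU as before: no flush
        for old in range(N_CMMUS):
            if self.lists[old]:
                ppage = self.lists[old].pop()
                self.flushes.append((ppage, old))  # stand-in for the real page flush
                return ppage
        raise MemoryError("no free pages")
```

With the example above, physical page 12 freed from virtual page 16
can back virtual page 20 with no flush, since both map to CMMU0.
Backing virtual page 17 with it instead forces a flush from CMMU0.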

Note that, when you want to enlarge a cache, you end up buying multiple
MMUs to go with your additional RAM.  This is pretty pricey, but it can
provide well-scheduled software with additional opportunities for
parallelism: since memory loads and stores are pipelined, a load from
one CMMU can wait on a page table walk while a load from a second CMMU
can be serviced from a cache hit.

On the whole, it's a neat part.

  -=- Andrew Klossner   (decvax!tektronix!tekecs!andrew)       [UUCP]
                        (andrew%tekecs.tek.com@relay.cs.net)   [ARPA]