Path: utzoo!attcan!uunet!husc6!think!ames!pasteur!ucbvax!decwrl!decvax!tektronix!orca!tekecs!frip!andrew
From: andrew@frip.gwd.tek.com (Andrew Klossner)
Newsgroups: comp.arch
Subject: Re: Details on Moto 88200 CMMU
Message-ID: <10067@tekecs.TEK.COM>
Date: 12 Jun 88 00:35:41 GMT
References: <6396@cup.portal.com>
Sender: andrew@tekecs.TEK.COM
Organization: Tektronix, Wilsonville, Oregon
Lines: 72

[]

	"The cache is 4-way set-associative, so most thrashing problems
	are eliminated. When multiple CMMU chips are used in parallel,
	they effectively increase the associativity of the cache; two
	chips in parallel act as an 8-way set-associative cache."

No, it's still 4-way set-associative. Since exactly one CMMU will see
any one page reference, a set hit will still flush one of the 4
existing set elements.

	"Even when copy-back is selected, the first write to a cached
	location is written through. This updates the main memory and
	invalidates any other copies of the data that may be present in
	other caches, to ensure that no more than one cache contains a
	modified version of the data."

It writes back the entire 16-byte cache line, regardless of how much of
it was modified. And it does so even if the page is marked "not
global," meaning that software guarantees that the cache line is not
present in any other cache. Ouch!

	"Because the snooping logic requires access to the cache tags
	whenever an M-bus access to a global location occurs, normal
	operation of the cache is affected. If a P-bus request occurs
	while the snooping logic is checking the tags, a wait state is
	generated. This causes some drop in performance in
	multiprocessor systems... "

Every snooped access causes all CMMUs to stop servicing the CPU(s)
while they do the tag check. Snooping is *very* expensive. You only
use it for small pieces of memory shared among CPUs.

	"Note that adding Cmmus increases the size of the PATC and BATC
	as well as the data/instruction cache."

Adding CMMUs creates some interesting problems.
With more than one CMMU on a memory port (instruction or data), you
pretty much have to use part of the memory address as a chip selector.
(We use A12 and A13 to select one of four CMMUs.) But this memory
address is *virtual*, so suddenly we have a (somewhat) virtual cache
with aliasing problems. For example, if physical page 12 is mapped to
virtual page 16, it is serviced by CMMU0; if the kernel remaps it to
virtual page 17, it is serviced by CMMU1.

The aliasing problems can be solved by snooping all of memory, but this
is prohibitively expensive. The kernel can flush a page from cache
when freeing it, but a cache page flush takes a minimum of 256 cycles.
Two solutions that we're looking at are:

1) make the "software page cluster size" be 4 pages; that is, always
allocate 4 contiguous physical pages together. This turns it back into
a physical cache. On the downside, this makes for higher internal
fragmentation and wastes three-fourths of the PATC.

2) Maintain four separate lists of free pages, one for each of the four
values of , and allocate physical pages so that a page is always
serviced by the same CMMU. When there are no free pages in the right
list to back a new virtual page, allocate a page from some other list,
and flush it from the old CMMU. Credit for this idea goes to the
Motorola Unix kernel group in Tempe, Arizona.

Note that, when you want to enlarge a cache, you end up buying multiple
MMUs to go with your additional RAM. This is pretty pricey, but it can
provide well scheduled software with additional opportunities for
parallelism: since memory loads and stores are pipelined, a load from
one CMMU can wait on a page table walk while a load from a second CMMU
can be serviced from a cache hit.

On the whole, it's a neat part.

  -=- Andrew Klossner   (decvax!tektronix!tekecs!andrew)       [UUCP]
                        (andrew%tekecs.tek.com@relay.cs.net)   [ARPA]
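
For readers who want something concrete, here is a rough model of
solution 2 (the four free lists). It is only a sketch of one way to
read the scheme, not the Tektronix or Motorola kernel code. It assumes
4 KB pages, so that A12 and A13 are the two low-order bits of the page
number, which matches the page 16 / page 17 example above. Every name
in it (cmmu_for, FreePageLists, flushed) is made up for illustration,
and the real cache flush is replaced by a log entry.

```python
NUM_CMMUS = 4  # the post selects one of four CMMUs with A12 and A13


def cmmu_for(virtual_page):
    """CMMU that services a virtual page (assumes 4 KB pages, so A13:A12
    are the low two bits of the page number)."""
    return virtual_page % NUM_CMMUS


class FreePageLists:
    """Hypothetical allocator: one free list per CMMU."""

    def __init__(self, physical_pages):
        self.lists = [[] for _ in range(NUM_CMMUS)]
        # Assumption: never-used pages have no cached lines anywhere, so
        # seeding each list by the low bits of the physical page is arbitrary.
        for page in physical_pages:
            self.lists[page % NUM_CMMUS].append(page)
        # Stand-in for the real flush: records (physical_page, old_cmmu).
        self.flushed = []

    def allocate(self, virtual_page):
        """Return a physical page to back virtual_page."""
        wanted = cmmu_for(virtual_page)
        if self.lists[wanted]:
            # Same CMMU as last time: no flush needed.
            return self.lists[wanted].pop()
        # Right list is empty: take a page from another list and flush it
        # from the CMMU that serviced it before.
        for other in range(NUM_CMMUS):
            if self.lists[other]:
                page = self.lists[other].pop()
                self.flushed.append((page, other))
                return page
        raise MemoryError("no free physical pages")

    def free(self, physical_page, virtual_page):
        """Put a page back on the list of the CMMU that was servicing it."""
        self.lists[cmmu_for(virtual_page)].append(physical_page)
```

In this model a page that is freed and later reused under a virtual
page with the same A13:A12 value costs nothing extra; the flush is paid
only when the matching list has run dry.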