Path: utzoo!attcan!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!usc!rutgers!mcnc!rti!dg-rtp!siberia!hamilton
From: hamilton@siberia.rtp.dg.com (Eric Hamilton)
Newsgroups: comp.sys.m88k
Subject: Re: m88200 cache flushes on DG Aviion
Keywords: m88200 Aviion
Message-ID: <1990Dec18.154507.28370@dg-rtp.dg.com>
Date: 18 Dec 90 15:45:07 GMT
References: <2308@io.UUCP> <1199@dg.dg.com> <4322@photon.oakhill.UUCP>
Sender: usenet@dg-rtp.dg.com (Usenet Administration)
Reply-To: hamilton@siberia.rtp.dg.com (Eric Hamilton)
Organization: Data General Corporation, Research Triangle Park, NC
Lines: 177

Says Jim Klingshirn:
|>
|> We periodically receive requests to add user level instructions
|> to generate cache control operations. The justification generally
|> is centered around data caches - for instance how can you force data
|> out so that it is guaranteed to get out to the graphics frame buffer
|> when the cache is in copyback mode. To date, the only justification
|> I've heard for instruction cache control is to support self
|> modifying code (including breakpoints). Assuming there are alternate
|> ways to support breakpoints - should we worry about instruction cache
|> control operations?
|>

[Attributions below are based on other postings in this thread]

Yes, there is a need for instruction cache control operations, and the
need extends far beyond breakpoints (Piercarlo Grandi, Dan Pierson,
Dave Benjamin, and others).

Furthermore, the current memctl() call is not a good answer, for two
reasons. First, it is slow, at least in some implementations. Second,
as John Foderaro points out, it has the wrong semantics for some
applications, such as Lisp garbage collection.

On multi-processor systems it is generally necessary to apply the cache
operation to all processors in the system (Alan Langerman, with the
approval of MP OS folks everywhere).
A user-level instruction is a very bad way to implement such a feature,
because it would require a path that allows an instruction executed on
one processor to affect the caches of another processor, in a way that
can be safely used by multiple processors simultaneously from user
space without any locking or coordination between each other or the OS.
A hardware implementation of this is completely unreasonable (think
about how two processors would simultaneously issue page flush
operations on different pages, and wait for the flushes to complete,
from user space, without deadlocking or tromping over each other's
commands). And if some software support is required, then the obvious
way to get this support is to trap to the operating system, which can
make the right thing happen across all processors in the system.

Thus, the problem is not how to get the right support for user cache
operations into the hardware. The problem is to devise an OS trap which
is fast enough and has the right semantics. Additional support from the
88000 hardware will be required only if we cannot implement an OS trap
with acceptable functionality and performance.

This suggests two questions:

	1) What is "acceptable functionality"?
	2) What is "acceptable performance"?

Functionality:

I believe that John Foderaro's proposal to extend memctl() with an
option to bring the data and instruction caches into coherence is
excellent, but it doesn't go quite far enough. We shouldn't use
memctl() because:

	1) It requires that the length be a multiple of the page size,
	   and I'd like to be able to use the line-granular cache
	   operations when possible; and
	2) It's a system call; system call entry/exit overhead is
	   appreciable, and I'd like the trap to be as fast as possible.
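
To put a rough number on point 1, here is a minimal C sketch of how much
memory a cache operation has to cover when a small region is widened to
page granularity versus line granularity. The 4096-byte page and 16-byte
cache line are assumptions chosen for illustration, and covered_bytes()
is a hypothetical helper, not part of memctl() or any existing interface.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes for illustration only: 4 KB pages, 16-byte cache lines. */
#define PAGE_SIZE 4096u
#define LINE_SIZE 16u

/*
 * Hypothetical helper: number of bytes covered when the region
 * [base, base+len) is widened to whole multiples of `unit`.
 * `unit` must be a power of two and `len` must be nonzero.
 */
static uint32_t covered_bytes(uint32_t base, uint32_t len, uint32_t unit)
{
    assert(len != 0u);
    assert(unit != 0u && (unit & (unit - 1u)) == 0u);

    uint32_t first = base & ~(unit - 1u);              /* round start down */
    uint32_t last  = (base + len - 1u) & ~(unit - 1u); /* unit holding the last byte */
    return last - first + unit;
}
```

For a 24-byte patch at address 0x00012FF8, which straddles a page
boundary, covered_bytes(0x00012FF8, 24, PAGE_SIZE) gives 8192 while
covered_bytes(0x00012FF8, 24, LINE_SIZE) gives 32, under the assumed
sizes.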
How about a new trap:

	r2 contains the base address
	r3 contains the length
	tb0 0,r0,

Will cause the data and instruction caches for the specified region
(between r2 and r2+r3-1, byte granular, no minimum length) to come into
coherence, so that the region can be safely executed. If any byte
within a four-byte word in this region is written, the subsequent
execution of that word is undefined until another
CacheSynchronizationTrap that covers that word is issued.

A length of zero is interpreted to mean all memory.

No error checking is necessary. If the region contains invalid
addresses, nothing bad happens; the copyback/invalidate just becomes
moot.

Performance:

The execution time for this trap will, of course, vary according to the
details of the system. It would not be surprising to discover that it
takes twice as long on a system with two 88200s per Pbus, for example.
Copyback times obviously will vary according to the number of dirty
lines that must be copied back. Invalidating parts of the instruction
cache has a performance impact that goes beyond the time required to do
the invalidation.

We cannot control this time, but we can control the overhead required
to get into and out of the cache control operation. How do people feel
about a target of 10 clock cycles overhead? How about 100? 200? 2000?

It is my belief that this trap can be made blindingly fast and that the
overhead will be small compared with the actual cost of doing the cache
manipulation.

How do people feel about this approach? Is it promising enough to
justify the work of drafting a proposal and submitting it to 88Open?
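
The semantics proposed for the trap can be sketched as a toy model in C.
This is not an implementation of the trap; it only tracks which
four-byte words would be undefined to execute. All names (model_write,
model_cache_sync, model_safe_to_execute, NWORDS) are hypothetical, the
address space is shrunk to 64 words, and the model assumes that a
synchronization touching any byte of a word covers the whole word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy address space of NWORDS four-byte words (hypothetical size). */
#define NWORDS 64u

/* stale[i] is true when word i has been written since the last
   synchronization covering it, so executing it would be undefined. */
static bool stale[NWORDS];

/* Model a store of len bytes at addr: every word touched becomes stale. */
static void model_write(uint32_t addr, uint32_t len)
{
    assert(len > 0u && addr + len <= NWORDS * 4u);
    for (uint32_t w = addr / 4u; w <= (addr + len - 1u) / 4u; w++)
        stale[w] = true;
}

/* Stand-in for the proposed trap: base as in r2, len as in r3.
   A length of zero means all memory. Addresses outside the toy
   address space are silently ignored (no error checking). */
static void model_cache_sync(uint32_t base, uint32_t len)
{
    if (len == 0u) {
        memset(stale, 0, sizeof stale);
        return;
    }
    uint64_t last_byte = (uint64_t)base + len - 1u;
    for (uint64_t w = base / 4u; w <= last_byte / 4u && w < NWORDS; w++)
        stale[w] = false;
}

/* True when the word containing addr may be executed safely. */
static bool model_safe_to_execute(uint32_t addr)
{
    return addr / 4u < NWORDS && !stale[addr / 4u];
}
```

In this model a three-byte store at address 10 makes the words at 8 and
12 unsafe; model_cache_sync(8, 4) restores only the word at 8, and
model_cache_sync(0, 0) restores everything.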