Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!wuarchive!rex!uflorida!gatech!mcnc!rti!dg-rtp!siberia!hamilton
From: hamilton@siberia.rtp.dg.com (Eric Hamilton)
Newsgroups: comp.sys.m88k
Subject: Read/write/execute Proposal
Message-ID: <1990Dec21.201522.16487@dg-rtp.dg.com>
Date: 21 Dec 90 20:15:22 GMT
Sender: usenet@dg-rtp.dg.com (Usenet Administration)
Reply-To: hamilton@siberia.rtp.dg.com (Eric Hamilton)
Organization: Data General Corporation, Research Triangle Park, NC
Lines: 151

In a previous posting, Jim Klingshirn asked about the need for user
instructions to generate cache control operations. I argued that a
better and more implementable solution is to supply a user trap, fielded
by the operating system, which will apply the appropriate cache
operations.

I did not discuss the details of how such a trap would be used, nor what
the implications for virtual memory and multiple processors would be.
In other words, I assiduously avoided the problem of code that is
dynamically generated or moved, the problem that Sean Foderaro refers to
as "read-write-execute" data.

This posting proposes a solution to that problem. The reader should
refer to Foderaro's posting for a discussion of why the existing
memctl() function is not a solution to this problem.

[In this context, an "Icache" is an instruction CMMU and a "Dcache" is
a data CMMU.]

The problem:

In general, 88000 icaches do not snoop and instruction fetches are not
marked global. Note the phrase "in general"; some vendors may build
systems in which either or both of these statements are not true, but in
general we shouldn't require that all current and future 88000s support
these capabilities, which are not free.

Thus, the icaches are not generally coherent when instructions in memory
are changed. There are two relevant ways in which instructions can change.
First, there is the normal activity of the kernel virtual memory
manager, which pages code in as necessary, loads code from disk or
across the net, and generally moves instructions around in page-sized
chunks. Second, user programs may want to modify a region of memory and
then try to execute it. This is a legitimate functional requirement;
see the postings by Sean Foderaro, Piercarlo Grandi, David Benjamin, and
others for examples.

Because the icaches are not coherent, whenever any executable memory is
changed, it is necessary to invalidate (some portion of) the icache and
force (some portion of) the dcache to copyback modified data. The
invalidate eliminates stale data from the instruction cache; the
copyback ensures that non-global instruction cache fills will not fetch
stale data from memory.

Multiprocessor examples:

In a multiprocessor system it is generally necessary to
invalidate/copyback all processors' caches. This is because the
offending data may be stale in any processor's cache. For example:

	Process A on processor 1 pages the instructions a1 into page p1.
	Process A executes a1 on processor 1.
	Process A is rescheduled onto processor 2 and executes a1.
	Process A is rescheduled onto processor 3 and executes a1.
	Process A terminates.
	Process B on processor 2 page faults on non-resident instruction b1.
	The virtual memory manager decides to fill page p1 with b1.
	A network demon decides to fetch b1 from a remote executable.
	The network demon is scheduled onto processor 1.
	The network demon starts copying b1 into p1.
	The network demon is rescheduled onto processor 3.
	The network demon is rescheduled onto processor 2.
	The network demon completes the pagein.
	Process B now resumes execution.

At this point, the instruction caches of processors 1, 2, and 3
incoherently believe that page p1 contains a1, as does main memory. The
correct value, b1, can be found in the data cache of one or more of the
processors on which the network demon executed.
It should be obvious that every icache must invalidate and every dcache
must copyback page p1 before B can safely resume execution.

Similar examples can be constructed for user-modified code, even if only
one process is involved. Indeed, the problem is the same: the offending
writes, punctuated by reschedules, may have occurred on several
processors, instruction fetches may have occurred on several processors,
all of the instruction caches may be more or less incoherent, and the
correct data may be scattered through several data caches.

What this means for rwx pages:

There are two conclusions that follow from the discussion above.

1) The operating system must know, at page replacement time, that a
   particular page is potentially executable, and if it is, must issue
   to every processor a dcache copyback and an icache invalidate for
   that page.

2) If a user tries to execute data, a dcache copyback and an icache
   invalidate must be issued to every processor for the data area in
   question, after the data is modified and before it is executed.
   This is exactly the same operation that the operating system must
   perform during page replacement, so it is trivially true that the
   necessary hardware support is present.

This is not a problem for the operating system, which has more or less
direct access to the cache hardware and controls page replacement.

It is a problem for user rwx pages, for two reasons. First, user code
has no direct access to the cache hardware. Second, the OS virtual
memory manager must somehow be notified that a data page is potentially
executable, so that it can page it in correctly.

If these two problems can be overcome, there is no reason why
read-write-execute pages cannot be made to work on any current or future
88000 processors, in uni-processor or multi-processor systems.
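The multiprocessor example can be replayed in a toy model. What follows
is only a sketch under stated assumptions, not a description of real
CMMU behaviour: page p1 is reduced to a single word, the data caches are
assumed to be copyback and coherent with one another, the instruction
caches never snoop, and instruction cache fills read main memory only.
Every name in it (machine_t, store, ifetch, sync_all, run_example) is
made up for the illustration.

```c
#include <assert.h>

#define NCPU 4                      /* processors 0..3, as in the example */

enum { A1 = 0xA1, B1 = 0xB1 };      /* stand-ins for the two code images */

typedef struct {
    int valid;                      /* cache holds a copy of page p1 */
    int dirty;                      /* dcache only: copy is newer than memory */
    int value;
} line_t;

typedef struct {
    int    memory;                  /* main-memory contents of page p1 */
    line_t icache[NCPU];            /* instruction caches: no snooping */
    line_t dcache[NCPU];            /* data caches: copyback (assumed coherent) */
} machine_t;

/* Store to p1 on one processor. Other dcache copies are dropped;
 * main memory and every icache are left untouched. */
void store(machine_t *m, int cpu, int value)
{
    assert(cpu >= 0 && cpu < NCPU);
    for (int i = 0; i < NCPU; i++)
        m->dcache[i].valid = m->dcache[i].dirty = 0;
    m->dcache[cpu] = (line_t){ 1, 1, value };
}

/* Instruction fetch: hit in the local icache, else fill from memory.
 * The fill is non-global, so dirty dcache data is never consulted. */
int ifetch(machine_t *m, int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    if (!m->icache[cpu].valid)
        m->icache[cpu] = (line_t){ 1, 0, m->memory };
    return m->icache[cpu].value;
}

/* Copy back every dcache and invalidate every icache for p1. */
void sync_all(machine_t *m)
{
    for (int i = 0; i < NCPU; i++) {
        if (m->dcache[i].valid && m->dcache[i].dirty) {
            m->memory = m->dcache[i].value;
            m->dcache[i].dirty = 0;
        }
        m->icache[i].valid = 0;
    }
}

/* Replays the example and returns what processor 2 fetches from p1
 * when process B resumes. */
int run_example(int sync_before_resume)
{
    machine_t m = { 0 };

    store(&m, 1, A1);               /* a1 is paged into p1 on processor 1 */
    sync_all(&m);                   /* assume that pagein was done correctly */
    ifetch(&m, 1);                  /* process A executes a1 ... */
    ifetch(&m, 2);
    ifetch(&m, 3);                  /* ... on processors 1, 2 and 3 */

    store(&m, 1, B1);               /* network demon copies b1 into p1 */
    store(&m, 3, B1);               /* after a reschedule onto processor 3 */
    store(&m, 2, B1);               /* pagein completes on processor 2 */

    if (sync_before_resume)
        sync_all(&m);
    return ifetch(&m, 2);           /* process B resumes on processor 2 */
}
```

In this model run_example(0) returns A1, the stale instructions, and
run_example(1) returns B1. Without the sync, a processor that never
executed a1 fares no better: its icache fill reads main memory, which
still holds a1.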
Proposal:

I propose that we support read-write-execute pages by defining
mechanisms that user applications may invoke to identify potentially
executable data and to provoke cache writebacks and invalidates as
necessary.

I have already proposed a cache manipulation operation in a previous
posting to comp.sys.m88k:

>
>	r2 contains the base address
>	r3 contains the length
>
>	tb0 0,r0,
>
> Will cause the data and instruction caches for the specified region (between
> r2 and r2+r3-1, byte granular, no minimum length) to come into coherence,
> so that that region can be safely executed.
> If any byte within a four-byte word in this region is written,
> the subsequent execution of that word is
> undefined until another CacheSynchronizationTrap that covers that word is
> issued. A length of zero is interpreted to mean all memory.
>

We also need some way to notify the kernel that a piece of storage is
potentially executable. The following mechanisms come to mind:

- Add an MCT_RWX (state 4) argument to memctl(). When an area is
  memctl'd to MCT_RWX the operating system must treat it as potentially
  executable for paging purposes. This is probably the solution of
  choice in the BCS world.

- Use mprotect() in the V.4 world for the same purpose.

- Add bits to the executable format to indicate that stack extensions
  and/or sbrk() extensions should be treated as potentially executable.
  This would be done as well as, not instead of, the memctl/mprotect
  thing.

Note that the MCT_RWX memctl operation has exactly the interface, but
not the semantics, proposed by Foderaro. It does not necessarily do any
cache manipulation at all; it merely notifies the virtual memory manager
that some pageins will in the future require special treatment.

For example, a LISP interpreter might choose to use the MCT_RWX memctl()
option to mark its entire heap as read-write-execute. This would be
done once.
Whenever code was dynamically compiled into the heap, and whenever code
was moved by the garbage collector, the CacheSynchronizationTrap would
be issued by the application to bring the instruction caches back into
coherence. Whenever the virtual memory manager paged any part of the
heap, it would recognize the read-write-execute state and issue the
data cache copyback and instruction cache invalidate that page
replacement of potentially executable pages requires.
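The LISP scenario might look roughly like this from the application
side. This is a sketch only. stub_mark_rwx() stands in for the proposed
MCT_RWX notification and stub_cache_sync() for the
CacheSynchronizationTrap; both are hypothetical placeholders that merely
record their arguments, since neither the real memctl() signature nor
the trap vector is given here. The heap layout and the helper names are
likewise invented for the illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Record of what the application asked the system to do. */
typedef struct {
    int    rwx_calls;       /* times the heap was declared rwx */
    int    sync_calls;      /* times a cache synchronization was requested */
    void  *last_base;       /* base address (r2 in the proposal) */
    size_t last_len;        /* length in bytes (r3 in the proposal) */
} call_log_t;

call_log_t call_log;

/* Hypothetical placeholder for the MCT_RWX notification. */
void stub_mark_rwx(void *base, size_t len)
{
    (void)base;
    (void)len;
    call_log.rwx_calls++;
}

/* Hypothetical placeholder for the CacheSynchronizationTrap. */
void stub_cache_sync(void *base, size_t len)
{
    call_log.sync_calls++;
    call_log.last_base = base;
    call_log.last_len  = len;
}

#define HEAP_WORDS 1024
static uint32_t heap[HEAP_WORDS];

/* Done once at startup: declare the whole heap potentially executable. */
void heap_init(void)
{
    memset(&call_log, 0, sizeof call_log);
    stub_mark_rwx(heap, sizeof heap);
}

/* Dynamic compilation: write n instruction words at word offset off,
 * then synchronize exactly the region written. Returns NULL if the
 * region does not fit in the heap. */
uint32_t *emit_code(size_t off, const uint32_t *words, size_t n)
{
    if (off > HEAP_WORDS || n > HEAP_WORDS - off)
        return NULL;
    if (n == 0)             /* a zero length would mean "all memory" */
        return &heap[off];
    memcpy(&heap[off], words, n * sizeof *words);
    stub_cache_sync(&heap[off], n * sizeof *words);
    return &heap[off];
}

/* Garbage collector move: copy n words from src to dst (word offsets),
 * then synchronize the destination before it can be executed. */
uint32_t *move_code(size_t dst, size_t src, size_t n)
{
    if (dst > HEAP_WORDS || src > HEAP_WORDS ||
        n > HEAP_WORDS - dst || n > HEAP_WORDS - src)
        return NULL;
    if (n == 0)
        return &heap[dst];
    memmove(&heap[dst], &heap[src], n * sizeof heap[0]);
    stub_cache_sync(&heap[dst], n * sizeof heap[0]);
    return &heap[dst];
}
```

The point of the split is that the rwx notification happens once per
heap, while the synchronization request is repeated for every region
that is written and later executed. The explicit check for n == 0 is
there because the proposal gives a zero length the meaning "all memory",
which is rarely what a code emitter intends.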