Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!wuarchive!rex!uflorida!gatech!mcnc!rti!dg-rtp!siberia!hamilton
From: hamilton@siberia.rtp.dg.com (Eric Hamilton)
Newsgroups: comp.sys.m88k
Subject: Read/write/execute Proposal
Message-ID: <1990Dec21.201522.16487@dg-rtp.dg.com>
Date: 21 Dec 90 20:15:22 GMT
Sender: usenet@dg-rtp.dg.com (Usenet Administration)
Reply-To: hamilton@siberia.rtp.dg.com (Eric Hamilton)
Organization: Data General Corporation, Research Triangle Park, NC
Lines: 151

In a previous posting, Jim Klingshirn asked about the need for
user instructions to generate cache control operations.  I
argued that a better and more implementable solution is to
supply a user trap, fielded by the operating system, which
will apply the appropriate cache operations.  I did not discuss
the details of how such a trap would be used, nor what the
implications for virtual memory and multiple
processors would be.  In other words, I assiduously avoided the
problem of code that is dynamically generated or moved, the problem
that Sean Foderaro refers to as "read-write-execute" data.
This posting proposes a solution to that problem.  The reader
should refer to Foderaro's posting for a discussion of why the
existing memctl() function is not a solution to this problem.

[In this context, an "icache" is an instruction CMMU and a "dcache"
 is a data CMMU.]

The problem:
In general, 88000 icaches do not snoop and instruction fetches are not
marked global.  Note the phrase "in general"; some vendors may build
systems in which either or both of these statements are not true, but in
general we shouldn't require that all current and future 88000s support
these capabilities, which are not free.  Thus, the icaches are not
generally coherent when instructions in memory are changed.

There are two relevant ways in which instructions can change.
First, there is the normal activity of the kernel virtual memory manager,
which pages code in as necessary, loads code from disk or across the
net, and generally moves instructions around in page-sized chunks.
Second, user programs may want to modify a region of memory and then try to
execute it.  This is a legitimate functional requirement; see the postings
by Sean Foderaro, Piercarlo Grandi, David Benjamin, and others for examples.

Because the icaches are not coherent, whenever any executable memory is
changed, it is necessary to invalidate (some portion of) the icache
and force (some portion of) the dcache to copyback modified data.  The
invalidate eliminates stale data from the instruction cache; the copyback
ensures that non-global instruction cache fills will not fetch stale
data from memory.


Multiprocessor examples:
In a multiprocessor system it is generally necessary to invalidate/copyback
all processors' caches.  This is because the offending data may be stale
in any processor's cache.  For example:

	Process A on processor 1 pages the instructions a1 into page p1.
	Process A executes a1 on processor 1.
	Process A is rescheduled onto processor 2 and executes a1.
	Process A is rescheduled onto processor 3 and executes a1.
	Process A terminates.
	Process B on processor 2 page faults on non-resident instruction b1.
	The virtual memory manager decides to fill page p1 with b1.
	A network demon decides to fetch b1 from a remote executable.
	The network demon is scheduled onto processor 1.
	The network demon starts copying b1 into p1.
	The network demon is rescheduled onto processor 3.
	The network demon is rescheduled onto processor 2.
	The network demon completes the pagein.
	Process B now resumes execution.

At this point, the instruction caches of processors 1, 2, and 3 incoherently
believe that page p1 contains a1, as does main memory.  The correct value,
b1, can be found in the data cache of one or more of the processors on which
the network demon executed.  It should be obvious that every icache must
invalidate and every dcache must copyback page p1 before B can safely
resume execution.
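
The end state of that example can be replayed in a toy model.  This is
only a sketch under stated assumptions: page p1 is reduced to three
cache lines, each line holds a single letter ('a' for a1, 'b' for b1),
the original pagein of a1 is assumed to have reached main memory
already, and every name in it (struct machine, ifetch, dstore,
sync_page_all_cpus, replay_example) is hypothetical rather than any
real kernel interface.

```c
#include <assert.h>

#define NCPU  4   /* processors 0..3                              */
#define NLINE 3   /* page p1 modelled as three cache lines        */

struct machine {
    char mem[NLINE];          /* main memory copy of p1            */
    char ic[NCPU][NLINE];     /* icache copies, 0 = not present    */
    char dc[NCPU][NLINE];     /* dcache copies, 0 = not present    */
    int  dirty[NCPU][NLINE];  /* dcache line modified, not in mem  */
};

/* Execute line ln on cpu.  A non-snooping icache fills from memory
 * only on a miss, so a stale cached line is returned unchanged. */
static char ifetch(struct machine *m, int cpu, int ln)
{
    assert(cpu >= 0 && cpu < NCPU && ln >= 0 && ln < NLINE);
    if (m->ic[cpu][ln] == 0)
        m->ic[cpu][ln] = m->mem[ln];
    return m->ic[cpu][ln];
}

/* Store to line ln on cpu.  A copyback dcache keeps the new value
 * and leaves main memory untouched. */
static void dstore(struct machine *m, int cpu, int ln, char v)
{
    assert(cpu >= 0 && cpu < NCPU && ln >= 0 && ln < NLINE);
    m->dc[cpu][ln] = v;
    m->dirty[cpu][ln] = 1;
}

/* On every processor: copy back each dirty dcache line of the page
 * and invalidate each icache line of the page. */
static void sync_page_all_cpus(struct machine *m)
{
    for (int cpu = 0; cpu < NCPU; cpu++)
        for (int ln = 0; ln < NLINE; ln++) {
            if (m->dirty[cpu][ln]) {
                m->mem[ln] = m->dc[cpu][ln];
                m->dirty[cpu][ln] = 0;
            }
            m->ic[cpu][ln] = 0;
        }
}

/* Build the state that exists just before process B resumes. */
static struct machine replay_example(void)
{
    struct machine m = {0};
    for (int ln = 0; ln < NLINE; ln++)
        m.mem[ln] = 'a';                 /* a1 is in memory (assumed) */
    for (int cpu = 1; cpu <= 3; cpu++)   /* icaches 1, 2, 3 load a1   */
        for (int ln = 0; ln < NLINE; ln++)
            (void)ifetch(&m, cpu, ln);
    dstore(&m, 1, 0, 'b');   /* demon starts the copy on processor 1 */
    dstore(&m, 3, 1, 'b');   /* continues on processor 3             */
    dstore(&m, 2, 2, 'b');   /* finishes on processor 2              */
    return m;
}
```

In this model an instruction fetch on processor 2 still returns 'a'
after replay_example(), and returns 'b' only once sync_page_all_cpus()
has run.  Synchronizing a single processor is not enough, because the
dirty lines sit in three different dcaches.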

Similar examples can be constructed for user-modified code, even if
only one process is involved.  Indeed, the problem is the
same: the offending writes, punctuated by reschedules, may have occurred
on several processors, instruction fetches may have occurred on
several processors, all of the instruction caches may be more or
less incoherent and the correct data may be scattered through
several data caches.


What this means for rwx pages:
There are two conclusions that follow from the discussion above.

1) The operating system must know, at page replacement time, that a
   particular page is potentially executable, and if it is, must
   issue to every processor a dcache copyback and an icache invalidate
   for that page.

2) If a user tries to execute data, a dcache copyback and an icache
   invalidate must be issued to every processor for the data area in
   question, after the data is modified and before it is executed.
   This is exactly the same operation that the operating system
   must perform during page replacement, so it is trivially true that the
   necessary hardware support is present.

This is not a problem for the operating system, which has more or less
direct access to the cache hardware and controls page replacement.
It is a problem for user rwx pages, for two reasons.  First,
user code has no direct access to the cache hardware.  Second, the
OS virtual memory manager must somehow be notified that a data page
is potentially executable, so that it can page it in correctly.

If these two problems can be overcome, there is no reason why
read-write-execute pages cannot be made to work on any current or
future 88000 processors, in uni-processor or multi-processor systems.

Proposal:
I propose that we support read-write-execute pages by defining mechanisms
that user applications may invoke to identify potentially executable
data and to provoke cache writebacks and invalidates as necessary.

I have already proposed a cache manipulation operation in a previous
posting to comp.sys.m88k:
>
>	r2 contains the base address
>	r3 contains the length
>
>	tb0 0,r0,<CacheSynchronizationTrap>
>
> Will cause the data and instruction caches for the specified region (between
> r2 and r2+r3-1, byte granular, no minimum length) to come into coherence,
> so that that region can be safely executed.
> If any byte within a four-byte word in this region is written,
> the subsequent execution of that word is
> undefined until another CacheSynchronizationTrap that covers that word is
> issued.  A length of zero is interpreted to mean all memory.
>
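
The interface is byte granular, but whatever fields the trap can only
act on whole cache lines.  As a rough illustration of how the (r2, r3)
arguments might be turned into a span of lines, here is a sketch.  The
16-byte line size is an assumption to be adjusted per CMMU, and
struct line_span and trap_region_to_lines are hypothetical names, not
part of any existing kernel.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 16u   /* assumed cache line size, a power of two */

struct line_span {
    uint32_t first;   /* address of the first line to synchronize  */
    uint32_t count;   /* number of lines; 0 means "all of memory"  */
};

/* Translate the trap arguments (base in r2, length in r3) into a span
 * of whole cache lines.  A length of zero requests all memory and is
 * reported as first == 0, count == 0. */
static struct line_span trap_region_to_lines(uint32_t base, uint32_t length)
{
    struct line_span s = {0, 0};
    if (length == 0)
        return s;
    uint32_t last = base + (length - 1u);   /* last byte, inclusive   */
    assert(last >= base);                   /* region must not wrap   */
    s.first = base & ~(LINE_BYTES - 1u);    /* round down to a line   */
    s.count = (last / LINE_BYTES) - (base / LINE_BYTES) + 1u;
    return s;
}
```

A four-byte region starting at 0x100E, for instance, straddles two
lines and so yields first = 0x1000 and count = 2.  The handler would
then copy back and invalidate each of those lines on every processor.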

We also need some way to notify the kernel that a piece of storage is
potentially executable.  The following mechanisms come to mind:

	- Add a MCT_RWX (state 4) argument to memctl().  When an area
	  is memctl'd to MCT_RWX the operating system must treat it as
	  potentialy executable for paging purposes.  This is probably
	  the solution of choice in the BCS world.
	- Use mprotect() in the V.4 world for the same purpose.
	- Add bits to the executable format to indicate that stack extensions
	  and/or sbrk() extensions should be treated as potentially
	  executable.  This would be done as well as, not instead of, the
	  memctl/mprotect thing.

Note that the MCT_RWX memctl operation has exactly the interface, but not
the semantics, proposed by Foderaro.  It does not necessarily do any
cache manipulation at all; it merely notifies the virtual memory manager
that some pageins will in the future require special treatment.

For example, a LISP interpreter might choose to use the MCT_RWX memctl()
option to mark its entire heap as read-write-execute.  This would be
done once.  Whenever code was dynamically compiled into the heap,
and whenever code was moved by the garbage collector, the
CacheSynchronizationTrap would be issued by the application to bring
the instruction caches back into coherence.  Whenever the virtual
memory manager paged any part of the heap, it would recognize the
read-write-execute state and properly invalidate the instruction caches.
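
The shape of that usage pattern can be sketched in C.  Neither the
MCT_RWX memctl() nor the CacheSynchronizationTrap exists as a callable
interface here, so this sketch substitutes stand-ins: POSIX mprotect()
for the one-time marking, and the GCC/Clang builtin
__builtin___clear_cache() for the per-modification synchronization.
struct code_heap, heap_init_rwx and heap_emit are hypothetical names,
and the sketch assumes a system that permits read-write-execute
mappings.  It copies code into the heap but never jumps to it.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

struct code_heap {
    unsigned char *base;   /* start of the heap          */
    size_t size;           /* total bytes in the heap    */
    size_t used;           /* bytes already handed out   */
};

/* Done once: obtain the heap and declare it read-write-execute.
 * mprotect() stands in for the proposed MCT_RWX memctl().
 * Returns 0 on success, -1 on failure. */
static int heap_init_rwx(struct code_heap *h, size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    if (mprotect(p, size, PROT_READ | PROT_WRITE | PROT_EXEC) != 0) {
        munmap(p, size);
        return -1;
    }
    h->base = p;
    h->size = size;
    h->used = 0;
    return 0;
}

/* Done after every compile or garbage-collector move: copy the code
 * in, then synchronize exactly the bytes written before anything may
 * execute them.  __builtin___clear_cache() stands in for the trap.
 * Returns the address of the copy, or NULL if the heap is full. */
static void *heap_emit(struct code_heap *h, const void *code, size_t len)
{
    assert(h != NULL && code != NULL);
    if (len > h->size - h->used)
        return NULL;
    unsigned char *dst = h->base + h->used;
    memcpy(dst, code, len);
    __builtin___clear_cache((char *)dst, (char *)dst + len);
    h->used += len;
    return dst;
}
```

The marking happens once per heap; the synchronization happens once
per modification and covers only the bytes just written, which keeps
the expensive cross-processor work proportional to the code actually
generated or moved.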