Path: utzoo!attcan!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!usc!rutgers!mcnc!rti!dg-rtp!siberia!hamilton
From: hamilton@siberia.rtp.dg.com (Eric Hamilton)
Newsgroups: comp.sys.m88k
Subject: Re: m88200 cache flushes on DG Aviion
Keywords: m88200 Aviion
Message-ID: <1990Dec18.154507.28370@dg-rtp.dg.com>
Date: 18 Dec 90 15:45:07 GMT
References: <2308@io.UUCP> <1199@dg.dg.com> <4322@photon.oakhill.UUCP>
Sender: usenet@dg-rtp.dg.com (Usenet Administration)
Reply-To: hamilton@siberia.rtp.dg.com (Eric Hamilton)
Organization: Data General Corporation, Research Triangle Park, NC
Lines: 177

Says Jim Klingshirn:
|> 
|> We periodically receive requests to add user level instructions
|> to generate cache control operations.  The justification generally
|> is centered around data caches - for instance how can you force data
|> out so that it is guaranteed to get out to the graphics frame buffer 
|> when the cache is in copyback mode.  To date, the only justification
|> I've heard for instruction cache control  is to support self 
|> modifying code (including breakpoints).  Assuming there are alternate
|> ways to support breakpoints - should we worry about instruction cache
|> control operations?
|> 
[Attributions below are based on other postings in this thread]

Yes, there is a need for instruction cache control operations,
and the need extends far beyond breakpoints (Piercarlo Grandi,
Dan Pierson, Dave Benjamin, and others).
Furthermore, the current memctl() call is
not a good answer, for two reasons.  First, it is slow, at least
in some implementations.  Second, as John Foderaro
points out, it has the wrong semantics for some applications, such as
Lisp garbage collection.

On multi-processor systems it is generally necessary to apply
the cache operation to all processors in the system (Alan
Langerman, with the approval of MP OS folks everywhere).
A user-level instruction is a very bad way
to implement such a feature, because it would require a path that
allows an instruction executed on one processor to affect the caches
of another processor in a way that can be safely used by multiple
processors simultaneously from user space without any locking or coordination
between each other or the OS.  A hardware implementation of this is
completely unreasonable (think about how two processors would simultaneously
issue page flush operations on different pages, and wait for the flushes to
complete, from user space, without deadlocking or tromping over each other's
commands).  And if some software support is required, then
the obvious way to get this support is to trap to the operating system,
which can make the right thing happen across all processors in the system.
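
To make the argument above concrete, here is a minimal sketch in C of what
the OS side might look like: one lock serializes cache operations, and the
request is applied on behalf of every processor.  Everything here is a toy
model invented for illustration (NCPU, cpu_cache_model, os_cache_sync_all);
a real kernel would drive each processor's CMMUs rather than bump a counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NCPU 4  /* hypothetical processor count, for illustration only */

/* Toy stand-in for one processor's caches: records what it was asked to do. */
struct cpu_cache_model {
    unsigned sync_requests;  /* number of synchronizations applied */
    uint32_t last_base;      /* base address of the most recent request */
    uint32_t last_len;       /* length of the most recent request */
};

static struct cpu_cache_model cpu_models[NCPU];

/* One system-wide lock: only one cache operation is in flight at a time,
 * so two requesters cannot trample each other's commands. */
static atomic_flag cache_op_lock = ATOMIC_FLAG_INIT;

/* Apply a synchronization request to every processor's caches.
 * Returns the number of processors covered. */
int os_cache_sync_all(uint32_t base, uint32_t len)
{
    assert(NCPU > 0);

    while (atomic_flag_test_and_set(&cache_op_lock)) {
        /* spin until the previous cache operation has completed */
    }

    for (int i = 0; i < NCPU; i++) {
        cpu_models[i].sync_requests++;
        cpu_models[i].last_base = base;
        cpu_models[i].last_len = len;
    }

    atomic_flag_clear(&cache_op_lock);
    return NCPU;
}
```

The point of the sketch is only that the locking and the walk over all
processors live in one trusted place, which is what user-level
instructions cannot provide.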

Thus, the problem is not how to get the right support for user cache operations
into the hardware.  The problem is to devise an OS trap which is fast enough
and has the right semantics.  Additional support from the 88000 hardware will
be required only if we cannot implement an OS trap with acceptable functionality
and performance.  This suggests two questions:

1) What is "acceptable functionality"?
2) What is "acceptable performance"?

Functionality:  I believe that John Foderaro's proposal to extend memctl()
with an option to bring the data and instruction caches into coherence
is excellent, but it doesn't go quite far enough.  We shouldn't use memctl()
because: 1) it requires that the length be a multiple of the page size, and
I'd like to be able to use the line-granular cache operations when possible;
and 2) it's a system call, and system call entry/exit overhead is appreciable;
I'd like the trap to be as fast as possible.  How about a new trap:

	r2 contains the base address
	r3 contains the length

	tb0 0,r0,<CacheSynchronizationTrap>

This will cause the data and instruction caches for the specified region (between
r2 and r2+r3-1, byte granular, no minimum length) to come into coherence,
so that the region can be safely executed.
If any byte within a four-byte word in this region is written,
then the subsequent execution of that word is
undefined until another CacheSynchronizationTrap that covers that word is
issued.  A length of zero is interpreted to mean all memory.

No error checking is necessary.  If the region contains invalid addresses,
nothing bad happens; the copyback/invalidate just becomes moot.
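
As a rough sketch of how a trap handler might turn the byte-granular
(r2, r3) request into line-granular cache operations, the C below rounds
the region outward to whole cache lines.  The 16-byte line size, the
names (line_span, span_for_request, lines_in_span), and the choice to
clamp a region that wraps past the top of the 32-bit address space are
all assumptions made for illustration; none of them is part of the
proposal itself.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: 16-byte cache lines. */
#define CACHE_LINE_BYTES 16u

/* Inclusive range of cache-line base addresses covering a request. */
struct line_span {
    uint32_t first_line;  /* base address of the first line to synchronize */
    uint32_t last_line;   /* base address of the last line to synchronize */
    int all_memory;       /* nonzero when the request had length zero */
};

/* base corresponds to r2 and length to r3 in the proposed trap. */
struct line_span span_for_request(uint32_t base, uint32_t length)
{
    const uint32_t mask = ~(uint32_t)(CACHE_LINE_BYTES - 1u);
    struct line_span s = {0u, 0u, 0};

    /* The masking below only works for power-of-two line sizes. */
    assert((CACHE_LINE_BYTES & (CACHE_LINE_BYTES - 1u)) == 0u);

    if (length == 0u) {
        /* A length of zero means all memory. */
        s.all_memory = 1;
        s.first_line = 0u;
        s.last_line = 0xFFFFFFFFu & mask;
        return s;
    }

    uint32_t last_byte = base + (length - 1u);
    if (last_byte < base) {
        /* Assumption: clamp a region that wraps around the address space. */
        last_byte = 0xFFFFFFFFu;
    }

    s.first_line = base & mask;
    s.last_line = last_byte & mask;
    return s;
}

/* Number of cache lines the handler would have to visit. */
uint32_t lines_in_span(struct line_span s)
{
    return (s.last_line - s.first_line) / CACHE_LINE_BYTES + 1u;
}
```

For example, a request with base 0x100C and length 8 covers bytes 0x100C
through 0x1013, which straddles two lines under the assumed line size,
so both would be copied back and invalidated.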

Performance:  The execution time for this trap will, of course, vary
according to the details of the system.  It would not be surprising to
discover that it takes twice as long on a system with two 88200s per
Pbus, for example.  Copyback times obviously will vary according to
the number of dirty lines that must be copied back.  Invalidating parts of
the instruction cache has a performance impact that goes beyond the time
required to do the invalidation.  We cannot control this time, but we can control
the overhead required to get into and out of the cache control operation.
How do people feel about a target of 10 clock cycles overhead?
How about 100? 200? 2000?

It is my belief that this trap can be made blindingly fast and that
the overhead will be small compared with the actual cost of doing the
cache manipulation.
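
One way to reason about the overhead targets above is to ask what
fraction of the total trap time is entry/exit overhead for a given
region.  The C sketch below does only that arithmetic; the function
name and the per-line cost parameter are hypothetical, and any numbers
fed to it are placeholders rather than measurements of a real system.

```c
#include <assert.h>
#include <stdint.h>

/* Fraction of total trap time spent on entry/exit overhead.
 *   overhead_cycles : cycles to get into and out of the trap
 *   per_line_cycles : placeholder cost to synchronize one cache line
 *   lines           : number of cache lines covered by the request
 * Returns a value in [0, 1]; 0 when the total cost is zero. */
double trap_overhead_fraction(uint32_t overhead_cycles,
                              uint32_t per_line_cycles,
                              uint32_t lines)
{
    double work = (double)per_line_cycles * (double)lines;
    double total = (double)overhead_cycles + work;

    assert(total >= 0.0);
    if (total == 0.0) {
        return 0.0;
    }
    return (double)overhead_cycles / total;
}
```

With placeholder inputs of 100 cycles of overhead, 10 cycles per line,
and 10 lines, overhead and cache work are equal, so the fraction is
one half; the interesting question is where real systems fall.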


How do people feel about this approach?  Is it promising enough to
justify the work of drafting a proposal and submitting it to 88Open?

