Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!cs.utexas.edu!samsung!aplcen!mef
From: mef@aplcen.apl.jhu.edu (Marty Fraeman)
Newsgroups: comp.arch
Subject: Re: Software modularity vs. instruction locality
Message-ID: <3887@aplcen.apl.jhu.edu>
Date: 15 Nov 89 16:10:18 GMT
References: <17707@watdragon.waterloo.edu> <23604@cup.portal.com> <6374@dime.cs.umass.edu>
Reply-To: mef@aplcen (Marty Fraeman)
Distribution: na
Organization: Johns Hopkins University
Lines: 49

In article <6374@dime.cs.umass.edu> shri@ccs1.cs.umass.edu (H.Shrikumar{shri@ncst.in}) writes:
>In article <1TMk2X#Qggn6=eric@snark.uu.net> eric@snark.uu.net (Eric S. Raymond) 
>writes:
>>In <1989Nov4.004529.10049@ico.isc.com> Dick Dunn wrote:
>>>                                     Second, I would expect better locality
>>> for code reference than for data reference, hence the I cache ought to do
>>> more good than the D cache.  Aren't the pathological cache-busting programs
>>> generally ones which spray data accesses all over the place?
>>
>>Not necessarily. There's a subtle problem here; good software modularity
>>practices tend to hurt code locality. If you're calling subroutines a lot
>>in generated code the PC jumps all over the shop.
>
>This happens for example in a FORTH machine, FORTH typically is
>subroutine threaded, so there is a flurry of subroutine calls
>happening at about 4 million a second. (in a 8-10 Mhz (?) Novix 2016 
>Forth CPU).
>
>In forth there is a subroutine call every five or so instructions
>I would guess.

We have looked at a similar issue in Forth.  Over 90% of the sequential
runs of instruction fetches (straight-line stretches between control
transfers) are less than 6.25 instructions long on the SC32 Forth
engine.  This machine can execute most Forth primitives with a single
one-cycle instruction; the exceptions are load and store, which take
two cycles.  Subroutine calls take one cycle, and most returns take
zero cycles, since they can generally be combined with another
instruction.
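
For anyone who wants to try this kind of measurement on their own
traces, here is a rough sketch in Python.  It is not our actual
tooling: the trace format (a list of word-addressed program counter
values) and the function names are assumptions made up for
illustration.

```python
def sequential_run_lengths(trace, step=1):
    """Lengths of maximal runs of consecutive instruction addresses.

    `trace` is a hypothetical list of fetched instruction addresses;
    `step` is the address increment between sequential instructions.
    """
    runs = []
    if not trace:
        return runs
    length = 1
    for prev, cur in zip(trace, trace[1:]):
        if cur == prev + step:
            length += 1           # still falling through sequentially
        else:
            runs.append(length)   # a call, return, or branch ended the run
            length = 1
    runs.append(length)
    return runs


def fraction_shorter_than(runs, limit):
    """Fraction of runs strictly shorter than `limit` instructions."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if r < limit) / len(runs)


# Made-up trace: three words, a call, two words, a return, and so on.
trace = [100, 101, 102, 200, 201, 100, 101, 102, 103, 300]
runs = sequential_run_lengths(trace)
print(runs, fraction_shorter_than(runs, 4))
```

Feeding a real trace through `fraction_shorter_than` with different
limits gives the cumulative run-length distribution.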

We also looked at the effectiveness of instruction caches on this
machine and found that fairly small caches (under 16KB) could still
achieve hit rates above 95%.  However, since the programs we studied
were fairly modest in size, our I-cache size result should be taken
with a grain of salt.  On the other hand, the programs were comparable
in size to single threads in typical real-time applications we've
developed in the past, so I believe there is some significance to our
data.
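
A minimal sketch of the sort of trace-driven cache simulation that
produces a hit-rate figure follows.  The direct-mapped organization,
the line size, and all the names are assumptions for illustration
only, not a description of the cache we studied.

```python
def icache_hit_rate(trace, num_lines, words_per_line):
    """Hit rate of a direct-mapped instruction cache over `trace`.

    `trace` is a hypothetical list of word-addressed instruction
    fetches.  The cache holds num_lines * words_per_line words.
    """
    if not trace:
        return 0.0
    tags = [None] * num_lines          # block number held by each line
    hits = 0
    for addr in trace:
        block = addr // words_per_line
        index = block % num_lines      # direct-mapped placement
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block        # miss: fill the line
    return hits / len(trace)


# Made-up loop that calls one short routine twice.
loop = [0, 1, 2, 3, 20, 21] * 2
print(icache_hit_rate(loop, num_lines=4, words_per_line=4))

# Same shape, but the routine maps onto the caller's cache line.
clash = [0, 1, 2, 3, 64, 65] * 2
print(icache_hit_rate(clash, num_lines=4, words_per_line=4))
```

The second trace hits less often than the first even though both
programs are tiny, because the caller and the routine evict each
other.  Small test programs can hide or exaggerate conflicts like
this one.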

As a final comment on I-cache effectiveness in Forth, keep in mind that
while Forth instruction traces hop all over the place, the hierarchical
nature of most Forth implementations keeps code size much smaller than
usual, so a small cache can hold a larger share of the program.

	Marty Fraeman

	mef@aplcen.apl.jhu.edu
	301-953-5000, x8360

	JHU/Applied Physics Laboratory
	Johns Hopkins Road
	Laurel, Md. 20707