Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ames!lll-winken!uunet!mcvax!prlb2!kulcs!bimandre
From: bimandre@kulcs.uucp (Andre Marien)
Newsgroups: comp.lang.prolog
Subject: Inline expansion versus threaded code
Message-ID: <1654@kulcs.kulcs.uucp>
Date: 28 Apr 89 06:55:14 GMT
Reply-To: bimandre@kulcs.UUCP ()
Organization: B.I.M., Belgium
Lines: 269

(This reply became rather long ...)

First, a summary of responses in chronological order.

: It seems obvious to me that threaded code will page much more.  Inline code
: will avoid calls to potentially far-away memory locations that require paging
: in memory.  ... Just my intuitions ...
: Bruce Krulwich


: You may want to look into Envos, a recent Xerox spin-off marketing Xerox
: AI technology.  They have ported Interlisp-D, which runs on Xerox
: D-machines (microcoded Bytecode), to the Sun platform by writing a
: Bytecode emulator.  They apparently get quite good performance (emulator
: fits in 64K cache) and extremely dense code (5Meg for a full system vs.
: 30-40Meg for similar Symbolics image).  Native code would be faster, but
: perhaps not significantly, and would cause very significant code
: expansion. ...
: Johannes A. G. M. Koomen


: I'm shure You know it, but however: E.Tick Memory Performance of Prolog
: Architectures, Kluwers.
: ... for very short, what is the essence of this book: the WAM is not bad,
: but its only pitfall is the great memory throughput because of
: shallow backtracking ... the primary problem is __PAGING__!!!
: ...
: The biggest problem is not inline or not but the abstract machine. If it 
: has inherent failures inline expansion will have no great effects. And
: there lies the difference between Prolog and other programming languages.
: For the WAM inline expansion is indeed very attractive for determinate
: parts of programs. But when shallow backtracking is present the savings of 
: the instruction fetches are neglectable. When designing a Prolog-machine
: the primary concern should be the locality of memory references, inline
: expansion should be seen as an improvement for some higly used code. If one
: considers big Prolog applications (eg TWAICE) one sees clearly that inline
: code can be used for the smallest parts only. 
: Ulrich Neumerkel


: Our experience with large Prolog and C programs for natural-language
: analysis, grammar analysis, simulation and speech understanding is
: that soon after your program starts paging (even on an empty 32MB
: Sun-4/280 with local SMD disk), you may as well forget about it. The
: problem is that all of those applications create or use very large
: relations over complex terms leading to essentially random access to
: most of program memory for Prolog or data area for C.
: ...
: It doesn't matter much to us that native code might be
: between 1.5 and 3 times faster than threaded code on small or
: well-behaved benchmarks, since native code would be too bulky to run
: at all!
: ...
: Fernando Pereira
:
: My figures for native vs. threaded code came from comparing Quintus
: Prolog with two commercial native code compilers for Prolog two years
: ago, and from discussions with Richard O'Keefe about Prolog
: compilation. Native code compilation techniques may have improved
: since then.
: The problem of access to rules is more complicated in parsers that
: do not use the rules directly but instead use some kind of parsing
: tables derived from the rules. Usually, the tables are much larger
: than the original rule set.
: ...
: Fernando Pereira


: When I was visiting SICS over the northern winter I ran the MTS natural
: language system, written by Xiuming Huang, under Sicstus Prolog using
: the bytecode emulator and the (new) native code system.  I have to agree
: that paging really kills the system.
: ...
: About half the code in MTS is complex dcg rules and about half is the
: lexicon (large sets of facts).  I'm not sure how the size and speed
: ratios compare with these two rather different types of code.  It may be
: important for optimization (eg, it might be best to compile one with
: native code and emulate the other).
: Locality is another important
: issue.  The working set of the lexicon code is related to the number of
: different words in the sentence/text being processed.  This is probably
: reasonably small, even though the fine grain locality is poor.  For the
: grammar rules, fine grain locality is probably better, but the quantity
: of code used overall is larger.
: Lee Naish


: Point 1: the optimised native code is about twice as fast as the
: threaded code, because it doesn't have the overhead of the jmp *(SIMPC)++
: instructions, and perhaps more importantly because it is possible to
: combine the effects of a series of instructions moving things around into
: a simpler series putting things directly in their final place.
: Point 2: the emulated instructions might have occupied 6 (e.g. SB Prolog)
: or 12 (e.g. Quintus Prolog) bytes.  A VAX version of the optimised native
: code takes on the close order of 60.  That's for just one example, but it
: was to some extent a favourable example.
: Point 3: the emulator for an abstract machine is likely to be very small.
: Emulated code just won't be paging the emulator, and it is likely to get
: much higher locality of code reference than native code, which is good
: news for caches.  The emulated instructions are small, so they are good
: news for data caches.  The emulated instructions flow through the data
: cache, which is bad news for data caches, because the real data have to
: go through the same cache.
: Point 4: emulated systems can be easier to port.
: It's all a trade off.
: Richard A. O'Keefe


: Eckouse's book on PDP-11 programming (mid 70's) contains a section on
: inline-vs.-threaded-vs-`knotted' code.  This is probably the best
: description that I have seen in print.  (`Knotted' code is his name
: for a cute optimization on threading).  He includes a single data
: point <basic, inline, threaded, knotted> for a `real' program.  The
: data point supports his claim that you sometimes get better
: performance from threaded code than from inline code.
: ...
: several observations:
: * For systems that rely heavily on caching, and particularly where
:   caching actions (e.g., prefetch) can hurt performance (e.g, bus
:   contention on a shared-memory multiprocessor), inlining is a loss on
:   most large functions.
: * Threading is harder to apply and probably less general.  It is of
:   obvious utility in interpreters and for iterations over a fixed
:   parameter list.  It is less obvious how to use threading for
:   collections of arbitrary functions.  It is still less obvious what
:   is `best' when the function must have two calling conventions (the
:   `most general' and the optimized calls).
: ...
: %X /u1/rrh/bib/rrh.d/peterd.ref
: %T An Architectural Trail to Threaded-Code Systems
: %A Peter M. Kogge
: %P 22-33
: %J COMP
: %V 15
: %N 3
: %D MAR 1982
: %K forth
: ...
: Pardo
: ...
:
: * Global register allocation.
: ...
: Pardo


: I think the first is the most relevant, the following, you might find usefull.
: 
: @InProceedings(clocksin85,
:         Key="Clocksin",
:         Author="W. F. Clocksin",
:         Title="Implementation Techniques for Prolog Databases",
:         BookTitle="Software: Practice and Experience",
:         Pages="1-8",
:         Year="1985")
:         @comment{Keywords= assert}
: ...
: Periklis Tsahageas

The basic reason why we believe native code generation is better than
threaded code, is speed. Virtual machines, base register adressing, optimization
analysis, caches all start with the principle of locality of code and
data reference. You find this in the rule 90% of time in 10% of the code,
the name 'working set' and others.
So even if the amount of code produced when generating native code
is a lot higher, the working set itself will be moderate.
Having a complete emulator in 64k as Koomen reports, and we assume Quintus
uses, is nice of course, if your performance is not hurt too much. You
can plug in a 12Mbyte board instead of a 4Mbyte board; doing the same with a 
processor 12 Mips instead of 4 Mips may neither be as easy nor as effective.

In EUUG '89 David Rosenthal writes about a related topic, observed performance,
and says that a workstation performance depends heavily on the network/disk
speed, which can be improved by increasing memory, and if all that behaves well,
than you will see the effect of lower cputime. People tend to use the systems
resources to the edge where trashing is going to occur. This gives unexpected
drops in performance if you just go a little further.
It is obvious that programs exist were the small increase in paging due to
larger code may just do that.

It is also true that for some programs, it happens with both native and
threaded code because the cause is the heap reference, or the local stack,
not code reference. A program used as benchmark and written by O'Keefe,
doing fft, would run fine with the query u_time(9) and trash with u_time(10).
No relation with the code size, obviously !
Of course, this is not new. See the remarks of Neumerkel, and the well known
book of Tick.

The plm_compiler, which has a non-benchmark size, runs happily with native code.
No thrashing. That should answer the 'well-behaved benchmarks' remark
of Pereira. It may come as a suprise, but benchmarks tend to be in favor of
threaded code ! Just think of those long repetitions in some benchmark suites,
all using the same emulator instructions over and over again, whereas
in native code it is a brandnew instruction each time ... .
There is a counterpart of course. Any small recursive benchmark predicate
will be in the cache of a 68020 without much trouble, for native code.
With a thread, you may gain more than 10% of speed by tuning the
emulator to append/3 (concat/3). Do walk-list and walk-list-recursive ring
a bell ? The first gives overrated timings for threaded code, the second
shows native code at its best.

A good point is made by Pereira and agreed upon by Naish. Both mention
the problem that prolog programs contain two very different kinds of code.
You have the 'real' prolog predicates, and you have the facts. These are
quit different beasts, and we agree almost completely with Naish that 
it is best to compile one with native code and emulate the other.
This is precisely what is possible with BIM_Prolog. We support this idea at
different levels. First of all, there is a distinction between dynamic
and static code (code which is modifiable or code which is not). There are
cases where this is sufficient. As both are code, we compile both,
static code to native code, dynamic code only to WAM.
For fast access, there is indexing on one user selectable argument for
dynamic code, and up to three for static code.
If dynamic code access time is crucial, hashing can be added,
which gives a drastic page improvement if this would have been
a problem with the normal indexing.
We could construct an artificial testcase were all of a sudden response
dropped below acceptable level, using the emulated code, which is about
the same density as threaded code, and using a lot of facts with multiple
arguments. Hashing removed the problem at once.
Conclusion : running a benchmark on different systems without using any tuning
towards the implementation does not show its capacitys.
There is a second point about which we have been trying to convince people
- with more or less with success - is the principle that data and
program are not the same, despite its theoretical intersting properties.
What we mean is that using assert/retract for global variable
simulation is bad. We know 'global variable' is a bad thing, but we have
both our feet on the ground : sometimes it is better to have them. Just
as goto's (never used longjump in C?). Therefore we have a set of predicate
to handle them efficiently. The wise can use it to their advantage and others
should never write prolog.
The third point is the use of a database on disk. This may be a standard
relational database or a special prolog database. If your database is big,
paging will eventually be as slow as disk accesses. Maybe your database
caching will be even better. We could easily construct cases where a trashing
program was significantly slower than using a special database.

As (prolog) compiler technology advances, less instructions will be needed.
And of course, if the code is to long, you can always use a kind of call
to an out-of-line routine, gaining nothing, losing nothing compared
with the thread. Currently, many WAM instructions can be compiled in just a few
instructions. When using mode declarations, code size can often be reduced
substantially. Of course, you have to put them in your programs to
notice the reduction in code size.
There will be a paper at ICLP 89 showing what the future may bring.
An item mentioned by Pardo, is only applicable when using native code :
global register allocation. WAM register alloaction is still not fully
solved, due to tradeoffs and the interaction with things like
shallow backtracking. Not many attemps for global register allocation
have been made. As can be seen in Lisp, it could be usefull.

O'Keefe says 'It's all a trade off'. With this, we agree. The size of
code and the speedups are easy to trade off. If you develop programs in
prolog which need to compete with C-coded programs, you need top speed.
The arguments about caching are not that clear, we doubt he has a point.
See above. As for the ease of porting, this is hardly something users
care much about. It is an implementors tradeoff. And then, it depends on the
people. Porting BIM_Prolog 2.4 to VAX BSD took 14 days to get an alfa
release produkt, meaning a lot of testing is included in this time.
We find this very acceptable, although the work itself is boring.

As a last remark, the reference given by Keppel to the article of Kogge
is interesting for all who want to read something about threaded code
and how it should be done, if at all.

Andre' Marien, B.I.M.
Bart Demoen,   K.U.Leuven