Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ames!lll-winken!uunet!mcvax!prlb2!kulcs!bimandre From: bimandre@kulcs.uucp (Andre Marien) Newsgroups: comp.lang.prolog Subject: Inline expansion versus threaded code Message-ID: <1654@kulcs.kulcs.uucp> Date: 28 Apr 89 06:55:14 GMT Reply-To: bimandre@kulcs.UUCP () Organization: B.I.M., Belgium Lines: 269 (This reply became rather long ...) First, a summary of responses in chronological order. : It seems obvious to me that threaded code will page much more. Inline code : will avoid calls to potentially far-away memory locations that require paging : in memory. ... Just my intuitions ... : Bruce Krulwich : You may want to look into Envos, a recent Xerox spin-off marketing Xerox : AI technology. They have ported Interlisp-D, which runs on Xerox : D-machines (microcoded Bytecode), to the Sun platform by writing a : Bytecode emulator. They apparently get quite good performance (emulator : fits in 64K cache) and extremely dense code (5Meg for a full system vs. : 30-40Meg for similar Symbolics image). Native code would be faster, but : perhaps not significantly, and would cause very significant code : expansion. ... : Johannes A. G. M. Koomen : I'm shure You know it, but however: E.Tick Memory Performance of Prolog : Architectures, Kluwers. : ... for very short, what is the essence of this book: the WAM is not bad, : but its only pitfall is the great memory throughput because of : shallow backtracking ... the primary problem is __PAGING__!!! : ... : The biggest problem is not inline or not but the abstract machine. If it : has inherent failures inline expansion will have no great effects. And : there lies the difference between Prolog and other programming languages. : For the WAM inline expansion is indeed very attractive for determinate : parts of programs. But when shallow backtracking is present the savings of : the instruction fetches are neglectable. When designing a Prolog-machine : the primary concern should be the locality of memory references, inline : expansion should be seen as an improvement for some higly used code. If one : considers big Prolog applications (eg TWAICE) one sees clearly that inline : code can be used for the smallest parts only. : Ulrich Neumerkel : Our experience with large Prolog and C programs for natural-language : analysis, grammar analysis, simulation and speech understanding is : that soon after your program starts paging (even on an empty 32MB : Sun-4/280 with local SMD disk), you may as well forget about it. The : problem is that all of those applications create or use very large : relations over complex terms leading to essentially random access to : most of program memory for Prolog or data area for C. : ... : It doesn't matter much to us that native code might be : between 1.5 and 3 times faster than threaded code on small or : well-behaved benchmarks, since native code would be too bulky to run : at all! : ... : Fernando Pereira : : My figures for native vs. threaded code came from comparing Quintus : Prolog with two commercial native code compilers for Prolog two years : ago, and from discussions with Richard O'Keefe about Prolog : compilation. Native code compilation techniques may have improved : since then. : The problem of access to rules is more complicated in parsers that : do not use the rules directly but instead use some kind of parsing : tables derived from the rules. Usually, the tables are much larger : than the original rule set. : ... : Fernando Pereira : When I was visiting SICS over the northern winter I ran the MTS natural : language system, written by Xiuming Huang, under Sicstus Prolog using : the bytecode emulator and the (new) native code system. I have to agree : that paging really kills the system. : ... : About half the code in MTS is complex dcg rules and about half is the : lexicon (large sets of facts). I'm not sure how the size and speed : ratios compare with these two rather different types of code. It may be : important for optimization (eg, it might be best to compile one with : native code and emulate the other). : Locality is another important : issue. The working set of the lexicon code is related to the number of : different words in the sentence/text being processed. This is probably : reasonably small, even though the fine grain locality is poor. For the : grammar rules, fine grain locality is probably better, but the quantity : of code used overall is larger. : Lee Naish : Point 1: the optimised native code is about twice as fast as the : threaded code, because it doesn't have the overhead of the jmp *(SIMPC)++ : instructions, and perhaps more importantly because it is possible to : combine the effects of a series of instructions moving things around into : a simpler series putting things directly in their final place. : Point 2: the emulated instructions might have occupied 6 (e.g. SB Prolog) : or 12 (e.g. Quintus Prolog) bytes. A VAX version of the optimised native : code takes on the close order of 60. That's for just one example, but it : was to some extent a favourable example. : Point 3: the emulator for an abstract machine is likely to be very small. : Emulated code just won't be paging the emulator, and it is likely to get : much higher locality of code reference than native code, which is good : news for caches. The emulated instructions are small, so they are good : news for data caches. The emulated instructions flow through the data : cache, which is bad news for data caches, because the real data have to : go through the same cache. : Point 4: emulated systems can be easier to port. : It's all a trade off. : Richard A. O'Keefe : Eckouse's book on PDP-11 programming (mid 70's) contains a section on : inline-vs.-threaded-vs-`knotted' code. This is probably the best : description that I have seen in print. (`Knotted' code is his name : for a cute optimization on threading). He includes a single data : point for a `real' program. The : data point supports his claim that you sometimes get better : performance from threaded code than from inline code. : ... : several observations: : * For systems that rely heavily on caching, and particularly where : caching actions (e.g., prefetch) can hurt performance (e.g, bus : contention on a shared-memory multiprocessor), inlining is a loss on : most large functions. : * Threading is harder to apply and probably less general. It is of : obvious utility in interpreters and for iterations over a fixed : parameter list. It is less obvious how to use threading for : collections of arbitrary functions. It is still less obvious what : is `best' when the function must have two calling conventions (the : `most general' and the optimized calls). : ... : %X /u1/rrh/bib/rrh.d/peterd.ref : %T An Architectural Trail to Threaded-Code Systems : %A Peter M. Kogge : %P 22-33 : %J COMP : %V 15 : %N 3 : %D MAR 1982 : %K forth : ... : Pardo : ... : : * Global register allocation. : ... : Pardo : I think the first is the most relevant, the following, you might find usefull. : : @InProceedings(clocksin85, : Key="Clocksin", : Author="W. F. Clocksin", : Title="Implementation Techniques for Prolog Databases", : BookTitle="Software: Practice and Experience", : Pages="1-8", : Year="1985") : @comment{Keywords= assert} : ... : Periklis Tsahageas The basic reason why we believe native code generation is better than threaded code, is speed. Virtual machines, base register adressing, optimization analysis, caches all start with the principle of locality of code and data reference. You find this in the rule 90% of time in 10% of the code, the name 'working set' and others. So even if the amount of code produced when generating native code is a lot higher, the working set itself will be moderate. Having a complete emulator in 64k as Koomen reports, and we assume Quintus uses, is nice of course, if your performance is not hurt too much. You can plug in a 12Mbyte board instead of a 4Mbyte board; doing the same with a processor 12 Mips instead of 4 Mips may neither be as easy nor as effective. In EUUG '89 David Rosenthal writes about a related topic, observed performance, and says that a workstation performance depends heavily on the network/disk speed, which can be improved by increasing memory, and if all that behaves well, than you will see the effect of lower cputime. People tend to use the systems resources to the edge where trashing is going to occur. This gives unexpected drops in performance if you just go a little further. It is obvious that programs exist were the small increase in paging due to larger code may just do that. It is also true that for some programs, it happens with both native and threaded code because the cause is the heap reference, or the local stack, not code reference. A program used as benchmark and written by O'Keefe, doing fft, would run fine with the query u_time(9) and trash with u_time(10). No relation with the code size, obviously ! Of course, this is not new. See the remarks of Neumerkel, and the well known book of Tick. The plm_compiler, which has a non-benchmark size, runs happily with native code. No thrashing. That should answer the 'well-behaved benchmarks' remark of Pereira. It may come as a suprise, but benchmarks tend to be in favor of threaded code ! Just think of those long repetitions in some benchmark suites, all using the same emulator instructions over and over again, whereas in native code it is a brandnew instruction each time ... . There is a counterpart of course. Any small recursive benchmark predicate will be in the cache of a 68020 without much trouble, for native code. With a thread, you may gain more than 10% of speed by tuning the emulator to append/3 (concat/3). Do walk-list and walk-list-recursive ring a bell ? The first gives overrated timings for threaded code, the second shows native code at its best. A good point is made by Pereira and agreed upon by Naish. Both mention the problem that prolog programs contain two very different kinds of code. You have the 'real' prolog predicates, and you have the facts. These are quit different beasts, and we agree almost completely with Naish that it is best to compile one with native code and emulate the other. This is precisely what is possible with BIM_Prolog. We support this idea at different levels. First of all, there is a distinction between dynamic and static code (code which is modifiable or code which is not). There are cases where this is sufficient. As both are code, we compile both, static code to native code, dynamic code only to WAM. For fast access, there is indexing on one user selectable argument for dynamic code, and up to three for static code. If dynamic code access time is crucial, hashing can be added, which gives a drastic page improvement if this would have been a problem with the normal indexing. We could construct an artificial testcase were all of a sudden response dropped below acceptable level, using the emulated code, which is about the same density as threaded code, and using a lot of facts with multiple arguments. Hashing removed the problem at once. Conclusion : running a benchmark on different systems without using any tuning towards the implementation does not show its capacitys. There is a second point about which we have been trying to convince people - with more or less with success - is the principle that data and program are not the same, despite its theoretical intersting properties. What we mean is that using assert/retract for global variable simulation is bad. We know 'global variable' is a bad thing, but we have both our feet on the ground : sometimes it is better to have them. Just as goto's (never used longjump in C?). Therefore we have a set of predicate to handle them efficiently. The wise can use it to their advantage and others should never write prolog. The third point is the use of a database on disk. This may be a standard relational database or a special prolog database. If your database is big, paging will eventually be as slow as disk accesses. Maybe your database caching will be even better. We could easily construct cases where a trashing program was significantly slower than using a special database. As (prolog) compiler technology advances, less instructions will be needed. And of course, if the code is to long, you can always use a kind of call to an out-of-line routine, gaining nothing, losing nothing compared with the thread. Currently, many WAM instructions can be compiled in just a few instructions. When using mode declarations, code size can often be reduced substantially. Of course, you have to put them in your programs to notice the reduction in code size. There will be a paper at ICLP 89 showing what the future may bring. An item mentioned by Pardo, is only applicable when using native code : global register allocation. WAM register alloaction is still not fully solved, due to tradeoffs and the interaction with things like shallow backtracking. Not many attemps for global register allocation have been made. As can be seen in Lisp, it could be usefull. O'Keefe says 'It's all a trade off'. With this, we agree. The size of code and the speedups are easy to trade off. If you develop programs in prolog which need to compete with C-coded programs, you need top speed. The arguments about caching are not that clear, we doubt he has a point. See above. As for the ease of porting, this is hardly something users care much about. It is an implementors tradeoff. And then, it depends on the people. Porting BIM_Prolog 2.4 to VAX BSD took 14 days to get an alfa release produkt, meaning a lot of testing is included in this time. We find this very acceptable, although the work itself is boring. As a last remark, the reference given by Keppel to the article of Kogge is interesting for all who want to read something about threaded code and how it should be done, if at all. Andre' Marien, B.I.M. Bart Demoen, K.U.Leuven