Path: utzoo!attcan!uunet!munnari.oz.au!goanna!ok
From: ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe)
Newsgroups: comp.lang.prolog
Subject: Re: nlp code: request for comments
Message-ID: <4156@goanna.cs.rmit.oz.au>
Date: 30 Oct 90 11:09:11 GMT
References: <4691@rex.cs.tulane.edu>
Organization: Comp Sci, RMIT, Melbourne, Australia
Lines: 237

In article <4691@rex.cs.tulane.edu>, rpg@rex.cs.tulane.edu (Robert Goldman) writes:
> 4. I have used feature structures, along the lines of the ones in Gazdar
> and Mellish's Natural Language Processing in Prolog, as a
> representation for the quasi-logical form.

It's worth noting that their representation for feature structures
(an improper list of Feature:Value pairs) is more than somewhat ugly.
A tiny touch of pre-processing can make the source code much clearer
(declare feature clusters, e.g.

    :- features(case_frame, [agent,patient,beneficiary,...])

and then write rules that say e.g.

    p(Subj, Features, ...) --> {Features^[agent] = Subj}.

) and the code that actually _runs_ much faster (because unifications
are done in-line *as* unifications, not as calls to a non-logical
unify/3 or whatever it was).  I guess I should tidy up the code I gave
my students for this and post it.

> I would like to give this semantic processor to my students to
> examine, and I would appreciate it if any of you could comment on the
> coding, and let me know if I have committed any Prolog solecisms.

I hope you really meant that.

> EDITORIAL COMMENT:
> Quite frankly, I have had a fairly difficult time teaching this course
> using Allen's book and Prolog together.

I was doing the same thing exactly this year.  We ended up using rather
little of Allen.  My students got a _lot_ of handouts to make up for it.

> X a utility predicate
> X new_atom(A)
> X A must be unbound. Will be bound to a new name.

The program appears to be using Quintus Prolog (use of library(basics)
and the like...)
What on earth was wrong with the existing library predicate gensym/1,
or if the "foo" prefix was so very important, gensym/2?  The :- dynamic
declaration for count/1 should have been in the file gensym.pl (which
is the _only_ file that has any business knowing about that predicate)
not in load.pl.

There are actually two essentially unrelated things going on in the
file 'hierarchy.pl'.  First, symbols are being mapped to classes.
The improper-list-of-pairs encoding of feature structures is a rather
poor representation for types.  A much better representation is due to
Chris Mellish.  Suppose we have the single-inheritance 'ako' tree

              a
            /   \
           b     e
          / \   / \
         c   d f   g

with the individuals cee: c, dee: d, eff: f, gee: g.
We would map the individuals to terms representing their types thus:

    type_of(cee, b(c(_))).
    type_of(dee, b(d(_))).
    type_of(eff, e(f(_))).
    type_of(gee, e(g(_))).

More generally, for each <parent>/<child> arc in the tree, we have

    class_type(<child>, T0, T) :- class_type(<parent>, <child>(T0), T).

The top of the hierarchy <top> has the corresponding rule

    class_type(<top>, T, T).

When we have an individual <indiv> belonging to <class> we say

    indiv_type(<indiv>, T) :- class_type(<class>, _, T).

So here we have

    class_type(c, T0, T) :- class_type(b, c(T0), T).
    class_type(d, T0, T) :- class_type(b, d(T0), T).
    class_type(b, T0, T) :- class_type(a, b(T0), T).
    class_type(f, T0, T) :- class_type(e, f(T0), T).
    class_type(g, T0, T) :- class_type(e, g(T0), T).
    class_type(e, T0, T) :- class_type(a, e(T0), T).
    class_type(a, T, T).

    class_type(C, T) :- class_type(C, _, T).

    indiv_type(cee, T) :- class_type(c, T).
    indiv_type(dee, T) :- class_type(d, T).
    indiv_type(eff, T) :- class_type(f, T).
    indiv_type(gee, T) :- class_type(g, T).

This too is the kind of thing that can be done rather neatly by a
preprocessor.  Now, imagine that we want to say that a particular verb
must have an animate subject.  We might say

    may_fill(subject, see, X) :-
        class_type(animate, T),
        indiv_type(X, T).       % X's type is compatible with T

where the class_type/2 call can be preprocessed away.
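
To see the subsumption test fall out of plain unification, here is a
minimal sketch that simply loads the a/b/c/d/e/f/g tree above and asks
a few questions of it.  The helper is_a/2 is a hypothetical name
introduced only for this illustration (it plays the role that the body
of may_fill/3 plays above, with a class from the example tree in place
of "animate"); everything else is the set of clauses already given.

```prolog
% class_type(Class, T0, T): T is the type term for Class, built
% around the still-open "rest of the path" T0.  One clause per arc.
class_type(c, T0, T) :- class_type(b, c(T0), T).
class_type(d, T0, T) :- class_type(b, d(T0), T).
class_type(b, T0, T) :- class_type(a, b(T0), T).
class_type(f, T0, T) :- class_type(e, f(T0), T).
class_type(g, T0, T) :- class_type(e, g(T0), T).
class_type(e, T0, T) :- class_type(a, e(T0), T).
class_type(a, T, T).                    % a is the top of the tree

% class_type(Class, T): T is the most general type term for Class.
class_type(C, T) :- class_type(C, _, T).

% The individuals and the classes they belong to.
indiv_type(cee, T) :- class_type(c, T).
indiv_type(dee, T) :- class_type(d, T).
indiv_type(eff, T) :- class_type(f, T).
indiv_type(gee, T) :- class_type(g, T).

% is_a(Indiv, Class): hypothetical helper, not part of the original
% program.  It succeeds when the type term of Indiv unifies with the
% type term of Class, i.e. when Class is an ancestor-or-self of the
% class Indiv belongs to.
is_a(Indiv, Class) :-
    class_type(Class, T),
    indiv_type(Indiv, T).

% ?- is_a(cee, b).                  succeeds: b(c(_)) unifies with b(_)
% ?- is_a(cee, e).                  fails:    b(c(_)) vs e(_)
% ?- findall(X, is_a(X, b), L).     L = [cee,dee]
```

Note that nothing here walks the tree at query time in search of an
ancestor: once the type terms are built, "is X a kind of b?" is a
single unification, which is the whole point of the encoding.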

Chris Mellish pointed out that this scheme generalises to systems where
multiple classifications apply to the same thing.  For example,
something of type "agreement" might be classified according to
"person", "number", and "gender", so we might have

    agreement(1 | 2 | 3, s | p, m | f | n)

With that scheme, we can easily represent things like

    agreement(_,p,_)        "plural"
    agreement(3,_,f)        "third person feminine"

and combine them:

    agreement(3,p,f)        "third person plural feminine"

This can be much more economical, and is in my view much clearer, than
lists of unstructured atoms.

MAKE UNIFICATION WORK FOR YOU!

As a particular example of doing things clearly with terms instead of
pounding away on lists, consider the complements of a verb phrase.
Goldman's program does

    vp(...) --> ...
        {subcat(V, Subcat)},
        compl(..., Subcat, Gap).

    compl(compl(nil),Subcat,X/X) -->
        {member(iv,Subcat)},[].
    compl(compl(NP),Subcat,GapInfo) -->
        {member(tv,Subcat)},
        np(NP,_,GapInfo).
    compl(compl(NP1,NP2),Subcat,GapIn/GapOut) -->
        {member(bv,Subcat)},
        np(NP1,_,GapIn/Gap1),
        np(NP2,_,Gap1/GapOut).

where subcat/2 returns a subset of {iv,tv,bv} represented as a list.
But why use a list here?  Suppose instead that we represent the verb
subcategorisation as a triple

    v(i | 0, t | 0, b | 0)

where i, t, b mean that the verb _can_ be used as an intransitive,
transitive, or ditransitive-or-benefactive respectively, and 0 in a
particular slot means it can't.  Let's move this information to the
front as well: it is always a good idea to have the argument which
we're dispatching on be the first so that a human reader has the least
possible trouble finding it.  Then we get

    compl(v(i,_,_), comp0, Gap, Gap) --> [].
    compl(v(_,t,_), comp1(NP), Gap0, Gap) -->
        np(NP, _, Gap0, Gap).
    compl(v(_,_,b), comp2(N1,N2), Gap0, Gap) -->
        np(N1, _, Gap0, Gap1),
        np(N2, _, Gap1, Gap).

There's a lot of left-over Lisp in the code.  For example, there's a
rule that starts out

    s(s(NP,VP),yes-no-q) --> ...

Now yes-no-q is a perfectly good Lisp atom (it's a spelling of
|YES-NO-Q|) but it is a compound term in Prolog -(-(yes,no),q).
Why does that matter?  Because a later rule tries to use it as a
function symbol!

A rather worse hangover (and if it isn't a headache now, it soon will
be) from Lisp is the use of 'nil' as a "default" or "absent" marker.
Here's a particularly important case.

    translate(String,Tree,Sem) :-
        parse(Tree,Mood,String),
        semantics(Tree,Mood,Sem).

    semantics(smaj(Tree),Mood,Sem) :-
        Mood \= wh_q,
        Sem =.. [Mood,SSem],
        sem_translate(Tree,SSem,nil).
    semantics(smaj(Tree),wh_q,wh_q(Whvar,SSem)) :-
        new_atom(Whvar),
        sem_translate(Tree,SSem,Whvar).

In both of the calls to sem_translate/3 we pass an atom as the last
argument.  An atom spelled "nil" means "there isn't any Wh-variable".
An atom spelled "foo123" or the like means "there is a Wh-variable
called foo123".  That is NOT good Prolog coding practice.  What are the
situations, and what are the associated data?

    - there is a Wh-variable X
    - there is no Wh-variable

Invent names for these situations, and make the associated data the
arguments of appropriate terms

    - var(X)    means there is a Wh-variable X
    - novar     means there is no Wh-variable

Then later on we'll be able to ask "was there a Wh-variable" by doing
Wh = var(_) instead of by doing Wh \== nil.

That's far from the only problem here.  The program does
"Mood \= wh_q" in order to test whether Mood is decl or yn_q (assuming
that yes-no-q should have been yn_q).  There is no point in using
(\=)/2 here; it would be better to use the built-in predicate (\==)/2.
But it's better still to say exactly what you do mean, so that a human
reader can see what the possible cases for Mood are.  (The use of
(=..)/2 is a fairly reliable cue that something rather strange is going
on.  This is the bit that breaks if Mood is yes-no-q.)  The variable
names aren't too good either: there isn't any String here; but there
_is_ a list of Words.

    translate(Words, Tree, Sem) :-
        parse(Tree, Mood, Words),
        semantics(Mood, Tree, Sem).

    semantics(decl, smaj(Tree), decl(Sem)) :-
        sem_translate(Tree, Sem, novar).
    semantics(yn_q, smaj(Tree), yn_q(Sem)) :-
        sem_translate(Tree, Sem, novar).
    semantics(wh_q, smaj(Tree), wh_q(WhVar,Sem)) :-
        gensym(WhVar),
        sem_translate(Tree, Sem, var(WhVar)).

And so it goes.  It would improve the program a _lot_ to have a comment
which says exactly what a Tree or a Sem can look like.

There's a lot more that could be said.  One thing that _does_ need to
be said is that I was very pleased to see this posting, and I've put a
copy of it where my students can get at it.  Never mind the flaws, at
least it's _there_ and it's a place to _start_.  Much the same can be
said about the code in the Gazdar & Mellish book; the code there isn't
very good, but it's _there_ and is a place to _start_, whereas Allen
leaves you pretty much on your own.

-- 
The problem about real life is that moving one's knight to QB3
may always be replied to with a lob across the net.  --Alasdair Macintyre.