Xref: utzoo comp.ai:2173 sci.lang:2992
Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!cornell!rochester!udel!princeton!mind!harnad
From: harnad@mind.UUCP (Stevan Harnad)
Newsgroups: comp.ai,sci.lang
Subject: Re: Pinker & Prince Reply (long version)
Keywords: connectionism, symbolic rules, learnability, past tense formation
Message-ID: <2818@mind.UUCP>
Date: 1 Sep 88 19:13:37 GMT
References: <2816@mind.UUCP> <2817@mind.UUCP>
Followup-To: comp.ai
Organization: Cognitive Science, Princeton University
Lines: 301

ON THEFT VS HONEST TOIL

Pinker & Prince (prince@mit.cogito.edu) write in reply:

>> Contrary to your suggestion, we never claimed that pattern associators
>> cannot learn the past tense rule, or anything else, in principle.

I've reread the paper, and unfortunately I still find it ambiguous: For example, in one place (p. 183) you write:

"These problems are exactly that, problems. They do not demonstrate that interesting PDP models of language are impossible in principle."

But elsewhere (p. 179) you write:

"the representations used in decomposed, modular systems are abstract, and many aspects of their organization cannot be learned in any obvious way." [Does past tense learning depend on any of this unlearnable organization?]

On p. 181 you write:

"Perhaps it is the limitations of these simplest PDP devices -- two-layer association networks -- that causes problems for the R & M model, and these problems would diminish if more sophisticated kinds of PDP networks were used."

But earlier on the same page you write:

"a model that can learn all possible degrees of correlation among a set of features is not a model of a human being" [Sounds like a Catch-22...]

It's because of this ambiguity that my comments were made in the form of conditionals and questions rather than assertions. But we now stand answered: You do NOT claim "that pattern associators cannot learn the past tense rule, or anything else, in principle."

[Oddly enough, I do: if by "pattern associators" you mean (as you mostly seem to mean) 2-layer perceptron-style nets like the R & M model, then I would claim that they cannot learn the kinds of things Minsky showed they couldn't learn, in principle. Whether or not more general nets (e.g., PDP models with hidden layers, back-prop, etc.) will turn out to have corresponding higher-order limitations seems to be an open question at this point.]

You go on to quote my claim that:

"the regularities you describe -- both in the irregulars and the regulars -- are PRECISELY the kinds of invariances you would expect a statistical pattern learner that was sensitive to higher order correlations to be able to learn successfully. In particular, the form-independent default option for the regulars should be readily inducible from a representative sample."

and then you comment:

>> This is an interesting claim and we strongly encourage you to back it
>> up with argument and analysis; a real demonstration of its truth would
>> be a significant advance. It's certainly false of the R-M and
>> Egedi-Sproat models. There's a real danger in this kind of glib
>> commentary of trivializing the issues by assuming that net models are
>> a kind of miraculous wonder tissue that can do anything.

I don't understand the logic of your challenge. You've disavowed having claimed that any of this was unlearnable in principle. Why is it glibber to conjecture that it's learnable in practice than that it's unlearnable in practice? From everything you've said, it certainly LOOKS perfectly learnable: Sample a lot of forms and discover that the default regularity turns out to work well in most cases (i.e., the "regulars"; the rest, the "irregulars," have their own local invariances, likewise inducible from statistical regularities in the data). This has nothing to do with a belief in wonder tissue.
It was precisely in order to avoid irrelevant stereotypes like that that the first posting was prominently preceded by the disclaimer that I happen to be a sceptic about connectionism's actual accomplishments and an agnostic about its future potential. My critique was based solely on the logic of your argument against connectionism (in favor of symbolism).

Based only on what you've written about its underlying regularities, past tense rule learning simply doesn't seem to pose a serious challenge for a statistical learner -- not in principle, at any rate. It seems to have stumped R & M 86 and E & S 88 in practice, but how many tries is that? It is possible, for example, as suggested by your valid analysis of the limitations of the Wickelfeature representation, that some of the requisite regularities are simply not reflected in this phonological representation, or that other learning (e.g. plurals) must complement past-tense data. This looks more like an entry-point problem (see (1) below), however, rather than a problem of principle for connectionist learning of past tense formation. After all, there's no serious underdetermination here; it's not like looking for a needle in a haystack, or NP-complete, or like that.

I agree that R & M made rather inflated general claims on the basis of the limited success of R & M 86. But (to me, at any rate) the only potentially substantive issue here seems to be the one of principle (about the relative scope and limits of the symbolic vs. the connectionistic approach). Otherwise we're all just arguing about the scope and limits of R & M 86 (and perhaps now also E & S 88).

Two sources of ambiguity seem to be keeping this disagreement unnecessarily vague:

(1) There is an "entry-point" problem in comparing a toy model (e.g., R & M 86) with a lifesize cognitive capacity (e.g., the human ability to form past tenses): The capacity may not be modular; it may depend on other capacities.
For example, as you point out in your article, other phonological and morphological data and regularities (e.g., pluralization) may contribute to successful past tense formation. Here again, the challenge is to come up with a PRINCIPLED limitation, for otherwise the connectionist can reasonably claim that there's no reason to doubt that those further regularities could have been netted exactly the same way (if they had been the target of the toy model); the entry point just happened to be arbitrarily downstream. I don't say this isn't hand-waving; but it can't be interestingly blocked by hand-waving in the opposite direction.

(2) The second factor is the most critical one: learning. You put a lot of weight on the idea that if nets turn out to behave rulefully then this is a vindication of the symbolic approach. However, you make no distinction between rules that are built in (as "constraints," say) and rules that are learned. The endstate may be the same, but there's a world of difference in how it's reached -- and that may turn out to be one of the most important differences between the symbolic approach and connectionism: Not whether they use rules, but how they come by them -- by theft or honest toil. Typically, the symbolic approach builds them in, whereas the connectionistic one learns them from statistical regularities in its input data. This is why the learnability issue is so critical. (It is also what makes it legitimate for a connectionist to conjecture, as in (1) above, that if a task is nonmodular, and depends on other knowledge, then that other knowledge too could be acquired the same way: by learning.)

>> Your claim about a 'statistical pattern learner...sensitive to higher
>> order correlations' is essentially impossible to evaluate.

There are in principle two ways to evaluate it, one empirical and open-ended, the other analytical and definitive.
You can demonstrate that specific regularities can be learned from specific data by getting a specific learning model to do it (but its failure would only be evidence that that model fails for those data). The other way is to prove analytically that certain kinds of regularities are (or are not) learnable from certain kinds of data (by certain means, I might add, because connectionism may be only one candidate class of statistical learning algorithms). Poverty-of-the-stimulus arguments attempt to demonstrate the latter (i.e., unlearnability in principle).

>> We're mystified that you attribute to us the claim that "past
>> tense formation is not learnable in principle."... No one in his right
>> mind would claim that the English past tense rule is "built in". We
>> spent a full seven pages (130-136) of 'OLC' presenting a simple model
>> of how the past tense rule might be learned by a symbol manipulation
>> device. So obviously we don't believe it can't be learned.

Here are some extracts from OLC 130ff:

"When a child hears an inflected verb in a single context, it is utterly ambiguous what morphological category the inflection is signalling... Pinker (1984) suggested that the child solves this problem by "sampling" from the space of possible hypotheses defined by combinations of an innate finite set of elements, maintaining these hypotheses in the provisional grammar, and testing them against future uses of that inflection, expunging a hypothesis if it is counterexemplified by a future word. Eventually... only correct ones will survive." [The text goes on to describe a mechanism in which hypothesis strength grows with success frequency and diminishes with failure frequency through trial and error.]

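The strength-adjusting mechanism summarized in the bracketed note can be made concrete with a toy sketch. This is an illustration under loud assumptions, not the model described in OLC: the three candidate hypotheses, the +1/-1 update, and the tiny orthographic verb sample are all hypothetical choices made for the example.

```python
# Toy sketch of hypothesis strengthening by sampling instances.
# ASSUMPTIONS (not from OLC): the candidate hypotheses, the +1/-1 update,
# and the spelled-out verb sample are invented for illustration only.

def add_ed(stem):
    # Candidate "regular" rule: append the suffix.
    return stem + "ed"

def no_change(stem):
    # Candidate rule: past form identical to the stem (hit -> hit).
    return stem

def i_to_a(stem):
    # Candidate rule: change the first "i" to "a" (sing -> sang).
    return stem.replace("i", "a", 1)

HYPOTHESES = {"add_ed": add_ed, "no_change": no_change, "i_to_a": i_to_a}

SAMPLE = [
    ("walk", "walked"), ("jump", "jumped"), ("talk", "talked"),
    ("sing", "sang"), ("hit", "hit"), ("play", "played"),
]

def update_strengths(sample, hypotheses):
    """Raise a hypothesis's strength on each success, lower it on each failure."""
    strengths = {name: 0 for name in hypotheses}
    for stem, past in sample:
        for name, rule in hypotheses.items():
            strengths[name] += 1 if rule(stem) == past else -1
    return strengths

strengths = update_strengths(SAMPLE, HYPOTHESES)
default_rule = max(strengths, key=strengths.get)
print(strengths, default_rule)
```

On this sample the suffixing hypothesis ends up strongest. Nothing in the tally itself settles whether the hypotheses were built in or induced, which is the point at issue.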
"Any adequate rule-based theory will have to have a module that extracts multiple regularities at several levels of generality, assign them strengths related to their frequency of exemplification by input verbs, and let them compete in generating a past tense for for a given verb." It's not entirely clear from the description on pp. 130-136 (probably partly because of the finessed entry-point problem) whether (i) this is an innate parameter-setting or fine-tuning model, as it sounds, with the "learning" really just choosing among or tuning the built-in parameter settings, or whether (ii) there's genuine bottom-up learning going on here. If it's the former, then that's not what's usually meant by "learning." If it's the latter, then the strength-adjusting mechanism sounds equivalent to a net, one that could just as well have been implemented nonsymbolically. (You do state that your hypothetical module would be equivalent to R & M's in many respects, but it is not clear how this supports the symbolic approach.) [It's also unclear what to make of the point you add in your reply (again partly because of the entry-point problem): >>"(In the case of the past tense rule, there is a clear P-of-S argument for at least one aspect of the organization of the inflectional system...)">> Is this or is this not a claim that all or part of English past tense formation is not learnable (from the data available to the child) in principle? There seems to be some ambiguity (or perhaps ambivalence) here.] >> The only way we can make sense of this misattribution is to suppose >> that you equate "learnable" with "learnable by some (nth-order) >> statistical algorithm". The underlying presupposition is that >> statistical modeling (of an undefined character) has some kind of >> philosophical priority over other forms of analysis; so that if >> statistical modeling seems somehow possible-in-principle, then >> rule-based models (and the problems they solve) can be safely ignored. 
Yes, I equate learnability with an algorithm that can extract statistical regularities (possibly nth order) from input data. Connectionism seems to be (an interpretation of) a candidate class of such algorithms; so does multiple nonlinear regression. The question of "philosophical priority" is a deep one (on which I've written: "Induction, Evolution and Accountability," Ann. NY Acad. Sci. 280, 1976). Suffice it to say that induction has epistemological priority over innatism (or such a case can be made) and that a lot of induction (including hypothesis-strengthening by sampling instances) has a statistical character. It is not true that where statistical induction is possible, rule-based models must be ignored (especially if the rule-based models learn by what is equivalent to statistics anyway), only that the learning NEED not be implemented symbolically. But it is true that where a rule can be learned from regularities in the data, it need not be built in.

[Ceterum sentio: there is an entry-point problem for symbols that I've also written about: "Categorical Perception," Cambr. U. Pr. 1987. I describe there a hybrid approach in which symbolic and nonsymbolic representations, including a connectionistic component, are put together bottom-up in a principled way that avoids spuriously pitting connectionism against symbolism.]

>> As a kind of corollary, you seem to assume that unless the input is so
>> impoverished as to rule out all statistical modeling, rule theories
>> are irrelevant; that rules are impossible without major stimulus-poverty.

No, but I do think there's an entry-point problem. Symbolic rules can indeed be used to implement statistical learning, or even to preempt it, but they must first be grounded in nonsymbolic learning or in innate structures. Where there is learnability in principle, learning does have "philosophical (actually methodological) priority" over innateness.

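One concrete reading of "possibly nth order" comes from the textbook XOR case, used here only as an illustration and not as a claim about past tense data. In the sketch below (the function names, the 0/1 input coding, and the epoch cap are assumptions made for the example), a learner using the perceptron rule never reaches an error-free pass on XOR with first-order inputs, but does once the second-order product of the two inputs is supplied as a feature.

```python
# Sketch: XOR is not linearly separable in its raw inputs, but becomes
# separable when a second-order (product) feature is added.
# ASSUMPTIONS: names, 0/1 coding, and max_epochs are illustrative choices.

XOR = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

def first_order(x):
    # The two inputs plus a constant bias term.
    return [x[0], x[1], 1]

def second_order(x):
    # Same as first_order, plus the pairwise product of the inputs.
    return [x[0], x[1], x[0] * x[1], 1]

def perceptron_converges(data, featurize, max_epochs=1000):
    """Apply the perceptron rule; return True if some epoch makes no errors."""
    w = [0.0] * len(featurize(data[0][0]))
    for _ in range(max_epochs):
        errors = 0
        for x, label in data:
            f = featurize(x)
            activation = sum(wi * fi for wi, fi in zip(w, f))
            if label * activation <= 0:  # misclassified or on the boundary
                w = [wi + label * fi for wi, fi in zip(w, f)]
                errors += 1
        if errors == 0:
            return True
    return False

raw_ok = perceptron_converges(XOR, first_order)
product_ok = perceptron_converges(XOR, second_order)
print(raw_ok, product_ok)
```

The learning rule is identical in both runs; only the order of the features made available to it differs.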
>> In our view, the question is not CAN some (ungiven) algorithm
>> 'learn' it, but DO learners approach the data in that fashion.
>> Poverty-of-the-stimulus considerations are one out of many
>> sources of evidence in this issue...
>> developmental data confirm that children do not behave the way such a
>> pattern associator behaves.

Poverty-of-the-stimulus arguments are the cornerstone of modern linguistics because, if they are valid, they entail that certain rules (or constraints) are unlearnable in principle (from the data available to the child) and hence that a learning model must fail for such cases. The rule system itself must accordingly be attributed to the brain, rather than just the general-purpose inductive wherewithal to learn the rules from experience.

Where something IS learnable in principle, there is of course still a question as to whether it is indeed learned in practice rather than being innate; but neither (a) the absence of data on whether it is learned nor (b) the existence of a rule-based model that confers it on the child for free provides very strong empirical guidance in such a case. In any event, developmental performance data themselves seem far too impoverished to decide between rival theories at this stage. It seems advisable to devise theories that account for more lifesize chunks of our asymptotic (adult) performance capacity before trying to fine-tune them with developmental (or neural, or reaction-time, or brain-damage) tests or constraints. (Standard linguistic theory has in any case found it difficult to find either confirmation or refutation in developmental data to date.)

By way of a concrete example, suppose we had two pairs of rival toy models, symbolic vs. connectionistic, one pair doing chess-playing and the other doing factorials. (By a "toy" model I mean one that models some arbitrary subset of our total cognitive capacity; all models to date, symbolic and connectionistic, are toy models in this sense.)
The symbolic chess player and the connectionistic chess player both perform at the same level; so do the symbolic and connectionistic factorializers. It seems evident that so little is known about how people actually learn chess and factorials that "developmental" support would hardly be a sound basis for choosing between the respective pairs of models (particularly because of the entry-point problem, since these skills are unlikely to be acquired in isolation). A much more principled way would be to see how they scaled up from this toy skill to more and more lifesize chunks of cognitive capacity. (It has to be conceded, however, that the connectionist models would have a marginal lead in this race, because they would already be using the same basic [statistical learning] algorithm for both tasks, and for all future tasks, presumably, whereas the symbolic approach would have to be making its rules on the fly, an increasingly heavy load.)

I am agnostic about who would win this race; connectionism may well turn out to be side-lined early because of a higher-order Perceptron-like limit on its rule-learning ability, or because of principled unlearnability handicaps. Who knows? But the race is on. And it seems obvious that it's far too early to use developmental (or neural) evidence to decide which way to bet. It's not even clear that it will remain a 2-man race for long -- or that a finish might not be more likely as a collaborative relay. (Nor is the one who finishes first or gets farthest guaranteed to be the "real" winner -- even WITH developmental and neural support. But that's just normal underdetermination.)

>> if you simply wire up a network to do exactly what a rule does, by
>> making every decision about how to build the net (which features to
>> use, what its topology should be, etc.) by consulting the rule-based
>> theory, then that's a clear sense in which the network "implements"
>> the rule

What if you don't WIRE it up but TRAIN it up?
That's the case at issue here, not the one you describe. (I would of course agree that if nets wire in a rule as a built-in constraint, that's theft, not honest toil, but that's not the issue!)

--
Stevan Harnad
ARPANET: harnad@mind.princeton.edu harnad@princeton.edu harnad@confidence.princeton.edu srh@flash.bellcore.com harnad@mind.uucp
BITNET: harnad%mind.princeton.edu@pucc.bitnet
UUCP: princeton!mind!harnad
CSNET: harnad%mind.princeton.edu@relay.cs.net