Xref: utzoo comp.ai:2172 sci.lang:2991
Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!cornell!rochester!udel!princeton!mind!harnad
From: harnad@mind.UUCP (Stevan Harnad)
Newsgroups: comp.ai,sci.lang
Subject: Pinker & Prince Reply (long version)
Keywords: connectionism, symbolic rules, learnability, past tense formation
Message-ID: <2817@mind.UUCP>
Date: 1 Sep 88 19:09:35 GMT
References: <2816@mind.UUCP>
Followup-To: comp.ai
Organization: Cognitive Science, Princeton University
Lines: 243

Posted for Pinker & Prince by S. Harnad
------------------------------------------------------------------
From: Steve Pinker
To: Stevan Harnad (harnad@mind.princeton.edu)
Site: MIT Center for Cognitive Science
Subject: answers to S. Harnad's questions, longer version

This letter is a reply to your posted list of questions and observations alluding to our paper "On language and connectionism: Analysis of a PDP model of language acquisition" (Pinker & Prince, 1988; see also Prince and Pinker, 1988). The questions are based on misunderstandings of our papers, in which they are already answered.

(1) Contrary to your suggestion, we never claimed that pattern associators cannot learn the past tense rule, or anything else, in principle. Our concern is with which theories of the psychology of language are true. This question cannot be answered from an armchair but only by examining what people learn and how they learn it. Our main conclusion is that the claim that the English past tense rule is learned and represented as a pattern-associator with distributed representations over phonological features for input and output forms (e.g., the Rumelhart-McClelland 1986 model) is false. That's because what pattern-associators are good at is precisely what the regular rule doesn't need. Pattern associators are designed to pick up patterns of correlation among input and output features.
The regular past tense alternation, as acquired by English speakers, is not systematically sensitive to phonological features. Therefore some of the failures of the R-M model we found are traceable to its trying to handle the regular rule with an architecture inappropriate to the regular rule.

We therefore predict that these failures should be seen in other network models that compute the regular past tense alternation using pattern associators with distributed phonological representations (*not* all conceivable network models, in general, in principle, forever, etc.). This prediction has been confirmed. Egedi and Sproat (1988) devised a network model that retained the assumption of associations between distributed phonological representations but otherwise differed radically from the R-M model: it had three layers, not two; it used a back-propagation learning rule, not just the simple perceptron convergence procedure; it used position-specific phonological features, not context-dependent ones; and it had a completely different output decoder. Nonetheless its successes and failures were virtually identical to those of the R-M model.

(2) You claim that "the regularities you describe -- both in the irregulars and the regulars -- are PRECISELY the kinds of invariances you would expect a statistical pattern learner that was sensitive to higher order correlations to be able to learn successfully. In particular, the form-independent default option for the regulars should be readily inducible from a representative sample." This is an interesting claim and we strongly encourage you to back it up with argument and analysis; a real demonstration of its truth would be a significant advance. It's certainly false of the R-M and Egedi-Sproat models. There's a real danger in this kind of glib commentary of trivializing the issues by assuming that net models are a kind of miraculous wonder tissue that can do anything.
The brilliance of the Rumelhart and McClelland (1986) paper is that they studiously avoided this trap. In the section of their paper called "Learning regular and exceptional patterns in a pattern associator" they took great pains to point out that pattern associators are good at specific things, especially exploiting statistical regularities in the mapping from one set of featural patterns to another. They then made the interesting empirical claim that these basic properties of the pattern associator model lie at the heart of the acquisition of the past tense. Indeed, the properties of the model afforded it some interesting successes with the *irregular* alternations, which fall into family resemblance clusters of the sort that pattern associators handle in interesting ways. But it is exactly these properties of the model that made it fail at the *regular* alternation, which does not form family resemblance clusters.

We like to think that these kinds of comparisons make for productive empirical science. The successes of the pattern associator architecture for irregulars teach us something about the psychology of the irregulars (basically a memory phenomenon, we argue), and its failures for the regulars teach us something about the psychology of the regulars (use of a default rule, we argue). Rumelhart and McClelland disagree with us over the facts but not over the key empirical tests. They hold that pattern associators have particular aptitudes that are suited to modeling certain kinds of processes, which they claim are those of cognition. One can argue for or against this and learn something about psychology while so doing. Your claim about a 'statistical pattern learner...sensitive to higher order correlations' is essentially impossible to evaluate.

(3) We're mystified that you attribute to us the claim that "past tense formation is not learnable in principle."
The implication is that our critique of the R-M model was based on the assertion that the rule is unlearned and that this is the key issue separating us from R&M. Therefore -- you seem to reason -- if the rule is learned, it is learned by a network. But both parts are wrong. No one in his right mind would claim that the English past tense rule is "built in". We spent a full seven pages (130-136) of 'OLC' presenting a simple model of how the past tense rule might be learned by a symbol manipulation device. So obviously we don't believe it can't be learned. The question is how children in fact do it.

The only way we can make sense of this misattribution is to suppose that you equate "learnable" with "learnable by some (nth-order) statistical algorithm". The underlying presupposition is that statistical modeling (of an undefined character) has some kind of philosophical priority over other forms of analysis; so that if statistical modeling seems somehow possible-in-principle, then rule-based models (and the problems they solve) can be safely ignored. As a kind of corollary, you seem to assume that unless the input is so impoverished as to rule out all statistical modeling, rule theories are irrelevant; that rules are impossible without major stimulus-poverty.

In our view, the question is not CAN some (ungiven) algorithm 'learn' it, but DO learners approach the data in that fashion. Poverty-of-the-stimulus considerations are one out of many sources of evidence on this issue. (In the case of the past tense rule, there is a clear P-of-S argument for at least one aspect of the organization of the inflectional system: across languages, speakers automatically regularize verbs derived from nouns and adjectives (e.g., 'he high-sticked/*high-stuck the goalie'; 'she braked/*broke the car'), despite virtually no exposure to crucial informative data in childhood.
This is evidence that the system is built around representations corresponding to the constructs 'word', 'root', and 'irregular'; see OLC 110-114.)

(4) You bring up the old distinction between rules that describe overall behavior and rules that are explicitly represented in a computational device and play a causal role in its behavior. Perhaps, as you say, "these are not crisp issues, and hence not a solid basis for a principled critique". But it was Rumelhart and McClelland who first brought them up, and it was the main thrust of their paper. We tend to agree with them that the issues are crisp enough to motivate interesting research, and don't just degenerate into discussions of logical possibilities. We just disagree about which conclusions are warranted.

We noted that (a) the R-M model is empirically incorrect, therefore you can't use it to defend any claims for whether or not rules are explicitly represented; (b) if you simply wire up a network to do exactly what a rule does, by making every decision about how to build the net (which features to use, what its topology should be, etc.) by consulting the rule-based theory, then that's a clear sense in which the network "implements" the rule. The reason is that the hand-wiring and tweaking of such a network would not be motivated by principles of connectionist theory; at the level at which the manipulations are carried out, the units and connections are indistinguishable from one another and could be wired together any way one pleased. The answer to the question "Why is the network wired up that way?" would come from the rule-theory; for example, "Because the regular rule is a default operation that is insensitive to stem phonology". Therefore in the most interesting sense such a network *is* a rule. The point carries over to more complex cases, where one would have different subnetworks corresponding to different parts of rules.
Since it is the fact that the network implements such-and-such a rule that is doing the work of explaining the phenomenon, the question now becomes, is there any reason to believe that the rule is implemented in that way rather than some other way?

Please note that we are *not* asserting that no PDP model of any sort could ever acquire linguistic knowledge without directly implementing linguistic rules. Our hope, of course, is that as the discussion proceeds, models of all kinds will become more sophisticated and ambitious. As we said in our Conclusion, "These problems are exactly that, problems. They do not demonstrate that interesting PDP models of language are impossible in principle. At the same time, they show that there is no basis for the belief that connectionism will dissolve the difficult puzzles of language, or even provide radically new solutions to them."

So to answer the catechism:

(a) Do we believe that English past tense formation is not learnable? Of course we don't!

(b) If it is learnable, is it specifically unlearnable by nets? No, there may be some nets that can learn it; certainly any net that is intentionally wired up to behave exactly like a rule-learning algorithm can learn it. Our concern is not with (the mathematical question of) what nets can or cannot do in principle, but with which theories are true, and our analysis was of pattern associators using distributed phonological representations. We showed that it is unlikely that human children learn the regular rule the way such a pattern associator learns the regular rule, because it is simply the wrong tool for the job. Therefore it's not surprising that the developmental data confirm that children do not behave the way such a pattern associator behaves.

(c) If past tense formation is learnable by nets, but only if the invariance that the net learns and that causally constrains its successful performance is describable as a "rule", what's wrong with that? Absolutely nothing!
-- just like there's nothing wrong with saying that past tense formation is learnable by a bunch of precisely-arranged molecules (viz., the brain) such that the invariance that the molecules learn, etc. etc. The question is, what explains the facts of human cognition? Pattern associator networks have some interesting properties that can shed light on certain kinds of phenomena, such as irregular past tense forms. But it is simply a fact about the regular past tense alternation in English that it is not that kind of phenomenon. You can focus on the interesting empirical properties of pattern associators, and use them to explain certain things (but not others), or you can generalize them to a class of universal devices that can explain nothing without appeals to the rules that they happen to implement. But you can't have it both ways.

Steven Pinker
Department of Brain and Cognitive Sciences
E10-018
MIT
Cambridge, MA 02139
steve@cogito.mit.edu

Alan Prince
Program in Cognitive Science
Department of Psychology
Brown 125
Brandeis University
Waltham, MA 02254-9110
prince@brandeis.bitnet

References:

Egedi, D.M. and R.W. Sproat (1988) Neural Nets and Natural Language Morphology, AT&T Bell Laboratories, Murray Hill, NJ 07974.

Pinker, S. & Prince, A. (1988) On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73-193. Reprinted in S. Pinker & J. Mehler (Eds.), Connections and symbols. Cambridge, MA: Bradford Books/MIT Press.

Prince, A. & Pinker, S. (1988) Rules and connections in human language. Trends in Neurosciences, 11, 195-202.

Rumelhart, D. E. & McClelland, J. L. (1986) On learning the past tenses of English verbs. In J. L. McClelland, D. E. Rumelhart, & The PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Volume 2: Psychological and biological models. Cambridge, MA: Bradford Books/MIT Press.
-------------------------------------------------------------
Posted for Pinker & Prince by:
--
Stevan Harnad
ARPANET: harnad@mind.princeton.edu harnad@princeton.edu harnad@confidence.princeton.edu srh@flash.bellcore.com harnad@mind.uucp
BITNET: harnad%mind.princeton.edu@pucc.bitnet
UUCP: princeton!mind!harnad
CSNET: harnad%mind.princeton.edu@relay.cs.net