Path: utzoo!censor!geac!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!hplabs!hpcc05!hpwrce!kingsley From: kingsley@hpwrce.HP.COM (Kingsley Morse) Newsgroups: comp.ai Subject: Re: Re: The AI Breakthough -- What It Will Be Like !!! Message-ID: <1870008@hpwrce.HP.COM> Date: 12 Nov 90 21:32:34 GMT References: <1990Nov5.145716.19427@cec1.wustl.edu> Organization: HP Western Response Center Lines: 34 ddj7203@cec1.wustl.edu (David D. Jensen) writes: > How well does the algorithm classify new cases? > just take a pool of (what we think are) representative problems > and test the various algorithms. This is a good point that I've been struggling with in my research. Indeed, a classification algorithm's ability to generalize what it learned from the training set to correctly classify new cases IS important. But frankly, I don't know how to match a classification algorithm with a representative problem. In other words, if I have a classification problem in mind, and training data in hand, how do I know which classification algorithm will best classify new cases? It seems to me that testing classification algorithms with representative problems would be unreliable; there's always another classification algorithm that partitions the space so it is more like the the test data. For example, the nearest neighbor algorithm will work well with test data that is centrally clustered in euclidian space. Multiple regression will work well with test data that is partitioned into linear regions. But how do you know in advance if the test data is clustered in euclidian spheres, linear regions, etc....? Even then, it's a safe bet that a nearest neighbor algorithm could approximate a multiple regression algorithm, or vica versa, given enough training patterns. This is similar to a Turing machine, in which a simple cpu can emulate any other cpu given enough time. In this case, we can assume that most classification algorithms can emulate most other classification algorithms, given enough time and training patterns. So although any classification algorithm will do, given enough time and training patterns, perhaps we should develop an algorithm which is capable of partitioning the data space several ways, ie: euclidian, linear, etc....., then focus on that partitioning which classifies more new cases in less time. In summary, would it make sense to develop a classification algorithm which can partition a decision space several ways, then focus on the way that correctly classifies new cases fastest?