Path: utzoo!censor!geac!torsqnt!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!hplabs!hpcc05!hpwrce!kingsley
From: kingsley@hpwrce.HP.COM (Kingsley Morse)
Newsgroups: comp.ai
Subject: Re: Re: The AI Breakthough -- What It Will Be Like !!!
Message-ID: <1870008@hpwrce.HP.COM>
Date: 12 Nov 90 21:32:34 GMT
References: <1990Nov5.145716.19427@cec1.wustl.edu>
Organization: HP Western Response Center
Lines: 34

ddj7203@cec1.wustl.edu (David D. Jensen)  writes:

> How well does the algorithm classify new cases?  
> just take a pool of (what we think are) representative problems
> and test the various algorithms.

This is a good point that I've been struggling with in my research. Indeed,
a classification algorithm's ability to generalize what it learned from the 
training set to correctly classify new cases IS important. But frankly, I 
don't know how to match a classification algorithm with a representative 
problem. In other words, if I have a classification problem in mind, and 
training data in hand, how do I know which classification algorithm will best 
classify new cases? It seems to me that testing classification algorithms
with representative problems would be unreliable; there's always another
classification algorithm that partitions the space so it is more like the 
the test data. For example, the nearest neighbor algorithm will work well
with test data that is centrally clustered in euclidian space. Multiple
regression will work well with test data that is partitioned into linear
regions. But how do you know in advance if the test data is clustered in 
euclidian spheres, linear regions, etc....? Even then, it's a safe bet that a 
nearest neighbor algorithm could approximate a multiple regression algorithm, 
or vica versa, given enough training patterns. This is
similar to a Turing machine, in which a simple cpu can emulate any other cpu
given enough time. In this case, we can assume that most classification 
algorithms can emulate most other classification algorithms, given enough 
time and training patterns. So although any classification algorithm will do,
given enough time and training patterns, perhaps we should develop an 
algorithm which is capable of partitioning the data space several ways, ie:
euclidian, linear, etc....., then focus on that partitioning which classifies
more new cases in less time. 

In summary, would it make sense to develop a classification algorithm which 
can partition a decision space several ways, then focus on the way that
correctly classifies new cases fastest?