Xref: utzoo comp.ai:2390 comp.lang.prolog:1333 Path: utzoo!utgpu!attcan!uunet!husc6!bbn!oberon!sm.unisys.com!aero!mcguire From: mcguire@antares.aero.org (Rod McGuire) Newsgroups: comp.ai,comp.lang.prolog Subject: Re: Concept Learning & ID3 (Quinlan) - in prolog Message-ID: <39500@aero.ARPA> Date: 15 Oct 88 01:27:52 GMT References: <395@uiag.UUCP> Sender: news@aero.ARPA Reply-To: mcguire@aero.aerospace.org (Rod McGuire) Organization: The Aerospace Corporation, El Segundo, CA Lines: 60 In article <395@uiag.UUCP> gerrit@uiag.UUCP (Cap Gerrit) writes: >Does anybody out there in the world has an implementation of >the ID3 algorithm of Quinlan Prolog is one of the worst possible languages in which to write the ID3 algorithm. I don't know if it can be written to be efficient. The problem is that one needs to count up statistics in parallel and then somewhat randomly access these statistics. While in prolog it is possible to kludge up arrays with mutable elements, I will bet that the overhead of this technique makes a prolog implementation slower by a factor of at least 100 to say a fortran implementation. Since the ID3 algorithm is usually applied to a moderate amount of data (let's say, to construct a tree discriminating 10 classes from analysis of 5000 attribute vectors each with 20 attributes that can take on 1 of 5 values), I think that this performance difference can make prolog implementation unusable. Also, I think any prolog version that strives for efficiency is likely to be ugly and far removed from the functional specification of the algorithm. However I would love to see an analysis that proves me wrong. Below, I present in lisp the central part of the ID3 algorithm - the definition of the metric B(a U) which gives a value for the goodness of attribute "a" as a discriminant for the set of class-attribute-vectors "U". Following that is an array-based implementation for the time consuming parts. Let U be a set of class-attribute-vectors where each element "u" is a "cav" data-structure defined as: (cav-class u) = c, the class determined by vector u. (range 1 to nc) (cav-av u a) = v, the value for attribute a in vector u. (range 1 to nv) ; the metric B for splitting U on attribute "a" is defined as: (defun (B U a) (/ (sum (v 1 nv) ; sum for v=1 to nv (- (sum (c 1 nc) (* (N c a v U) (log (N c a v U)))) (* (S a v U) (log (S a v U))))) (size-of U))) where (S a v U) = number of elements u in U s.t. (cav-av u a) = v and (N c a v U) = number of elements u in U s.t. (cav-av u a) = v & (cav-class u) = c In order to avoid processing the elements in U over and over again, it is reasonable to pre-compute N (as below) and define S in terms of N. (defvar N (make-array (list nc na nv))) (loop for v in U do (loop for a from 1 to na do (increment (aref N (cav-class v) a (cav-av v a)))))