Xref: utzoo ont.events:1290 uw.talks:16 uw.cs.grad:20 Path: utzoo!utgpu!watmath!watdragon!daemon From: daemon@watdragon.waterloo.edu (Owner of Many System Processes) Newsgroups: ont.events,uw.talks,uw.cs.grad Subject: MASTER'S THESIS PRESENTATION Keywords: Mr. George Townsend, graduate student, Message-ID: <16251@watdragon.waterloo.edu> Date: 6 Sep 89 18:35:50 GMT Distribution: ont Organization: U of Waterloo, Ontario Lines: 64 Dept. of Computer Science, will speak on ``A Model for Approximate Text Matching.'' From: wlrush@poppy.waterloo.edu (Wenchantress Wench Wendall) Path: poppy!wlrush DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF WATERLOO SEMINAR ACTIVITIES MASTER'S THESIS PRESENTATION -Monday, September 11, 1989 Mr. George Townsend, graduate student, Dept. of Computer Science, will speak on ``A Model For Approximate Text Matching.'' TIME: 3:30 p.m. ROOM: DC 1302 ABSTRACT This thesis introduces the concept of Approximate Text Matching, a natural extension of approximate string matching. A hierarchical model is presented upon which solutions to this problem may be based. Traditional approaches to approximate string matching deal with transformations which, when applied at the string layer, cause two strings to become equivalent at the character layer. If the transformations required to achieve this are "reasonable" as determined by some heuristics, the two original strings are claimed to be approximately equal. This traditional model consists of two layers, the string layer and the character layer, where the string layer is viewed as a sequence of characters. The typical transformations applied to the string layer are transposition of characters, removal of superfluous characters, and addition of missing characters. In the model presented here, these same notions are applied at multiple layers of the problem. "Text" is viewed as a sequence of "fields of data", or phrases. A "field of data" is viewed as a sequence of "unit strings" or words. Each "unit string" is viewed, in the traditional manner, as a sequence of characters. The traditional transformations considered in approximate string matching are applied at each of these layers in a general way. The advantages of the model are twofold. First, the model understands the structure of the data, and will therefore search for matches for a particular field in an appropriate place. Secondly, the failure of the model at one layer is often compensated for at the next highest layer. The Oxford English Dictionary was used as a testing ground for algorithms based upon this model. Citations occurring in the body were matched against their sources in the O.E.D. bibliography. On two samples of data, correct matches were made over 90% of the time.