Xref: utzoo ont.events:1290 uw.talks:16 uw.cs.grad:20
Path: utzoo!utgpu!watmath!watdragon!daemon
From: daemon@watdragon.waterloo.edu (Owner of Many System Processes)
Newsgroups: ont.events,uw.talks,uw.cs.grad
Subject: MASTER'S THESIS PRESENTATION
Keywords: Mr. George Townsend, graduate student,
Message-ID: <16251@watdragon.waterloo.edu>
Date: 6 Sep 89 18:35:50 GMT
Distribution: ont
Organization: U of Waterloo, Ontario
Lines: 64

Dept. of Computer Science, will speak on 
``A Model for Approximate Text Matching.''
From: wlrush@poppy.waterloo.edu (Wenchantress Wench Wendall)
Path: poppy!wlrush

DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF WATERLOO
SEMINAR ACTIVITIES

MASTER'S THESIS PRESENTATION

                    -Monday, September 11, 1989

Mr.   George   Townsend,  graduate  student,  Dept.  of
Computer   Science,   will   speak  on  ``A  Model  For
Approximate Text Matching.''

TIME:                 3:30 p.m.

ROOM:                 DC 1302

ABSTRACT

     This    thesis   introduces    the    concept   of
Approximate  Text  Matching,   a  natural extension  of
approximate string matching.  A  hierarchical model  is
presented   upon which solutions to this problem may be
based.
     Traditional   approaches   to  approximate  string
matching   deal  with   transformations   which,   when
applied   at   the string layer,  cause two strings  to
become   equivalent  at  the   character layer.  If the
transformations   required   to   achieve    this   are
"reasonable"  as  determined  by  some heuristics,  the
two  original strings  are claimed  to be approximately
equal.
     This  traditional  model consists  of two  layers,
the string  layer and  the character  layer, where  the
string  layer  is  viewed as a  sequence of characters.
The   typical   transformations     applied    to   the
string    layer    are  transposition   of  characters,
removal   of  superfluous characters,  and  addition of
missing  characters.   In  the  model   presented here,
these  same notions  are applied at multiple layers  of
the  problem.   "Text"  is   viewed  as  a sequence  of
"fields  of data",  or  phrases. A  "field of data"  is
viewed  as a sequence of "unit strings" or words.  Each
"unit string" is  viewed, in the traditional manner, as
a     sequence    of   characters.   The    traditional
transformations    considered   in  approximate  string
matching  are  applied  at  each  of  these layers in a
general way.
     The  advantages of  the model are twofold.  First,
the  model  understands   the structure  of  the  data,
and    will   therefore   search   for  matches  for  a
particular field in an appropriate place. Secondly, the
failure    of   the  model   at  one  layer   is  often
compensated for at  the next  highest layer.
     The  Oxford  English  Dictionary  was  used  as  a
testing  ground  for  algorithms based upon this model.
Citations  occurring  in  the body were matched against
their  sources  in  the  O.E.D.  bibliography.   On two
samples  of data, correct matches were made over 90% of
the time.