Path: utzoo!attcan!uunet!husc6!mailrus!csd4.milw.wisc.edu!indri!rhesus!bin
From: bin@primate.wisc.edu (Brain in Neutral)
Newsgroups: comp.text
Subject: How should uniqbib look for "near" duplicates?
Message-ID: <455@rhesus.primate.wisc.edu>
Date: 22 Dec 88 18:21:59 GMT
Organization: UW-Madison Primate Center
Lines: 24

A short while ago, I posted "uniqbib", a program for eliminating
duplicates from bibliographic databases in refer format.  My original
motivation was to allow the results of several overlapping lookbib
queries to be combined, filtering out those references that were hits
on more than one query.  In such cases you know the entries will be
identical.

I've been in correspondence now with several people who have expressed
an interest in looking for "near" duplicates, such as might arise when
entries are added to bibliographic databases by different people.
In this instance, entries may be "the same" to a human, but actually
slightly different - a journal title might be abbreviated by one person
and not the other.

Several strategies for finding near duplicates have been suggested to
me, and I've thought of several others.  I'm asking for comment from
the net on this issue.  Given two entries, how would you determine
whether they are the same?  (Phrased another way, how would you
estimate the distance between two entries?)
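
To make the "distance" phrasing concrete, here is a minimal sketch in
Python of one possible approach.  It is not how uniqbib works and not
one of the suggested strategies; it is only an illustration.  It
assumes each refer field sits on a single line of the form "%X value",
and it compares the fields two entries have in common using the
standard difflib module.  The names parse_refer, normalize and
distance are hypothetical.

```python
import re
from difflib import SequenceMatcher


def parse_refer(text):
    """Split one refer entry into {field letter: value}.

    Assumption: one field per line ("%A J. Smith"); repeated fields
    (e.g. several %A lines) are joined into one string.
    """
    fields = {}
    for line in text.strip().splitlines():
        m = re.match(r"%(\S)\s+(.*)", line)
        if m:
            key, value = m.groups()
            fields[key] = (fields.get(key, "") + " " + value).strip()
    return fields


def normalize(value):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    value = re.sub(r"[^a-z0-9 ]", " ", value.lower())
    return " ".join(value.split())


def distance(a, b):
    """Return a value in [0, 1]: 0 for identical entries, 1 for no overlap.

    Illustrative choice: average string similarity over shared fields.
    """
    fa, fb = parse_refer(a), parse_refer(b)
    shared = set(fa) & set(fb)
    if not shared:
        return 1.0
    sims = [
        SequenceMatcher(None, normalize(fa[k]), normalize(fb[k])).ratio()
        for k in shared
    ]
    return 1.0 - sum(sims) / len(sims)


if __name__ == "__main__":
    # Made-up entries, differing only in how the journal is abbreviated.
    full = "%A J. Smith\n%T A Study of Things\n%J Journal of Irreproducible Results\n%D 1987"
    abbr = "%A J. Smith\n%T A Study of Things\n%J J. Irreprod. Res.\n%D 1987"
    print(distance(full, abbr))
```

Two entries would then be called near duplicates when the distance
falls below some threshold.  Where to put the threshold, and whether
some fields (say, %A and %D) should count for more than others, are
exactly the open questions.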

I would prefer that responses be posted.  Thanks.

Paul DuBois
dubois@primate.wisc.edu	rhesus!dubois
bin@primate.wisc.edu	rhesus!bin