Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!ll-xn!cit-vax!amdahl!rtech!bobm
From: bobm@rtech.UUCP (Bob Mcqueer)
Newsgroups: net.news.adm
Subject: Re: a foolish consistency is the hobgoblin of little minds
Message-ID: <427@rtech.UUCP>
Date: Sat, 30-Aug-86 02:43:13 EDT
Article-I.D.: rtech.427
Posted: Sat Aug 30 02:43:13 1986
Date-Received: Sat, 30-Aug-86 10:04:21 EDT
References: <15454@ucbvax.BERKELEY.EDU>
Organization: Relational Technology Inc, Alameda CA
Lines: 49

[]----

Hmmmmm.  What a coincidence.  I've been toying around lately with putting
usenet stuff into a relational database (look who I work for and guess
which one) to be able to do nice sophisticated queries thereon.  I decided
that the #L lines were being entered in a pretty "loose" fashion.  For what
it's worth, here are my decisions on how to make sense out of a reasonable
number of those lines:

1) I assign an accuracy 0 - 4.  0 means I couldn't make sense out of the
line, and you get a default based on country / region (default to the
South Pole if I don't recognize the country).  1 means degrees accuracy,
3 means degrees / minutes and 4 degrees / minutes / seconds.  2 means the
numbers were better than "degrees", but "city" was attached.  The rationale
is that a city is generally bigger than a minute of arc, but even the
largest metropolitan areas are only on the order of one degree (of course
the further north they're situated, the easier for them to cover multiple
degrees of longitude).

2) I split into two pieces on the "/" first.  Most people seem to be
obeying this one.

3) for each piece, I look for "NnSsWwEe" first, assuming "N" and "W" if I
don't find them (I know - geographical chauvinism).  Then I use strtok() to
grab things delimited by " \tNnSsWwEe\"'", and take the numbers as
degrees / minutes / seconds, stopping early if I run out of numeric tokens.
Then I see if the stopping token was "city".

4) the whole specification then gets the lower accuracy of the two pieces.
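
Steps 2 through 4 could be sketched in C roughly as follows.  This is an
illustration only, not the code described above: the names struct piece,
parse_piece() and parse_location() are made up, the country / region
default for accuracy 0 is left out, and two details are guesses: only
all-digit tokens count as numbers, and "city" lowers the accuracy to 2
only when minutes or better were given.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* One half of a "lat / long" specification (hypothetical layout). */
struct piece {
    int  dms[3];    /* degrees, minutes, seconds; unused fields stay 0 */
    char hemi;      /* 'N', 'S', 'E' or 'W' */
    int  accuracy;  /* 0..4, as in item 1 of the post */
};

/* Case-insensitive test for the literal word "city". */
static int is_city(const char *tok)
{
    const char *want = "city";

    if (tok == NULL)
        return 0;
    while (*tok && *want && tolower((unsigned char)*tok) == *want) {
        tok++;
        want++;
    }
    return *tok == '\0' && *want == '\0';
}

/* Assumption: a token is numeric only if it consists entirely of digits. */
static int is_number(const char *tok)
{
    if (tok == NULL || *tok == '\0')
        return 0;
    for (; *tok; tok++)
        if (!isdigit((unsigned char)*tok))
            return 0;
    return 1;
}

/* Step 3: parse one side of the "/" into degrees / minutes / seconds. */
static void parse_piece(const char *text, char default_hemi, struct piece *out)
{
    static const char delims[] = " \tNnSsWwEe\"'";
    char buf[128];
    const char *h;
    char *tok;
    int n = 0;

    assert(text != NULL && out != NULL);
    memset(out, 0, sizeof *out);
    strncpy(buf, text, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    /* Look for a hemisphere letter first; fall back to the default. */
    h = strpbrk(buf, "NnSsWwEe");
    out->hemi = h ? (char)toupper((unsigned char)*h) : default_hemi;

    /* Take up to three numeric tokens; tok ends up as the stopping token. */
    tok = strtok(buf, delims);
    while (n < 3 && is_number(tok)) {
        out->dms[n++] = atoi(tok);
        tok = strtok(NULL, delims);
    }

    if (n == 0)
        out->accuracy = 0;              /* could not make sense of it */
    else if (n == 1)
        out->accuracy = 1;              /* degrees only */
    else if (is_city(tok))
        out->accuracy = 2;              /* finer than degrees, but "city" */
    else
        out->accuracy = (n == 2) ? 3 : 4;
}

/* Steps 2 and 4: split on "/", parse both halves, keep the lower accuracy. */
int parse_location(const char *line, struct piece *lat, struct piece *lon)
{
    char buf[256];
    char *slash;

    assert(line != NULL && lat != NULL && lon != NULL);
    strncpy(buf, line, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    slash = strchr(buf, '/');
    if (slash == NULL) {                /* no "/" at all: accuracy 0 */
        memset(lat, 0, sizeof *lat);
        memset(lon, 0, sizeof *lon);
        return 0;
    }
    *slash = '\0';
    parse_piece(buf, 'N', lat);         /* default hemisphere: north */
    parse_piece(slash + 1, 'W', lon);   /* default hemisphere: west */
    return lat->accuracy < lon->accuracy ? lat->accuracy : lon->accuracy;
}
```

Under these assumptions, parse_location() on "37 46 N / 122 25 W city"
gives accuracy 3 for the latitude, 2 for the longitude, and so 2 overall.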

This represents my guess for something that will work a reasonable amount
of the time.  The "accuracy" lets you filter the less reliable ones
depending on what you're doing with the data.

If you want something even MORE hopeless, try to parse reasonable sub-pieces
out of the telephone number lines.  Not only do you have non-uniformity of
entry, but what constitutes an interesting sub-part (such as US / Canadian
area codes which may be more useful than geographic information sometimes)
differs from country to country, and there are various alternate carriers,
extension specifications, etc.  I'm trying to fish out the area codes, and
it seems to me that those rules may also produce a useful sub-piece of the
phone number for other countries as well, but I'm not sure, since I don't
know what the "breaks" in those phone numbers mean.

THEN, what about the date that's supposed to be entered in the #W line......

Bob McQueer
-- 
{amdahl, sun, mtxinu, hoptoad, cpsc6a}!rtech!bobm