Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!ll-xn!cit-vax!amdahl!rtech!bobm
From: bobm@rtech.UUCP (Bob Mcqueer)
Newsgroups: net.news.adm
Subject: Re: a foolish consistency is the hobgoblin of little minds
Message-ID: <427@rtech.UUCP>
Date: Sat, 30-Aug-86 02:43:13 EDT
Article-I.D.: rtech.427
Posted: Sat Aug 30 02:43:13 1986
Date-Received: Sat, 30-Aug-86 10:04:21 EDT
References: <15454@ucbvax.BERKELEY.EDU>
Organization: Relational Technology Inc, Alameda CA
Lines: 49

[]----
Hmmmmm. What a coincidence. I've been toying around lately with
putting usenet stuff into a relational database (look who I work for
and guess which one) to be able to do nice sophisticated queries
thereon. I decided that the #L lines were being entered in a pretty
"loose" fashion. For what it's worth, here are my decisions on how to
make sense out of a reasonable number of those lines:

1) I assign an accuracy of 0 - 4. 0 means I couldn't make sense out
   of the line, and you get a default based on country / region
   (default to the South Pole if I don't recognize the country).
   1 means degrees accuracy, 3 means degrees / minutes and 4 means
   degrees / minutes / seconds. 2 indicates better than "degrees",
   but "city" was attached. The rationale is that a city is generally
   bigger than a minute, but even the largest metropolitan areas are
   only on the order of one degree (of course the further north
   they're situated, the easier for them to cover multiple degrees of
   longitude).

2) I split into two pieces on the "/" first. Most people seem to be
   obeying this one.

3) For each piece, I look for "NnSsWwEe" first, assuming "N" and "W"
   if I don't find them (I know - geographical chauvinism). Then I
   use strtok() to grab things delimited by " \tNnSsWwEe\"'", and
   take the numbers as degrees / minutes / seconds, stopping early if
   I run out of numeric tokens. Then I see if the stopping token was
   "city".

4) The whole specification then gets the lower accuracy of the two
   pieces.

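The four rules above might be sketched in C roughly as follows. This is an illustration only, not the actual program: the names parse_piece, parse_location, is_city, struct coord and struct location are all made up here. It also assumes the leading "#L" tag has already been stripped, that "city" only lowers the accuracy to 2 when at least degrees and minutes were found, and that every unparseable line falls back to the South Pole (the country / region default lookup is left out).

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* All names here are hypothetical; this is not the original program. */
struct coord    { double value; int accuracy; };    /* one half of the line */
struct location { double lat, lon; int accuracy; }; /* the combined result  */

/* Case-insensitive test for the word "city". */
static int is_city(const char *tok)
{
    const char *word = "city";
    while (*tok && *word && tolower((unsigned char)*tok) == *word) {
        tok++;
        word++;
    }
    return *tok == '\0' && *word == '\0';
}

/* Rule 3: parse one piece in place, e.g. " 122 25 W city". */
static struct coord parse_piece(char *piece, char default_hemi)
{
    struct coord c = {0.0, 0};
    double part[3] = {0.0, 0.0, 0.0};   /* degrees, minutes, seconds */
    int n = 0, city = 0;
    char hemi = default_hemi;
    char *tok, *end;

    /* Hemisphere letter first, before strtok() overwrites delimiters. */
    const char *h = strpbrk(piece, "NnSsWwEe");
    if (h != NULL)
        hemi = (char)toupper((unsigned char)*h);

    for (tok = strtok(piece, " \tNnSsWwEe\"'"); tok != NULL;
         tok = strtok(NULL, " \tNnSsWwEe\"'")) {
        double v = strtod(tok, &end);
        if (n < 3 && end != tok && *end == '\0') {
            part[n++] = v;              /* another numeric token */
        } else {
            city = is_city(tok);        /* the stopping token */
            break;
        }
    }

    c.value = part[0] + part[1] / 60.0 + part[2] / 3600.0;
    if (hemi == 'S' || hemi == 'W')
        c.value = -c.value;             /* south and west are negative */

    if (n == 1)      c.accuracy = 1;    /* degrees only */
    else if (n == 2) c.accuracy = 3;    /* degrees / minutes */
    else if (n == 3) c.accuracy = 4;    /* degrees / minutes / seconds */
    if (city && n >= 2)
        c.accuracy = 2;                 /* finer than degrees, but "city" */
    return c;
}

/* Rules 2 and 4: split on "/", parse both halves, keep the lower accuracy. */
struct location parse_location(const char *spec)
{
    struct location loc = {-90.0, 0.0, 0};  /* assumed fallback: South Pole */
    struct coord la, lo;
    char buf[256];
    char *slash;

    assert(spec != NULL);
    strncpy(buf, spec, sizeof buf - 1);     /* work on a private copy */
    buf[sizeof buf - 1] = '\0';

    slash = strchr(buf, '/');
    if (slash == NULL)
        return loc;                         /* no "/" at all: accuracy 0 */
    *slash = '\0';

    la = parse_piece(buf, 'N');             /* latitude defaults to N */
    lo = parse_piece(slash + 1, 'W');       /* longitude defaults to W */
    loc.accuracy = la.accuracy < lo.accuracy ? la.accuracy : lo.accuracy;
    if (loc.accuracy > 0) {
        loc.lat = la.value;
        loc.lon = lo.value;
    }
    return loc;
}
```

Under those assumptions, parse_location("51 30 N / 0 07 W city") gives accuracy 2 (the latitude half alone would rate 3, but the "city" on the longitude half pulls the whole line down), while "45 / 93" gives accuracy 1 at 45 N, 93 W by the default hemispheres. Since strtok() keeps hidden state, the two halves have to be parsed one after the other, as they are here.
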
This represents my guess for something that will work a reasonable
amount of the time. The "accuracy" lets you filter out the less
reliable ones depending on what you're doing with the data.

If you want something even MORE hopeless, try to parse reasonable
sub-pieces out of the telephone number lines. Not only do you have
non-uniformity of entry, but what constitutes an interesting sub-part
(such as US / Canadian area codes, which may sometimes be more useful
than geographic information) differs from country to country, and
there are various alternate carriers, extension specifications, etc.
I'm trying to fish out the area codes, and it seems to me that those
rules may also produce a useful sub-piece of the phone number for
other countries as well, but I'm not sure, since I don't know what the
"breaks" in those phone numbers mean.

THEN, what about the date that's supposed to be entered in the #W
line......

Bob McQueer
--
{amdahl, sun, mtxinu, hoptoad, cpsc6a}!rtech!bobm