Path: utzoo!utgpu!water!watmath!clyde!bellcore!rutgers!mailrus!uflorida!novavax!proxftl!twwells!bill
From: bill@twwells.uucp (T. William Wells)
Newsgroups: comp.archives
Subject: Comp.archives database format
Message-ID: <116@twwells.uucp>
Date: 24 Oct 88 23:16:58 GMT
Reply-To: bill@twwells.UUCP (T. William Wells)
Organization: None, Ft. Lauderdale
Lines: 374
Approved: bill@twwells.UUCP (T. William Wells)

This is my first attempt at the structure of the database that
comp.archives is intended to feed. I am also going to describe
the comp.archives postings used to maintain the database. None of
this is cast in stone and critiques are welcome.

Here is the example archive site entry from my previous message.
Following it is a line-by-line description.

NM twwells.UUCP
EN bill@twwells.UUCP (T. William Wells) 1988 Oct 21
AD bill@twwells.UUCP (T. William Wells)
MA 781 W. Oakland Pk Blvd #208, Ft. Lauderdale FL 33311
CO uucp:uucp:~:twwells Any1800-0800 ACU 2400 13059876543 in:-\r-in: arcuucp
DE This is where comp.archives gets moderated from. I maintain the
DE most up-to-date version of the databases, so if you want
DE them you have to get them directly from me.

NM twwells.UUCP

    This is the site name.

EN bill@twwells.UUCP (T. William Wells) 1988 Oct 21

    This is the person responsible for the entry and the date on
    which the entry was added or updated.

AD bill@twwells.UUCP (T. William Wells)

    This is the person who is responsible for the archive. He may
    or may not be the uucp, news, or system administrator. There
    can be more than one of these.

MA 781 W. Oakland Pk Blvd #208, Ft. Lauderdale FL 33311

    The mailing address for help or information. Don't include
    this unless you want snail-mail. People who mail to this
    address had better include a SASE or e-mail address or forget
    about getting any response.

CO uucp:uucp:~:twwells Any1800-0800 ACU 2400 13059876543 in:-\r-in: arcuucp

    This contains the information needed to access the archive.
    There can be several of these, depending on how many ways
    your site can be accessed.

    Each line starts with a tag that identifies the access
    method. This is used when not all of your archived
    information is available through all paths to your site. For
    example, you might have a mail-based server for small items
    but require a direct link for larger things. Each item that
    you list as available through your archive has a tag that
    indicates which way it can be accessed.

    There may be more than one line for a single tag. This would
    mean that there is more than one way to get to the same set
    of information.

    The next field describes the access method. This would be
    something like "uucp", or "ftp", or "mail", or whatever.

    The remaining fields depend on the access method. Since I am
    only familiar with uucp, I am only going to describe the
    fields for it. I definitely want input on what is necessary
    for other access methods.

    There are two fields for uucp access. The first is the path
    name which archive file names are relative to. The second is
    an L.sys entry that would be used to access your site.

DE This is where comp.archives gets moderated from. I maintain the
DE most up-to-date version of the databases, so if you want
DE them you have to get them directly from me.

    This is a short description of your site. You might also
    include any special information about your archives; for
    example, if you are willing to make tapes you would say so
    here.

---

Here is a sample entry for the archived information database.
Note that I made this up from a cursory examination of Pcomm;
don't take it as gospel.

NM unix-pcomm
VR version 1.1
AU egray@fthood.UUCP (Emmet P. Gray)
MA egray@fthood.UUCP (Emmet P. Gray)
EN bill@twwells.UUCP (T. William Wells) 1988 Oct 21
TT public domain version of ProComm (TM)
KW all-source,public-domain,datacomm
SY any:modem,sysv-unix:termcaps,install
DE Pcomm is a public domain telecommunication program for Unix that
DE is designed to operate similar to the MSDOS program, ProComm.
DE ProComm (TM) is copyrighted by Datastorm Technologies, Inc. This
DE is a completely new program and contains no ProComm source code.
DE This is not a Datastorm product.

Here is a line-by-line description:

NM unix-pcomm

    The name of the item. If the item is a program that ports to
    one environment, the name is that environment hyphenated with
    the program name; otherwise it is just the name. Note that
    this is not intended to be useful by itself, e.g., unix-pcomm
    might eventually also refer to something that has been made
    to work under VMS.

    Should there be two items with the same name, the later item
    will have its author's name appended. For example, should
    John Turkey later write a pcomm for UNIX, it would be called
    unix-pcomm-turkey.

VR version 1.1

    Some kind of version stamp. If the item does not have
    versions, this is the date released or published, or
    something else indicating when the item came into existence.

AU egray@fthood.UUCP (Emmet P. Gray)

    This is the person or persons who wrote the thing. If there
    is more than one author, use more than one line.

MA egray@fthood.UUCP (Emmet P. Gray)

    This is who is maintaining the item. If the item is not being
    maintained, don't add this line. If several people are
    maintaining it, use several lines. Note that anyone whose
    name is on one of these lines can expect e-mail about the
    item.

EN bill@twwells.UUCP (T. William Wells) 1988 Oct 21

    This is the person responsible for the entry and the date on
    which the entry was added or updated.

TT public domain version of ProComm (TM)

    A title for the item.

KW all-source,public-domain,datacomm

    Keywords describing the item.
    Note the `all-source' keyword, which means that all the
    source needed (other than that of the tools mentioned below)
    is included. Note also the `public-domain' keyword, which
    indicates that the item is in the public domain.

SY any:modem,sysv-unix:termcaps,install

    For each system this item runs on (or must be used on), there
    should be one of these lines. The fields are:

    1) The hardware it runs on. If it runs on any hardware which
       a particular OS runs on, the entry is `any'. Required
       additional hardware is indicated by a colon followed by
       its name, as in `any:modem'.

    2) The OS it runs under. There are several generic names like
       the `sysv-unix' above. Optional OS things which are needed
       are indicated the same way hardware options are. Also,
       software which is not listed in this directory but which
       is needed to make this go is listed here. Multiple entries
       are separated by semicolons. For example, if this is a
       Dbase-II program, you'd have MS-DOS;Dbase-II in this
       field.

    3) How much effort is needed to make it go. If following the
       directions is sufficient, the entry is `install'.

    4) This entry contains any tools, not normally available on
       your system, which one must have in order to build or use
       this item. All items which are in this section must also
       have their own entries in the information directory.

    There may be more than one of these lines, whenever
    necessary.

DE Pcomm is a public domain telecommunication program for Unix that
DE is designed to operate similar to the MSDOS program, ProComm.
DE ProComm (TM) is copyrighted by Datastorm Technologies, Inc. This
DE is a completely new program and contains no ProComm source code.
DE This is not a Datastorm product.

    This is a short description of the item. This should be kept
    brief; putting the man page here is probably not appropriate.

Here is another entry that would go in the information database.

NM free-distribution-database
VR updated continuously
AU bill@twwells.UUCP (T. William Wells)
MA bill@twwells.UUCP (T. William Wells)
EN bill@twwells.UUCP (T. William Wells) 1988 Sep 26
TT Database of freely distributable, electronically accessible information.
KW database,public-domain
SY any,any,install
DE This database is constructed from the information that passes
DE through comp.archives. It contains information on any software,
DE databases, documents, or what-have-you, that is both freely
DE distributable and available electronically. "Freely
DE distributable" means that, if you have a copy of the item, you
DE can (at least) make exact copies and give them away, and you
DE don't have to tell the owner of the item (if any) that you have
DE done so. "Electronically available" means that it is either
DE accessible through a publicly accessible network, or is available
DE by a means that does not involve paying a fee to the
DE distributor. This information is provided as a free service and
DE there is *no one* guaranteeing that any of it is accurate or
DE useful. Use it at your own risk.

---

Here is the meat of the database: the index of things available
from each archive site. This is the format:

archive-name;version;site-name;access-type;access-handle;date;tools;comments

`Archive-name' and `version' match entries in the main database.
If this file is not in the database, leave the fields blank. Note
that this means that you can make available archive information
about things not in the directory; however, this practice is
discouraged.

`Site-name' is the name of the site, as recorded in the site
database.

`Access-type' is one of the access tags specified in the site
entry. Note that this is in the style of UNIX file names: wild
cards are permitted.

`Access-handle' is used with the information from the site entry
to construct the request from the archive. For example, using
uucp, if the site entry contained /usr/archives as the path to
which file names are relative, and this field contains
foobar.shar, then the path name you should use to get this item
is /usr/archives/foobar.shar.

`Date' is the date on which this entry was added to the database.

`Tools' is a list of programs needed to unarchive the file; each
must be a name in the info database. Standard system utilities
are not listed.

`Comments' is anything useful to add.

For example, suppose I have pcomm sitting around in my
directories. I could have these records:

unix-pcomm;version 1.1;twwells;*;pcomm.1.shar.Z;1988 Oct 21;compress;part 1
unix-pcomm;version 1.1;twwells;*;pcomm.2.shar.Z;1988 Oct 21;compress;part 2
unix-pcomm;version 1.1;twwells;*;pcomm.3.shar.Z;1988 Oct 21;compress;part 3
unix-pcomm;version 1.1;twwells;*;pcomm.4.shar.Z;1988 Oct 21;compress;part 4
unix-pcomm;version 1.1;twwells;*;pcomm.5.shar.Z;1988 Oct 21;compress;part 5
unix-pcomm;version 1.1;twwells;*;pcomm.6.shar.Z;1988 Oct 21;compress;part 6
unix-pcomm;version 1.1;twwells;*;pcomm.7.shar.Z;1988 Oct 21;compress;part 7
unix-pcomm;version 1.1;twwells;*;pcomm.8.shar.Z;1988 Oct 21;compress;part 8
unix-pcomm;version 1.1;twwells;*;pcomm.p1.shar.Z;1988 Oct 21;compress;patch 1
unix-pcomm;version 1.1;twwells;*;pcomm.p2.shar.Z;1988 Oct 21;compress;patch 2
unix-pcomm;version 1.1;twwells;*;pcomm.p3.shar.Z;1988 Oct 21;compress;patch 3

This says that

    various pieces of unix-pcomm, version 1.1, are available from
        my site
    they can be accessed through any way that my site can be
        accessed
    the various pieces of it can be accessed with names beginning
        with pcomm
    the entries were added on October 21, 1988
    you need compress to unarchive any of it
    parts 1-8 and patches 1-3 are available

Now, suppose that I had a list of local BBS's that I was willing
to make available. It would have an entry like:

;;twwells;*;bbslist;2001 Jan 1;;bbs systems in south Florida

This says that the file bbslist is available but that it has no
entry in the information database.

---

That leaves the problem of how to distribute this database. Here
are my goals:

1) To minimize the amount of information retransmitted through
   the newsgroup.
   In an ideal world, the data would get transmitted once, and
   everyone would thereafter query archive sites for current
   copies.

2) To minimize the delay in getting the information out. This
   means avoiding batching the data; it would not be very nice
   to hold some archive information just because no one else
   was posting at that time.

3) To minimize the pain of maintaining a database from the
   information which flows through comp.archives.

The first one is the stickiest problem. If I never retransmitted
any data, sites which want to start a database would have to find
someone who was willing to let them have a copy of the database.
Where would they find this information? This means that I need
to, at least, periodically post a minimal database of sites that
are archives for the database.

Now, how do I best serve the needs of the guy who just has one
thing he is looking for? If I send the data just once, he is
unlikely to see it. The alternative is to send it periodically,
with reasonably long expiration dates, so that he can look on his
system.

Anyway, for now, I will do the latter; if the volume gets too
high, then I'll look into some other method.

The second item means posting the information as soon as it comes
in and has been verified. The main drawback to this is that
sometimes the information is incorrectly sent. Putting a delay in
the system results in much of this error being corrected before
it gets out. My own feeling is to make updating the system
reasonably painless, so that if errors like this occur, they can
be fixed reasonably easily.

The third item requires minimizing the information transmitted
which is used to update the database (a worthy goal of its own)
and minimizing the programming needed to maintain the database.
The first suggests sending updates as increments: if a site adds
or deletes something, only that addition or deletion gets sent,
not the whole thing.
In the interests of keeping the database simple, the whole
database should be maintained in ASCII and be maintainable with
standard UNIX tools. Of course, it would be even better if the
tools needed to maintain this could be found through the
database.

---

That leads to the problem of how to maintain the database.

First, the subject line is used to indicate that this is a
database update message. Such a subject line starts with the
string 'DB:'. This should make it reasonably easy to separate
these entries from the others. The remainder of the subject line
may be used for any additional comments I might wish to add.

The body of the message contains the database update commands.
Commands to add data look like:

@ADD <database>

and the following data is what is to be added. <database> is one
of the strings INFO, SITE, or INDEX. The new data is terminated
by a blank line.

Commands to delete data look like:

@DEL <database> <key>

The key depends on what is being deleted. Deletions from the
information database just use the item name. Deletions from the
site database use the site name. Deletions from the archive index
use the site name, the access method, and the access handle for
the line to be deleted.

There is a special command to delete all index entries for a
site; its form is:

@DELALL INDEX <site>

All of this should be reasonably easy to do. I roughed out a
shell script using sed, join, and comm that would handle this,
though it would be SLOW. However, it would be reasonably easy to
write a simple program that would be MUCH faster.

---

Ok, guys, it's your turn.

---
Bill
{uunet|novavax}!proxftl!twwells!bill
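
As an illustration of the index record format described in this
posting, here is a minimal Python sketch of how a reader might
split one record into its eight fields and build the uucp path
name from the site's base path and the access handle. It is not
part of the proposal. The names IndexRecord, parse_index_record,
and uucp_path are made up for this example, and it assumes that a
semicolon never appears inside a field other than the trailing
comments field.

```python
import posixpath
from collections import namedtuple

# Field order follows the index format given in the posting:
# archive-name;version;site-name;access-type;access-handle;date;tools;comments
# (IndexRecord is a hypothetical name, not something the posting defines.)
IndexRecord = namedtuple(
    "IndexRecord",
    "archive_name version site_name access_type access_handle date tools comments",
)


def parse_index_record(line):
    """Split one index line into its eight fields.

    Assumption: only the last field (comments) may contain a semicolon,
    so the line is split at most seven times.
    """
    fields = line.rstrip("\n").split(";", 7)
    if len(fields) != 8:
        raise ValueError("expected 8 fields, got %d: %r" % (len(fields), line))
    return IndexRecord(*fields)


def uucp_path(base_path, record):
    """Join the site's base path and the record's access handle."""
    return posixpath.join(base_path, record.access_handle)


rec = parse_index_record(
    "unix-pcomm;version 1.1;twwells;*;pcomm.1.shar.Z;1988 Oct 21;compress;part 1"
)
print(uucp_path("/usr/archives", rec))  # prints /usr/archives/pcomm.1.shar.Z
```

Blank archive-name and version fields, as in the bbslist example,
come back as empty strings, so a caller can tell that the file
has no entry in the information database.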