Path: utzoo!utgpu!water!watmath!clyde!bellcore!rutgers!mailrus!uflorida!novavax!proxftl!twwells!bill
From: bill@twwells.uucp (T. William Wells)
Newsgroups: comp.archives
Subject: Comp.archives database format
Message-ID: <116@twwells.uucp>
Date: 24 Oct 88 23:16:58 GMT
Reply-To: bill@twwells.UUCP (T. William Wells)
Organization: None, Ft. Lauderdale
Lines: 374
Approved: bill@twwells.UUCP (T. William Wells)

Contained herein is my first attempt at the structure of the database
that comp.archives postings are intended to feed.  I also describe the
comp.archives postings used to maintain the database.  None of this
is cast in stone, and critiques are welcome.

Here is the example archive site entry from my previous message.
Following it is a line-by-line description.

NM twwells.UUCP
EN bill@twwells.UUCP (T. William Wells) 1988 Oct 21
AD bill@twwells.UUCP (T. William Wells)
MA 781 W. Oakland Pk Blvd #208, Ft. Lauderdale FL 33311
CO uucp:uucp:~:twwells Any1800-0800 ACU 2400 13059876543 in:-\r-in: arcuucp
DE This is where comp.archives gets moderated from. I maintain the
DE most up-to-date version of the databases, so if you want
DE them you have to get them directly from me.

NM twwells.UUCP

      This is the site name.

EN bill@twwells.UUCP (T. William Wells) 1988 Oct 21

      This is the person responsible for the entry and the date on
      which the entry was added or updated.

AD bill@twwells.UUCP (T. William Wells)

      This is the person who is responsible for the archive.  He
      may or may not be the uucp, news, or system administrator.
      There can be more than one of these.

MA 781 W. Oakland Pk Blvd #208, Ft. Lauderdale FL 33311

      The mailing address for help or information.  Don't include
      this unless you want snail-mail.  People who mail to this
      address had better include a SASE or e-mail address or forget
      about getting any response.

CO uucp:uucp:~:twwells Any1800-0800 ACU 2400 13059876543 in:-\r-in: arcuucp

      This contains the information needed to access the archive.
      There can be several of these, depending on how many ways
      your site can be accessed.

      Each line starts with a tag that identifies the access method.
      This is used when not all of your archived information is
      available through all paths to your site.  For example, you
      might have a mail based server for small items but require a
      direct link for larger things.  Each item that you list as
      available through your archive has a tag that is used to
      indicate which way it can be accessed.

      There may be more than one line for a single tag. This would
      mean that there is more than one way to get to the same set
      of information.

      The next field describes the access method.  This would be
      something like "uucp", or "ftp", or "mail", or whatever.

      The remaining fields depend on the access method.  Since I am
      only familiar with uucp, I am only going to describe the
      fields for it.  I definitely want input on what is necessary
      for other access methods.

      There are two fields for uucp access.  The first is the path
      name which archive file names are relative to. The second is
      an L.sys entry that would be used to access your site.
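
      A minimal sketch, in Python, of how a uucp CO line might be
      taken apart.  It assumes the fields are colon-separated, which
      is inferred from the example line rather than stated in the
      description; the names UucpAccess and parse_co_uucp are
      hypothetical.

```python
from typing import NamedTuple


class UucpAccess(NamedTuple):
    tag: str        # label that index entries use to refer to this access path
    method: str     # access method, here always "uucp"
    base_path: str  # path name that archive file names are relative to
    lsys: str       # L.sys entry that would be used to call the site


def parse_co_uucp(line: str) -> UucpAccess:
    """Split a 'CO' line for uucp access into its four parts."""
    if not line.startswith("CO "):
        raise ValueError("not a CO line")
    # Assumption: fields are colon-separated. The L.sys entry itself
    # contains colons (in its chat script), so split at most three times.
    parts = line[3:].split(":", 3)
    if len(parts) != 4:
        raise ValueError("expected tag:method:path:L.sys-entry")
    tag, method, base_path, lsys = parts
    if method != "uucp":
        raise ValueError("only the uucp layout is described so far")
    return UucpAccess(tag, method, base_path, lsys)


entry = parse_co_uucp(
    r"CO uucp:uucp:~:twwells Any1800-0800 ACU 2400 13059876543 in:-\r-in: arcuucp"
)
print(entry.tag, entry.base_path)
print(entry.lsys)
```

      Limiting the number of splits keeps the L.sys entry in one
      piece; other access methods would need their own field layout.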

DE This is where comp.archives gets moderated from. I maintain the
DE most up-to-date version of the databases, so if you want
DE them you have to get them directly from me.

      This is a short description of your site. You might also
      include any special information about your archives; for
      example, if you are willing to make tapes you would say so
      here.
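
      One possible way to read entries like the one above is
      sketched below in Python.  It assumes that every line is a
      two-letter code, a space, and a value, and that entries are
      separated by blank lines; parse_records is a hypothetical
      name.

```python
from collections import defaultdict


def parse_records(text):
    """Parse blank-line-separated records made of 'XX value' lines.

    Returns a list of dicts mapping each two-letter code to the list of
    values seen for it, in order. Repeated codes (several AD, CO, or DE
    lines, for instance) simply accumulate.
    """
    records = []
    current = defaultdict(list)
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            # A blank line closes the record being collected, if any.
            if current:
                records.append(dict(current))
                current = defaultdict(list)
            continue
        code, _, value = line.partition(" ")
        current[code].append(value)
    if current:
        records.append(dict(current))
    return records


sample = """\
NM twwells.UUCP
EN bill@twwells.UUCP (T. William Wells) 1988 Oct 21
AD bill@twwells.UUCP (T. William Wells)
DE This is where comp.archives gets moderated from. I maintain the
DE most up-to-date version of the databases, so if you want
DE them you have to get them directly from me.
"""

site = parse_records(sample)[0]
print(site["NM"][0])
print(" ".join(site["DE"]))
```

      Keeping a list per code means that multi-line descriptions and
      multiple administrators need no special handling.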

---

Here is a sample entry for the archived information database.  Note
that I made this up from a cursory examination of Pcomm; don't take
it as gospel.

NM unix-pcomm
VR version 1.1
AU egray@fthood.UUCP (Emmet P. Gray)
MA egray@fthood.UUCP (Emmet P. Gray)
EN bill@twwells.UUCP (T. William Wells) 1988 Oct 21
TT public domain version of ProComm (TM)
KW all-source,public-domain,datacomm
SY any:modem,sysv-unix:termcaps,install
DE Pcomm is a public domain telecommunication program for Unix that
DE is designed to operate similar to the MSDOS program, ProComm.
DE ProComm (TM) is copyrighted by Datastorm Technologies, Inc.  This
DE is a completely new program and contains no ProComm source code.
DE This is not a Datastorm product.

Here is a line-by-line description:

NM unix-pcomm

      The name of the item.  If the item is a program that ports to
      one environment, the name is that environment hyphenated with
      the program name; otherwise it is just the name.  Note that
      this is not intended to be useful by itself, e.g., unix-pcomm
      might eventually also refer to something that has been made
      to work under VMS.  Should there be two items with the same
      name, the later item will have its author's name appended.
      For example, should John Turkey later write a pcomm for
      UNIX, it would be called unix-pcomm-turkey.
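
      The naming rule can be sketched in Python as follows.  The
      function item_name and its parameters are hypothetical, and
      lowercasing the author's name is an assumption drawn from the
      unix-pcomm-turkey example.

```python
def item_name(program, environment=None, author=None, taken=()):
    """Build an item name: environment-program, or just program.

    If the name is already in `taken`, the author's name is appended.
    """
    name = f"{environment}-{program}" if environment else program
    if name in taken:
        if not author:
            raise ValueError("name clash and no author name to append")
        # Assumption: the appended author name is lowercased.
        name = f"{name}-{author.lower()}"
    return name


print(item_name("pcomm", "unix"))
print(item_name("pcomm", "unix", author="Turkey", taken={"unix-pcomm"}))
```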

VR version 1.1

      Some kind of version stamp.  If the item does not have
      versions, this is the date released or published, or
      something else indicating when the item came into existence.

AU egray@fthood.UUCP (Emmet P. Gray)

      This is the person or persons who wrote the thing.  If there
      is more than one author, use more than one line.

MA egray@fthood.UUCP (Emmet P. Gray)

      This is who is maintaining the item.  If the item is not
      being maintained, don't add this line.  If several people are
      maintaining it, use several lines.  Note that anyone whose
      name is on one of these lines can expect e-mail about the
      item.

EN bill@twwells.UUCP (T. William Wells) 1988 Oct 21

      This is the person responsible for the entry and the date on
      which the entry was added or updated.

TT public domain version of ProComm (TM)

      A title for the item.

KW all-source,public-domain,datacomm

      Keywords describing the item.  Note the `all-source' keyword,
      which means that all the source (other than that of the tools
      mentioned below) needed is included.  Note also the
      public-domain keyword, which indicates that the item is in
      the public domain.

SY any:modem,sysv-unix:termcaps,install

      For each system this item runs on (or must be used on),
      there should be one of these lines.  The fields are:

      1) The hardware it runs on.  If it runs on any hardware
	 which a particular OS runs on, the entry is `any'.
	 Required additional hardware is indicated by
	 :<hardware>.

      2) The OS it runs under.  There are several generic names
	 like the `sysv-unix' above.  Optional OS things which are
	 needed are indicated the same way hardware options are.
	 Also, software which is not listed in this directory which
	 is needed to make this go is listed here.  Multiple
	 entries are separated by semicolons.  For example, if this
	 is a Dbase-II program, you'd have MS-DOS;Dbase-II in this
	 field.

      3) How much effort is needed to make it go.  If following the
	 directions is sufficient, the entry is `install'.

      4) This entry contains any tools, not normally available on
	 your system, which one must have in order to build or use
	 this item. All items which are in this section must also
	 have their own entries in the information directory.

      There may be more than one of these lines, whenever necessary.
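
      A rough Python sketch of splitting an SY line into the four
      fields described above.  The function parse_sy is
      hypothetical, and it assumes that multiple tools are separated
      by semicolons in the same way as multiple OS entries, which
      the description does not state.

```python
def parse_sy(line):
    """Split an 'SY' line into hardware, OS, effort, and tools."""
    if not line.startswith("SY "):
        raise ValueError("not an SY line")
    fields = line[3:].split(",")
    # Pad so that a missing trailing field (such as an empty tools
    # list) comes back as an empty string.
    fields += [""] * (4 - len(fields))
    hardware, system, effort, tools = fields[:4]

    def with_options(field):
        # "any:modem" -> [("any", ["modem"])]
        # "MS-DOS;Dbase-II" -> [("MS-DOS", []), ("Dbase-II", [])]
        entries = []
        for item in field.split(";"):
            if item:
                name, *options = item.split(":")
                entries.append((name, options))
        return entries

    return {
        "hardware": with_options(hardware),
        "os": with_options(system),
        "effort": effort,
        # Assumption: tools are separated by semicolons.
        "tools": [t for t in tools.split(";") if t],
    }


info = parse_sy("SY any:modem,sysv-unix:termcaps,install")
print(info["hardware"], info["os"], info["effort"], info["tools"])
```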

DE Pcomm is a public domain telecommunication program for Unix that
DE is designed to operate similar to the MSDOS program, ProComm.
DE ProComm (TM) is copyrighted by Datastorm Technologies, Inc.  This
DE is a completely new program and contains no ProComm source code.
DE This is not a Datastorm product.

      This is a short description of the item. This should be kept
      brief; putting the man page here is probably not appropriate.

Here is another entry that would go in the information database.

NM free-distribution-database
VR updated continuously
AU bill@twwells.UUCP (T. William Wells)
MA bill@twwells.UUCP (T. William Wells)
EN bill@twwells.UUCP (T. William Wells) 1988 Sep 26
TT Database of freely distributable, electronically accessible information.
KW database,public-domain
SY any,any,install
DE This database is constructed from the information that passes
DE through comp.archives.  It contains information on any software,
DE databases, documents, or what-have-you, that is both freely
DE distributable and available electronically.  "Freely
DE distributable" means that, if you have a copy of the item, you
DE can (at least) make exact copies and give them away, and you
DE don't have to tell the owner of the item (if any) that you have
DE done so.  "Electronically available" means that it is either
DE accessible through a publicly accessible network, or is available
DE by a means that does not involve paying a fee to the
DE distributor.  This information is provided as a free service and
DE there is *no one* guaranteeing that any of it is accurate or
DE useful.  Use it at your own risk.

---

Here is the meat of the database: the index of things available from
each archive site.  This is the format:

archive-name;version;site-name;access-type;access-handle;date;tools;comments

      `Archive-name' and `version' match entries in the main
      database.  If this file is not in the database, leave the
      fields blank.  Note that this means that you can list
      archived things that are not in the directory; however,
      this practice is discouraged.

      `Site-name' is the name of the site, as recorded in the site
      database.

      `Access-type' is one of the access tags specified in the site
      entry.  Note that this is in the style of UNIX file names:
      wild cards are permitted.

      `Access-handle' is used with the information from the site
      entry to construct the request from the archive.  For
      example, using uucp, if the site entry contained
      /usr/archives as the path to which file names are relative,
      and this field contains foobar.shar, then the path name you
      should use to get this item is /usr/archives/foobar.shar.

      `Date' is the date on which this entry was added to the
      database.

      `Tools' is a list of programs needed to unarchive the file;
      each must be a name in the info database.  Standard system
      utilities are not listed.

      `Comments' is anything useful to add.
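
      The eight fields can be split apart and the uucp path name
      built as in the following Python sketch.  The names
      IndexEntry, parse_index_line, and uucp_path are hypothetical,
      and the record shown is made up around the foobar.shar
      example.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    archive_name: str
    version: str
    site_name: str
    access_type: str
    access_handle: str
    date: str
    tools: str
    comments: str


def parse_index_line(line):
    """Split one index record into its eight semicolon-separated fields."""
    # Split at most seven times so a semicolon inside the comments survives.
    parts = line.rstrip("\n").split(";", 7)
    if len(parts) != 8:
        raise ValueError("expected 8 fields, got %d" % len(parts))
    return IndexEntry(*parts)


def uucp_path(base_path, entry):
    """Join the site's base path and the access handle."""
    return base_path.rstrip("/") + "/" + entry.access_handle


# Made-up record; /usr/archives stands in for the path from the site entry.
record = parse_index_line(
    "unix-pcomm;version 1.1;twwells;uucp;foobar.shar;1988 Oct 21;compress;example"
)
print(uucp_path("/usr/archives", record))
```

      Blank archive-name and version fields come back as empty
      strings, so entries without an information record need no
      special case.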

For example, suppose I have pcomm sitting around in my directories.
I could have these records:

unix-pcomm;version 1.1;twwells;*;pcomm.1.shar.Z;1988 Oct 21;compress;part 1
unix-pcomm;version 1.1;twwells;*;pcomm.2.shar.Z;1988 Oct 21;compress;part 2
unix-pcomm;version 1.1;twwells;*;pcomm.3.shar.Z;1988 Oct 21;compress;part 3
unix-pcomm;version 1.1;twwells;*;pcomm.4.shar.Z;1988 Oct 21;compress;part 4
unix-pcomm;version 1.1;twwells;*;pcomm.5.shar.Z;1988 Oct 21;compress;part 5
unix-pcomm;version 1.1;twwells;*;pcomm.6.shar.Z;1988 Oct 21;compress;part 6
unix-pcomm;version 1.1;twwells;*;pcomm.7.shar.Z;1988 Oct 21;compress;part 7
unix-pcomm;version 1.1;twwells;*;pcomm.8.shar.Z;1988 Oct 21;compress;part 8
unix-pcomm;version 1.1;twwells;*;pcomm.p1.shar.Z;1988 Oct 21;compress;patch 1
unix-pcomm;version 1.1;twwells;*;pcomm.p2.shar.Z;1988 Oct 21;compress;patch 2
unix-pcomm;version 1.1;twwells;*;pcomm.p3.shar.Z;1988 Oct 21;compress;patch 3

This says that

    various pieces of unix-pcomm, version 1.1 are available from my site
    they can be accessed through any way that my site can be accessed
    the various pieces of it can be accessed with names beginning with pcomm
    the entries were added on October 21, 1988
    you need compress to unarchive any of it
    parts 1-8 and patches 1-3 are available
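
The wild-card matching of access tags could be done with Python's
fnmatch module, as in this sketch.  The function usable_tags and the
list of site tags are hypothetical; only the `*' pattern appears in
the records above.

```python
from fnmatch import fnmatchcase


def usable_tags(access_type, site_tags):
    """Return the site's access tags selected by the index pattern."""
    return [tag for tag in site_tags if fnmatchcase(tag, access_type)]


# Hypothetical site offering three access paths.
tags = ["uucp", "uucp-night", "mail"]
print(usable_tags("*", tags))       # every tag
print(usable_tags("uucp*", tags))   # only the two uucp tags
print(usable_tags("mail", tags))    # exact match only
```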

Now, suppose that I had a list of local BBS's that I was willing
to make available. It would have an entry like:

;;twwells;*;bbslist;2001 Jan 1;;bbs systems in south Florida

This says that the file bbslist is available but that it has no entry
in the information database.

---

That leaves the problem of how to distribute this database. Here
are my goals:

      1) To minimize the amount of information retransmitted
	 through the newsgroup. In an ideal world, the data would
	 get transmitted once, and everyone would thereafter query
	 archive sites for current copies.

      2) To minimize the delay in getting the information out.
	 This means avoiding batching the data; it would not be
	 very nice to hold some archive information just because no
	 one else was posting at that time.

      3) To minimize the pain of maintaining a database from the
	 information which flows through comp.archives.

The first one is the stickiest problem. If I never retransmitted any
data, sites which want to start a database would have to find someone
who was willing to let them have a copy of the database.  Where would
they find this information?  This means that I need to, at least,
periodically post a minimal database of sites that are archives for
the database.

Now, how do I best serve the needs of the guy who just has one thing
he is looking for? If I send the data just once, he is unlikely to
see it. The alternative is to send it periodically, with reasonably
long expiration dates, so that he can look on his system.

Anyway, for now, I will do the latter; if the volume gets too high,
then I'll look into some other method.

The second item means posting the information as soon as it comes in
and has been verified.  The main drawback to this is that sometimes
the information is incorrectly sent. Putting a delay in the system
results in much of this error being corrected before it gets out.  My
own feeling is to make updating the system reasonably painless, so
that if errors like this occur, they can be fixed reasonably easily.

The third item requires minimizing the information transmitted which
is used to update the database (a worthy goal of its own) and
minimizing the programming needed to maintain the database.  The
first suggests sending updates as increments: if a site adds or
deletes something, only that addition or deletion gets sent, not the
whole thing. In the interests of keeping the database simple, the
whole database should be maintained in ASCII and be maintainable with
standard UNIX tools.  Of course, it would be even better if the tools
needed to maintain this could be found through the database.

---

That leads to the problem of how to maintain the database.  First,
the subject line is used to indicate that this is a database update
message.  Such a subject line starts with the string 'DB:'.  This
should make it easy to separate these postings from the others.  The
remainder of the subject line may be used for any additional comments
I might wish to add.

The body of the message contains the database update commands.

Commands to add data look like:

      @ADD <database>

and the following data is what is to be added.  <database> is one of
the strings INFO, SITE, or INDEX. The new data is terminated by a
blank line.

Commands to delete data look like:

      @DEL <database> <key>

The key depends on what is being deleted. Deletions from the
information database just use the item name. Deletions from the site
database use the site name. Deletions from the archive index use the
site name, the access type, and the access handle for the line to be
deleted.

There is a special command to delete all index entries for a site;
its form is:

      @DELALL INDEX <site>
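
As a rough illustration, the three commands could be applied to an
in-memory copy of the databases along the lines of the Python sketch
below.  Everything not stated above is an assumption: the function
apply_updates, the layout of db, taking the record key from the NM
line, and writing the index key as three space-separated words.

```python
def apply_updates(body, db):
    """Apply @ADD, @DEL and @DELALL commands from a message body.

    db has keys "INFO" and "SITE" (each a dict of name -> list of
    lines) and "INDEX" (a list of semicolon-separated lines).
    """
    lines = iter(body.splitlines())
    for line in lines:
        words = line.split()
        if len(words) < 2 or not words[0].startswith("@"):
            continue
        command, database = words[0], words[1]
        if command == "@ADD":
            # The new data runs up to the next blank line.
            data = []
            for item in lines:
                if not item.strip():
                    break
                data.append(item)
            if database == "INDEX":
                db["INDEX"].extend(data)
            else:
                # Assumption: the NM line supplies the record's key.
                name = next(d[3:] for d in data if d.startswith("NM "))
                db[database][name] = data
        elif command == "@DEL" and database == "INDEX":
            # Assumption: key is "site access-type handle", space-separated.
            key = words[2:5]
            db["INDEX"] = [r for r in db["INDEX"] if r.split(";")[2:5] != key]
        elif command == "@DEL":
            db[database].pop(words[2], None)
        elif command == "@DELALL" and database == "INDEX":
            site = words[2]
            db["INDEX"] = [r for r in db["INDEX"] if r.split(";")[2] != site]


db = {"INFO": {}, "SITE": {}, "INDEX": []}
message = """\
@ADD INDEX
unix-pcomm;version 1.1;twwells;*;pcomm.1.shar.Z;1988 Oct 21;compress;part 1
;;twwells;*;bbslist;2001 Jan 1;;bbs systems in south Florida

@DEL INDEX twwells * bbslist
"""
apply_updates(message, db)
print(db["INDEX"])
```

A real tool would also have to decide what to do with malformed
commands and with an @ADD whose key already exists; this sketch
simply overwrites.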

All of this should be reasonably easy to do.  I roughed out a shell
script using sed, join, and comm that would handle it, though it
would be SLOW.  However, it would be reasonably easy to write a
simple program that would be MUCH faster.

---

Ok, guys, it's your turn.

---
Bill
{uunet|novavax}!proxftl!twwells!bill