Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!diplodocus.cis.ohio-state.edu!jgreely
From: jgreely@diplodocus.cis.ohio-state.edu (J Greely)
Newsgroups: comp.sys.next
Subject: Re: NeXT's Digital Library
Message-ID: <30085@tut.cis.ohio-state.edu>
Date: 27 Dec 88 21:39:19 GMT
References: <19728@ames.arc.nasa.gov> <5037@phoenix.Princeton.EDU>
Sender: news@tut.cis.ohio-state.edu
Organization: THE Ohio State University, CIS Dept.
Lines: 161

In article <5037@phoenix.Princeton.EDU> levy@Princeton.EDU
 (Silvio Levy) writes:
[in reply to "Mike of the silly return address" stating that he
 found "all 3 entries" of a word in the Librarian]

>Huh?  I'm mystified.  What do you mean by ``all 3 entries''?  Using
>the UNIX utility `grep' I found 17.

Yes, boys and girls, the correct phrase is all *indexed* entries of
a word.  Actually, to be more precise, I should say, "all files for
which a word is indexed", since the indexing is at the file level.
Indexing in general is a very-beta operation, and the current
scheme is listed in the release notes with:

	This set of tools is not supported.  It will change
	between now and the 1.0 release, but it does give a
	flavor of things to come.

Since the indexing library is at the heart of the lookup problems,
simply bear with it until it is replaced by a better scheme.

  Actually, what's there is very nice.  The db library is dbm done
right, and the idea behind pword is excellent (although its current
reliance on modern English is unfortunate; this is one of the major
reasons why the indexing in Shakespeare isn't as good as it could
be).  I have great hopes that db will eventually find its way out
into the world (I'd love to work over everything around here that
relies on dbm, and insert db instead.  This would probably solve
several of our problems with yp).

>You can only search for words, not for strings or phrases.  
>This means if to find out where S. wrote ``To be or not to be'', 
>you'd have to wade through thousands of occurrences of ``to'',
>``be'', ``or'' or ``not''.  But read on.

  This is a combination of things.  Do you really want all
occurrences of "to"?  A quick check shows more than 16000 of
them, scattered across over 6000 files.  Common noise words
are eliminated from the index as a design decision.  As for the
inability to search for a phrase, this is acknowledged as a
limitation in the release notes.
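
  For anyone who hasn't looked at how this sort of thing works, here
is a rough Python sketch of a file-level index with a stop list.  It
is an illustration only, not NeXT's code: the stop list, the file
names, and the build_index function are all made up, and unlike the
real tools it indexes every non-noise word rather than only the
"significant" ones.

```python
import re

# Hypothetical stop list; the real one is not given in the release notes.
STOP_WORDS = {"to", "be", "or", "not", "the", "is", "that", "in"}

def build_index(files, stop_words=STOP_WORDS):
    """Map each non-noise word to the set of file names that contain it."""
    index = {}
    for name, text in files.items():
        # A set, because the index only records *that* a file has the
        # word, not where or how often.
        for word in set(re.findall(r"[a-z']+", text.lower())):
            if word not in stop_words:
                index.setdefault(word, set()).add(name)
    return index

# Made-up file names standing in for the per-scene files.
files = {
    "hamlet/3.1": "To be, or not to be, that is the question: "
                  "Whether 'tis nobler in the mind to suffer",
    "merchant/4.1": "Ready, so please your grace.",
}
index = build_index(files)
```

With this toy data, index["nobler"] is {"hamlet/3.1"}, while "to" and
"be" never make it into the index at all, which is where the "0
found" response comes from.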

  Also, the above statement is not quite true.  You can search for
	<word> ["and"|"or"|"and not" <word> ...]
which, if the words you want are indexed, will narrow the search
for you.  My stock example is locating the line "Ready, so please
your grace" in The Merchant of Venice.  Not a very important line,
but it stuck in my memory from when we performed the play.  The
only word that is indexed is "grace", which occurs 75 times.
The one (reasonable) search that will uniquely locate it is
"merchant and grace" (Merchant of Venice, Act 4, Scene 1, second
line).
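
  A query of that shape amounts to set operations on the lists of
files for each word.  The following is a minimal sketch, assuming
strict left-to-right evaluation (I don't know what precedence the
Library really uses); the search function and the sample index are
invented for the example.

```python
def search(index, query):
    """Evaluate '<word> [and|or|and not <word> ...]' left to right."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    i = 1
    while i < len(tokens):
        op = tokens[i]
        if op == "and" and i + 2 < len(tokens) and tokens[i + 1] == "not":
            op, word, i = "and not", tokens[i + 2], i + 3
        elif op in ("and", "or") and i + 1 < len(tokens):
            word, i = tokens[i + 1], i + 2
        else:
            raise ValueError(f"malformed query: {query!r}")
        hits = index.get(word, set())   # unindexed words match nothing
        if op == "and":
            result &= hits
        elif op == "or":
            result |= hits
        else:                           # "and not"
            result -= hits
    return result

# Made-up index: word -> set of (hypothetical) file names.
index = {
    "grace": {"merchant/4.1", "hamlet/1.2", "richard3/1.1"},
    "merchant": {"merchant/1.1", "merchant/4.1"},
    "hamlet": {"hamlet/1.2", "hamlet/3.1"},
    "nobler": {"hamlet/3.1", "lear/1.1"},
}
```

Here search(index, "merchant and grace") narrows three "grace" files
down to the one in merchant/4.1, and a noise word such as "to" simply
returns the empty set.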

>Apparently very common words cannot be used as search keys at all--
>you get a ``0 found'' response.  This is the case with the four words
>mentioned above.  Together with feature #1, this means that the Digital
>Librarian simply won't locate S.'s most famous quotation.

Correct.  At present, that quote (as well as several others I've
tried) cannot be found from within the Library as is.  However, if
you know any of the surrounding context, you're better off.  I
happen to remember that the line comes from Hamlet, and that the
quote continues with "that is the question. Whether 'tis
nobler...".  Searching for "nobler" will return 19 files, while
"hamlet and nobler" will return the correct section (Hamlet, Act 3,
Scene 1).  From there, a Find on "nobler" will put you at the
correct location in the file.

  Mind you, you'll never find "Now is the winter of our discontent
made glorious summer by this son of York", unless you know that
it's the first two lines of Richard III.  Incidentally, Library
reports this line as "...son of York", while Quotations claims that
it's "...sun of York".  Typo, anyone?

>UNDESIRABLE FEATURE #3:
[indexing stores the first line(s) of the file, rather than the
context of the match]

Agreed.  The context would be more useful, but I don't think this
will change.  The index is built at the file level, so all it knows
is that the word is important enough to be indexed for that file.
If it returned context, it would be the context of the first entry,
and not necessarily the one you want.

>UNDESIRABLE FEATURE #4:
[embedded rtf, rather than something brighter]

This looks like a feature, since low-level encoding requires less
intelligence than full TeX-like macros.  Not having any
documentation on the Microsoft RTF format, I can't say whether it
is capable of more sophisticated (read that, "higher level")
formatting.

>Now for the bugs:
>
>BUG #1:
[bug, feature, same difference]

>Not all occurrences of a word are found -- far from it.

I recommend to you the manual page for "pword".  This will help
clarify how the indexing is currently done.  The object is to index
all *significant* words, based on the surrounding context.  A
document with frequent mention of horses is more likely to have
"horse" indexed than one where it's only mentioned once.  Note that
the documentation for pword is slightly out of date, and will
hopefully be correct by 0.9 (for the correct options, use "pword
-:").

One other problem is picking Shakespeare for this discussion.  The
frequency tables used for the indexing appear to be the Modern
English version, rather than one more appropriate for the work.  In
particular, the stop list does not include noise words like thee,
thy, thou, etc., instead indexing them quite heavily ("thou" is
indexed 315 times, for example).
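
To make the frequency-table point concrete, here is a toy version of
the idea in Python.  This is a guess at the general technique, not
pword's actual algorithm: the table values, the thresholds, and the
significant_words function are all invented.  A word is kept when it
turns up in a file far more often than the background table predicts.

```python
import re
from collections import Counter

# Hypothetical background rates (occurrences per 1000 words of modern
# English).  The numbers are invented for illustration.
MODERN_ENGLISH = {"the": 60.0, "to": 26.0, "horse": 0.1, "grace": 0.05}
DEFAULT_RATE = 0.01   # assumed rate for words missing from the table

def significant_words(text, background=MODERN_ENGLISH,
                      ratio=20.0, min_count=2):
    """Return words used far more often in text than the table predicts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    chosen = set()
    for word, n in counts.items():
        rate = 1000.0 * n / len(words)      # per 1000 words in this file
        expected = background.get(word, DEFAULT_RATE)
        if n >= min_count and rate / expected >= ratio:
            chosen.add(word)
    return chosen

text = "thou art the horse and thou hast the horse to ride"
modern = significant_words(text)
# Same text, but with a table that knows "thou" is a common word.
period = significant_words(text, {**MODERN_ENGLISH, "thou": 50.0})
```

With the modern table, "thou" looks as significant as "horse" because
the table has never heard of it; give the table a period-appropriate
rate for "thou" and only "horse" survives.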

>BUG #2:
>Treatment of plurals, etc. is inconsistent.

Words are "singularized", but no mention is made of the technique
used.  It is quite likely that the method currently used isn't as
bright as one might hope.

Now, to toss in a few of my own (my complete list is a bit too
large to post, so I'll limit myself to a few things you didn't
mention about the Library):

1) A lower-case search string will perform a case-insensitive
   search, while an upper-case character will force an exact match.
   Nice in theory, but it doesn't work.  Searching (in Shakespeare)
   for "Merchant and grace" will return all 75 matches for "grace",
   while "merchant and grace" will return the unique match that I'm
   looking for.

2) There is no way to pull up an arbitrary file into the Library,
   except as the result of a search.  For example, if Act 4, Scene
   1 comes up as the result of a search, I cannot simply proceed to
   Scene 2 if I wish to continue reading.  I can open an Edit
   window containing it, but I can't pull it into the Library
   unless I can match it with a search.  This is the most serious
   limitation of the program for me, and the one I most want to see
   changed by 1.0.  I want to be able to browse through the files
   contained in the current database, without leaving the Library.

3) The target field is shared by the Search and Find buttons, but
   not by the Open button, which instead pulls up a Browser window.
   Better yet, Search understands multiword targets, while Find
   will attempt to match the literal string.  So, I cannot click
   Search, and then expect Find to locate the search string within
   the selected file, unless the search string was a single word.
   The inconsistent use of the target field is confusing.

4) Printing is useless.  An RTF document printed from the Library
   will have no margins, and will be silently clipped on the way
   to the printer.  If you want to print, you currently have to call
   up Edit on the current file.
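
For reference, the case rule from item 1, as I understand the intent,
comes down to something like the following Python sketch.  The
function name is made up, and this shows the intended behavior only;
it says nothing about why the Library gets it wrong.

```python
def word_matches(target, word):
    """Intended rule: any upper-case letter in target forces exact match."""
    if any(c.isupper() for c in target):
        return target == word
    return target == word.lower()
```

So "merchant" should match "Merchant" and "merchant" alike, while
"Merchant" should match only "Merchant".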
-=-
J Greely (jgreely@cis.ohio-state.edu; osu-cis!jgreely)
"Who is it *this* time?"
	"Concert promoters who have gone broke organizing
	 charity benefit concerts.  We call it Aid Aid."