Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!diplodocus.cis.ohio-state.edu!jgreely
From: jgreely@diplodocus.cis.ohio-state.edu (J Greely)
Newsgroups: comp.sys.next
Subject: Re: NeXT's Digital Library
Message-ID: <30085@tut.cis.ohio-state.edu>
Date: 27 Dec 88 21:39:19 GMT
References: <19728@ames.arc.nasa.gov> <5037@phoenix.Princeton.EDU>
Sender: news@tut.cis.ohio-state.edu
Organization: THE Ohio State University, CIS Dept.
Lines: 161

In article <5037@phoenix.Princeton.EDU> levy@Princeton.EDU (Silvio Levy) writes:

[in reply to "Mike of the silly return address" stating that he found
 "all 3 entries" of a word in the Librarian]

>Huh? I'm mystified. What do you mean by ``all 3 entries''? Using
>the UNIX utility `grep' I found 17.

Yes, boys and girls, the correct phrase is all *indexed* entries of a
word. Actually, to be more precise, I should say, "all files for which
a word is indexed", since the indexing is at the file level. Indexing
in general is a very-beta operation, and the current scheme is listed
in the release notes with:

    This set of tools is not supported. It will change between now and
    the 1.0 release, but it does give a flavor of things to come.

Since the indexing library is at the heart of the lookup problems,
simply bear with it until it is replaced by a better scheme.

Actually, what's there is very nice. The db library is dbm done right,
and the idea behind pword is excellent (although its current reliance
on modern English is unfortunate; this is one of the major reasons why
the indexing in Shakespeare isn't as good as it could be). I have
great hopes that db will eventually find its way out into the world
(I'd love to work over everything around here that relies on dbm, and
insert db instead. This would probably solve several of our problems
with yp).

>You can only search for words, not for strings or phrases.
>This means if to find out where S.
>wrote ``To be or not to be'',
>you'd have to wade through thousands of occurrences of ``to'',
>``be'', ``or'' or ``not''. But read on.

This is a combination of things. Do you really want all occurrences
of "to"? Quick check shows there to be more than 16000 of them,
scattered throughout over 6000 files. Common noise words are
eliminated from the index as a design decision. As for the inability
to search for a phrase, this is acknowledged as a limitation in the
release notes.

Also, the above statement is not quite true. You can search for
["and"|"or"|"and not" ...] which, if the words you want are indexed,
will narrow the search for you. My stock example is locating the line
"Ready, so please your grace" in The Merchant of Venice. Not a very
important line, but it stuck in my memory from when we performed the
play. The only word that is indexed is "grace", which occurs 75
times. The one (reasonable) search that will uniquely locate it is
"merchant and grace" (Merchant of Venice, Act 4, Scene 1, second
line).

>Apparently very common words cannot be used as search keys at all--
>you get a ``0 found'' response. This is the case with the four words
>mentioned above. Together with feature #1, this means that the Digital
>Librarian simply won't locate S.'s most famous quotation.

Correct. At present, that quote (as well as several others I've
tried) cannot be found from within the Library as is. However, if you
know any of the surrounding context, you're better off. I happen to
remember that the line comes from Hamlet, and that the quote continues
with "that is the question. Whether 'tis nobler...". Searching for
"nobler" will return 19 files, while "hamlet and nobler" will return
the correct section (Hamlet, Act 3, Scene 1). From there, a Find on
"nobler" will put you at the correct location in the file.
Mind you, you'll never find "Now is the winter of our discontent made
glorious summer by this son of York", unless you know that it's the
first two lines of Richard III. Incidentally, Library reports this
line as "...son of York", while Quotations claims that it's "...sun of
York". Typo, anyone?

>UNDESIRABLE FEATURE #3:
[indexing stores the first line(s) of the file, rather than the
 context of the match]

Agreed. The context would be more useful, but I don't think this will
change. The index is built at the file level, so all it knows is that
the word is important enough to be indexed for that file. If it
returned context, it would be the context of the first entry, and not
necessarily the one you want.

>UNDESIRABLE FEATURE #4:
[embedded rtf, rather than something brighter]

This looks like a feature, since low-level encoding requires less
intelligence than full TeX-like macros. Not having any documentation
on the Microsoft RTF format, I can't say whether it is capable of more
sophisticated (read that, "higher level") formatting.

>Now for the bugs:
>
>BUG #1:
[bug, feature, same difference]
>Not all occurrences of a word are found -- far from it.

I recommend to you the manual page for "pword". This will help
clarify how the indexing is currently done. The object is to index
all *significant* words, based on the surrounding context. A document
with frequent mention of horses is more likely to have "horse" indexed
than one where it's only mentioned once. Note that the documentation
for pword is slightly out of date, and will hopefully be correct by
0.9 (for the correct options, use "pword -:").

One other problem is picking Shakespeare for this discussion. The
frequency tables used for the indexing appear to be the Modern English
version, rather than one more appropriate for the work. In
particular, the stop list does not include noise words like thee, thy,
thou, etc., instead indexing them quite heavily ("thou" is indexed 315
times, for example).

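To make the file-level scheme concrete, here is a toy sketch in
Python. It is not pword, the db library, or the Librarian: STOP_WORDS,
build_index, search, and the sample FILES are all invented for
illustration, and the frequency-based choice of *significant* words is
left out entirely (every non-stop word gets indexed). It only shows
why a stop-listed word comes back "0 found" and how "and" narrows a
search down to one file.

```python
import re
from collections import defaultdict

# Assumed stop list for this toy; the real list is not given in the post.
STOP_WORDS = {"to", "be", "or", "not", "and", "the", "of", "is"}

# Invented sample "files"; the third entry is filler, not real text.
FILES = {
    "merchant.4.1": "The Merchant of Venice. Ready, so please your grace.",
    "hamlet.3.1": "Hamlet. To be, or not to be. Whether 'tis nobler",
    "filler.1.1": "Filler text that also mentions grace and nobler.",
}


def build_index(files):
    """Map each non-stop word to the set of file names that contain it."""
    index = defaultdict(set)
    for name, text in files.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(name)
    return index


def search(index, query):
    """Evaluate 'w1 and w2', 'w1 or w2', 'w1 and not w2', left to right."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    # Copy the first hit set so later operators never mutate the index.
    result = set(index.get(tokens[0], ()))
    i = 1
    while i < len(tokens):
        op = tokens[i]
        i += 1
        if op == "and" and i < len(tokens) and tokens[i] == "not":
            op = "and not"
            i += 1
        if op not in ("and", "or", "and not") or i >= len(tokens):
            raise ValueError(f"malformed query: {query!r}")
        hits = index.get(tokens[i], set())
        i += 1
        if op == "and":
            result &= hits
        elif op == "or":
            result |= hits
        else:
            result -= hits
    return result


if __name__ == "__main__":
    idx = build_index(FILES)
    for q in ("to", "grace", "merchant and grace", "hamlet and nobler"):
        print(q, "->", sorted(search(idx, q)))
```

Because the index only records which files contain a word, the answer
is a set of file names; locating the word inside a file is still a
separate Find.
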
>BUG #2:
>Treatment of plurals, etc. is inconsistent.

Words are "singularized", but no mention is made of the technique
used. It is quite likely that the method currently used isn't as
bright as one might hope.

Now, to toss in a few of my own (my complete list is a bit too large
to post, so I'll limit myself to a few things you didn't mention about
the Library):

1) A lower-case search string will perform a case-insensitive search,
   while an upper-case character will force an exact match. Nice in
   theory, but it doesn't work. Searching (in Shakespeare) for
   "Merchant and grace" will return all 75 matches for "grace", while
   "merchant and grace" will return the unique match that I'm looking
   for.

2) There is no way to pull up an arbitrary file into the Library,
   except as the result of a search. For example, if Act 4, Scene 1
   comes up as the result of a search, I cannot simply proceed to
   Scene 2 if I wish to continue reading. I can open an Edit window
   containing it, but I can't pull it into the Library unless I can
   match it with a search. This is the most serious limitation of the
   program for me, and the one I most want to see changed by 1.0. I
   want to be able to browse through the files contained in the
   current database, without leaving the Library.

3) The target field is shared by the Search and Find buttons, but not
   by the Open button, which instead pulls up a Browser window.
   Better yet, Search understands multiword targets, while Find will
   attempt to match the literal string. So, I cannot click Search,
   and then expect Find to locate the search string within the
   selected file, unless the search string was a single word. The
   inconsistent use of the target field is confusing.

4) Printing is useless. An RTF document printed from the Library will
   have no margins, and will be silently clipped on the way to the
   printer. If you want to print, you currently have to call up Edit
   on the current file.

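Going back to item 1 in the list above, the intended case rule as I
read it can be written down in a few lines. This is only a sketch of
that reading, not the Library's code; term_matches is a made-up name.

```python
def term_matches(term, word):
    """Return True if a search term matches a word under the assumed rule."""
    # Assumed rule: no upper-case letters in the term -> ignore case.
    if term == term.lower():
        return term == word.lower()
    # Any upper-case letter in the term -> exact match only.
    return term == word
```

Under this rule "merchant" would match both "Merchant" and "merchant",
while "Merchant" would match only "Merchant".
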
-=-
J Greely (jgreely@cis.ohio-state.edu; osu-cis!jgreely)
"Who is it *this* time?"
"Concert promoters who have gone broke organizing charity benefit
 concerts. We call it Aid Aid."