Newsgroups: comp.archives
Path: utzoo!utgpu!!!emv
From: (Hans van Halteren)
Subject: [comp.newprod] Database for/with (syntactic analysis) trees
Message-ID: <>
Followup-To: poster
Sender: (Edward Vielmetti)
Reply-To: (Hans van Halteren)
Organization: University of Nijmegen, The Netherlands
References: <>
Date: Thu, 7 Mar 1991 07:26:35 GMT
Approved: (Edward Vielmetti)
X-Original-Newsgroups: comp.newprod

Archive-name: text/syntax/ldb/1991-03-05
Archive-directory: []
Original-posting-by: (Hans van Halteren)
Original-subject: Database for/with (syntactic analysis) trees
Reposted-by: (Edward Vielmetti)

- The book "Linguistic Exploitation of Syntactic Databases", about the
  use of the Linguistic DataBase, is now available (270pp., Hfl. 70).
- New also is a freely copyable demo version for MSDOS.

See below for details and for a general introduction to the LDB.

=====

The Linguistic DataBase (LDB)

The LDB is a database system developed by the TOSCA group at Nijmegen
University which allows linguists who are not experts in computing to
access syntactically analyzed corpora.

The data in the database comprises `syntactic analysis trees' of the
contiguous utterances in a natural-language text. Since these trees are
built from a continuous text, they give a good representation of actual
language use and can thus provide a testing ground for linguistic
hypotheses. The range of information that can be extracted from such a
database depends mainly on the degree to which the text has been
prepared. Formerly, studies of corpora were restricted to the level of
words or word-classes, but with the Linguistic DataBase it becomes
possible to extend these studies to the level of syntax, so that larger
constituents can be analyzed.

Unlike currently available database packages, the LDB has been created
specifically to handle the type of data linguists need to analyze - a
labelled tree structure with a variable number of branches at each node
and the possibility of recursion.

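To make the data model concrete, here is a minimal Python sketch of a
labelled tree with a variable number of branches per node and recursive
nesting. It is only an illustration: the LDB's own storage format is not
described in this posting, and the class name Node, its fields, and the
label values ("UTT", "SU", "NP", etc.) are all assumptions made for the
example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    """One constituent in an analysis tree (hypothetical layout, not the LDB's)."""
    function: str                      # illustrative label, e.g. "SU" for subject
    category: str                      # illustrative label, e.g. "NP" for noun phrase
    word: Optional[str] = None         # set only on leaf nodes
    children: List["Node"] = field(default_factory=list)  # any number of branches

    def walk(self) -> Iterator["Node"]:
        """Yield this node and all its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

    def depth(self) -> int:
        """Number of levels in the subtree rooted at this node."""
        return 1 + max((child.depth() for child in self.children), default=0)

    def words(self) -> List[str]:
        """The words at the leaves, in left-to-right order."""
        return [node.word for node in self.walk() if node.word is not None]


def example_tree() -> Node:
    """Build a small tree for the utterance 'the dog barks' (labels are made up)."""
    return Node("UTT", "S", children=[
        Node("SU", "NP", children=[
            Node("DT", "ART", word="the"),
            Node("HD", "N", word="dog"),
        ]),
        Node("V", "VP", children=[
            Node("MV", "V", word="barks"),
        ]),
    ])


if __name__ == "__main__":
    tree = example_tree()
    print(tree.words())
    print(tree.depth())
```

Because children holds further Node objects, a constituent can contain
constituents of its own category to any depth, which is the kind of
recursion a fixed-column record layout handles poorly.
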
The LDB can be used to examine the trees on the terminal screen, search
for utterances with given properties, and handle database-wide queries
about constructs in the utterances.

The LDB does not presume special graphics hardware. For this reason it
has been implemented for common machines (VAX and IBM PC/AT) and common
terminals (VT100, ADM3, etc.). Where possible, special terminal features
are used, such as highlighting and graphics characters, but even on the
so-called `dumb' ADM3A the trees are represented by an acceptable
imitation of graphics. Terminal types not already provided for can be
easily installed by the user.

The LDB also does not presume a computationally expert user. Control of
the program is therefore designed to be simple and clear. The overall
control is handled by a menu system, which displays short descriptions
of the choices, each of which can be activated by a single keystroke. In
the Tree Viewer, which is used to examine an analysis tree on the
terminal screen, there is not enough space left on the screen to show
these descriptions, so commands (mostly of one keystroke) are listed in
abbreviated form. A description of all commands is nevertheless
available through a `help' command.

For queries going beyond a single tree, the Exploration Scheme formalism
has been developed. An Exploration Scheme consists of a search pattern,
itself a tree much like the analysis trees, and a specification of the
operations to be performed on the information the pattern discovers.
Exploration Schemes serve a range of purposes, from a simple search for
a tree, in order to examine it with the Tree Viewer, to the creation of
frequency tables. The formalism is designed in such a way that the
novice can start exploring immediately. From there, he can gradually
expand his knowledge to the more complex features. To make it easier to
formulate Exploration Schemes, the LDB has a special scheme editor.

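The idea of an Exploration Scheme (a tree-shaped search pattern plus an
operation on what it finds) can be sketched in Python as follows. This
is not the LDB's formalism, whose syntax is not given in this posting.
The names Node, Pattern, matches and explore, the wildcard convention,
the simplified matching rule, and all labels are assumptions made for
illustration only.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Node:
    """A constituent with two labels (hypothetical layout)."""
    function: str
    category: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Pattern:
    """A search pattern shaped like a tree; None is a wildcard label."""
    function: Optional[str] = None
    category: Optional[str] = None
    children: List["Pattern"] = field(default_factory=list)


def matches(pattern: Pattern, node: Node) -> bool:
    """True if the labels agree and every pattern child matches some child.

    Simplification: order and distinctness of the matched children are
    not checked.
    """
    if pattern.function is not None and pattern.function != node.function:
        return False
    if pattern.category is not None and pattern.category != node.category:
        return False
    return all(
        any(matches(p_child, n_child) for n_child in node.children)
        for p_child in pattern.children
    )


def walk(node: Node) -> Iterator[Node]:
    """Yield a node and all its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def explore(corpus: Iterable[Node], pattern: Pattern,
            key: Callable[[Node], str]) -> Counter:
    """Build a frequency table of key(node) over every matching node."""
    table: Counter = Counter()
    for tree in corpus:
        for node in walk(tree):
            if matches(pattern, node):
                table[key(node)] += 1
    return table


# A two-utterance toy corpus; every label here is made up.
corpus = [
    Node("UTT", "S", [
        Node("SU", "NP", [Node("DT", "ART"), Node("HD", "N")]),
        Node("V", "VP", [Node("MV", "V"), Node("OD", "NP", [Node("HD", "N")])]),
    ]),
    Node("UTT", "S", [
        Node("SU", "NP", [Node("DT", "ART"), Node("PREMOD", "ADJ"), Node("HD", "N")]),
        Node("V", "VP", [Node("MV", "V")]),
    ]),
]

# Pattern: any noun phrase that has a noun among its children.
pattern = Pattern(category="NP", children=[Pattern(category="N")])

# Operation: count the matching noun phrases by their function label.
table = explore(corpus, pattern, key=lambda node: node.function)
print(dict(table))
```

Keeping the pattern and the operation as separate arguments mirrors the
two parts of a scheme described above: the same pattern can feed a
simple search or a frequency table just by changing the operation.
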
The LDB package comes with the Nijmegen Corpus, a 130,000 word
collection of modern British English with a full syntactic analysis of
each utterance. Each node in the tree (i.e. each constituent in the
utterance) carries a function label and a category label. In the future
more corpora will become available. Furthermore, since the database
system is independent of both formalism and language, it is possible to
use it for any other kind of analyzed corpus.

The LDB package requires one of the following:

(1) VAX with VMS;
(2) IBM PC (AT preferred), 640K RAM, hard disk, at least one 1.2 Mb
    high-capacity diskette drive, MS-DOS, no special graphics hardware;
(3) any UNIX machine, competent C-compiler, enough knowledge about
    terminal and file I/O to be able to configure the program to the
    system.

The package is not copy protected. The source code (ca. 25,000 lines of
CDL2) is not available.

The price is Hfl. 100 for academic institutions and Hfl. 5000 for
others. [as of Jan. 1991 Hfl. 1 is about $ 0.60]

A user manual is not included in the academic distribution; the book
Linguistic Exploitation of Syntactic Databases (see Publications)
contains all necessary information and is priced at Hfl. 70.

A (fully functional) demonstration version for any MSDOS machine with a
hard disk is available

- on a 5.25" 360K diskette from the address below
- by ftp at in the directory pub/LDB
- by listserv from LISTSERV@HEARN as files
      LDBDEMOC INF TOSCA-L
      LDBDEMOC UUE TOSCA-L

For more information contact

    Hans van Halteren
    TOSCA Group
    Department of English
    University of Nijmegen
    P.O. Box 9103
    6500 HD Nijmegen
    The Netherlands
    tel: (+31)-080-512836
    e-mail:

Publications

van Halteren, Hans and Nelleke Oostdijk. ``Using an Analyzed Corpus as a
    Linguistic Database'', in Computers in Literary and Linguistic
    Computing, Proceedings of the XIIIth ALLC Conference (Norwich 1986),
    John Roper (vol. ed.), J. Hamesse and A. Zampolli (series eds.).

van Halteren, Hans and Theo van den Heuvel. Linguistic Exploitation of
    Syntactic Databases. (Rodopi, Amsterdam 1990).

de Haan, Pieter. ``Exploring the Linguistic Database: Noun Phrase
    Complexity and Language Variation'', in Corpus Linguistics and
    Beyond, Willem Meijs, ed. (Rodopi, Amsterdam 1987).