Path: utzoo!attcan!uunet!bionet!NCBI.NLM.NIH.GOV!pkarp
From: pkarp@NCBI.NLM.NIH.GOV (Peter Karp)
Newsgroups: bionet.molbio.genome-program
Subject: Re: feature table parsers - what's it all mean?
Message-ID:
Date: 3 Oct 90 14:01:33 GMT
Sender: daemon@genbank.bio.net
Lines: 47

Let me answer the question in a different way. From a computer science point of view the GenBank feature table is a language: every possible feature is a sentence (string) constructed from an alphabet of symbols (characters). A grammar is a formal mechanism for describing languages in a very precise way: a grammar specifies very clearly what sentences are legal within that language and what sentences are not. Thus, by writing a grammar for the GenBank feature table, GenBank is telling us what the allowable syntax of features is -- without it we can only guess as to what strings we (and our programs) can expect to see in feature tables.

A grammar is only a specification of a language -- it is not a computer program for recognizing the language. A parser is such a program. It takes as input a sentence in the language, and breaks the sentence down into its constituent parts (for example, a parser of English would break an English sentence down into verbs, nouns, adjectives, etc.). Parsers typically take different sorts of actions when they see different grammatical elements (an English parser would take one sort of action when it sees a verb and a different action when it sees a noun). Usually one cannot write a parser without referring to a grammar of the language.

A parser generator is a program that automatically creates a parser program from a grammar (the Unix YACC program is a parser generator).
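
To make the grammar/parser distinction concrete, here is a minimal sketch in Python. The grammar in the comment is a toy one, loosely modeled on location strings such as join(1..10,complement(20..30)); it is an assumption made for illustration and is not the grammar GenBank publishes. The names tokenize, Parser and parse are likewise hypothetical. The parser is written by hand (recursive descent), one method per grammar rule, which is roughly the kind of code a parser generator would otherwise produce for you.

```python
import re

# Toy grammar (hypothetical, NOT the official GenBank grammar):
#   location := range
#             | "complement" "(" location ")"
#             | "join" "(" location { "," location } ")"
#   range    := INT ".." INT

# The alphabet of tokens: integers, "..", parentheses, commas, keywords.
TOKEN = re.compile(r"\s*([0-9]+|\.\.|[(),]|[a-z]+)")


def tokenize(text):
    """Split a sentence into tokens; reject anything outside the alphabet."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"illegal character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    """Recursive-descent parser: one method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, want):
        got = self.peek()
        if got != want:
            raise SyntaxError(f"expected {want!r}, got {got!r}")
        self.pos += 1

    def location(self):
        tok = self.peek()
        if tok == "complement":
            self.pos += 1
            self.expect("(")
            inner = self.location()
            self.expect(")")
            return ("complement", inner)
        if tok == "join":
            self.pos += 1
            self.expect("(")
            parts = [self.location()]
            while self.peek() == ",":
                self.pos += 1
                parts.append(self.location())
            self.expect(")")
            return ("join", parts)
        return self.range()

    def range(self):
        start = self.integer()
        self.expect("..")
        end = self.integer()
        return ("range", start, end)

    def integer(self):
        tok = self.peek()
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"expected an integer, got {tok!r}")
        self.pos += 1
        return int(tok)


def parse(text):
    """Return a parse tree for a legal sentence, or raise SyntaxError."""
    parser = Parser(tokenize(text))
    tree = parser.location()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token {parser.peek()!r}")
    return tree
```

Calling parse("join(1..10,complement(20..30))") yields a nested tuple describing the join, the range and the complement, while a string such as "join(1..10" raises SyntaxError. The point of the sketch is that every accept/reject decision in the code traces back to a rule in the grammar comment; without that grammar, each decision would be a guess.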

The message here, as any computer scientist would have told you ten or fifteen years ago, is that anyone who defines a language (such as a text file that holds a database) that is to be used by a large number of people, will make those people's lives MUCH easier if they define a grammar for that language and make that grammar publicly available, preferably in a format that can be used by a publicly-available parser-generator program. Without such a grammar people can only guess as to the syntax of the language, people certainly cannot generate parser programs automatically, and even writing parser programs by hand can be VERY difficult.

Further, the very act of writing a grammar makes the authors of such a language think very hard about its properties: some languages are much easier than others for computers to parse, and often the act of writing a grammar will cause the language authors to transform the language into a new language that is easier to parse.

By analogy, releasing a language without a grammar is like having a society whose legal system operates without written laws -- there are a lot of grey areas that will make many people very nervous.

In summary, IF you release a computer database as a text file, PLEASE consider writing a grammar for the language that you are using.

End of sermon.