Path: utzoo!attcan!uunet!cs.utexas.edu!read From: read@cs.utexas.edu (Rob) Newsgroups: bionet.molbio.genome-program Subject: Re: bnf for genbank Summary: The published grammar is ambiguous Keywords: genbank, parser Message-ID: <938@bovina.cs.utexas.edu> Date: 1 Oct 90 19:31:55 GMT References: <9009242002.AA05503@genome.lanl.gov> Followup-To: GenBank Parser Organization: U. Texas CS Dept., Austin, Texas Lines: 265 Hi. The Feature Table Grammar published in the document "The DDBJ/EMBL/GenBank Feature Table: Definition", Version 1.03 August 6,1990, is technically ambiguous. There are two legal parses of some strings. The ambiuities are related to the many values that a qualifier value may have. Some qualifiers (e.g. /number) have integer values. However, a single integer is also a legal location specifier. So we have two derivations: 0) simple_value-> location-> absolute_location-> local_location-> base_position-> INTEGER 1) simple_value-> INTEGER Similarly, the grammar expresses that a may be a symbol, but a symbol may also be a . Thus: 0) simple_value-> location-> feature_name-> feature_label-> SYMBOL 1) simple_value-> SYMBOL are both legal derivations of this grammar. These technical ambiguities can be resolved by making some assumptions: 0) That a lexer can discriminate the controlled vocabularies of enumerated qualifiers, and that there are no unquoted strings used for qualifier values that are not feature labels or from the controlled vocabulary. This allows us to replace in the rule for with , a token for lexically recognized words in the controlled vocabularies. 1) That a lexer can discriminate qualifiers which take integer values. (This is not possible if a qualifier can have integer values and location values.) This allows us to remove the INTEGER token form the simple-value rule. These assumptions might be enforcible by the GenBank administration. Dr. Dan Davison (davison@menudo.uh.edu) provided me with the original bison grammar and the idea that the grammar was ambiguous. The grammar below shows my modifications. It "bisons" successfully with 10 shift/reduce errors and 0 reduce/reduce errors. --------------------------------------------------------------- Robert L. Read GENTOOLS PROJECT -- University of Texas Sarah K. Barron Center for High Performance Computing Matthew Witten read@cs.utexas.edu --------------------------------------------------------------- /* Cut Here */ /* WARNING -- This seemingly reasonable decision to make 's, and lexically recognized tokens may have to be amended. I don't think text_strings and literal_sequences can be lexically distinguished. It is unclear how much of the decoding must be done in the parser and what can be done in the lexer. */ /* This feature table grammar was originally produce by Dan Davison of */ /* UH -- davison@menudo.uh.edu */ /* I have modified it in an attempt to disambiguate it as reasonably as */ /* as possible; I make some assumptions about how values are used which */ /* 0) might not be intended by GenBank */ /* 1) might not be enforced by GenBank */ /* See comments below for details */ /* Robert L. Read GenTools Project - UT-CHPC */ %{ #define YYSTYPE double #define YYDEBUG #include #include %} %token SYMBOL %token INTEGER %token ELLIPSE %token DOUBLE_COLON %token COMPLEMENT %token JOIN %token ORDER %token GROUP %token ONE_OF %token REPLACE %token TEXT_STRING %token HEADER_1 %token HEADER_2 %token INTEGER_QUAL /* This token is produced by the lexer for the hopefully finite set of qualifier names which can be followed by a location. */ %token CONTROL /* This token is should be provided by the lexer when a "controlled vocabulary" value is recognized. This may be a painful or incomplete context sensitive lexing task. In particular, it assumes that each qualifier that can have an unquoted string value has a finite set of values. */ %% /* grammer rules and actions follow */ feature_table: feature_table_header feature_table_body ; feature_table_header: HEADER_1 | HEADER_2 ; feature_table_body: feature | feature_table_body feature ; feature: feature_key feature_details ; feature_key: SYMBOL | '-' ; feature_details: location qualifier_list | location ; location: absolute_location | feature_name | '"' literal_sequence '"' | functional_operator '(' location_list ')' ; /* I think these rules as written made the parser look too far ahead. I expanded "path" in the rules below. */ absolute_location: local_location | SYMBOL ':' local_location | SYMBOL DOUBLE_COLON SYMBOL ; feature_name: SYMBOL DOUBLE_COLON SYMBOL ':' SYMBOL | SYMBOL ':' SYMBOL | SYMBOL ; literal_sequence: SYMBOL ; functional_operator: | COMPLEMENT | JOIN | ORDER | GROUP | ONE_OF | REPLACE ; location_list: location | location_list ',' location ; local_location: base_position | between_position | base_range ; /* These rules had to be expanded to remove reduce/reduce conflict path: database DOUBLE_COLON primary_accession | primary_accession ; feature_label: SYMBOL ; */ base_position: INTEGER | low_base_bound | high_base_bound | two_base_bound ; between_position: base_position '^' base_position ; base_range: base_position ELLIPSE base_position ; /* These rules had to be expanded to remove reduce/reduce error database: SYMBOL ; primary_accession: SYMBOL ; */ low_base_bound: '>' INTEGER ; high_base_bound: '<' INTEGER ; two_base_bound: base_position '.' base_position ; qualifier_list: qualifier | qualifier_list qualifier ; /* Here I am assuming the lexer can distinguish a location- parametered qualifier from others. */ qualifier: '/' qualifier_name | '/' qualifier_name '=' value | '/' integer_qualifier '=' INTEGER ; integer_qualifier: INTEGER_QUAL ; qualifier_name: SYMBOL ; value: simple_value | '(' value_list ')' | '(' tagged_value_list ')' ; /* In the published grammar, and integer qualifier value is ambiguous because: simple_value-> location-> absolute_location-> local_location-> base_position-> INTEGER is legal, so either that path must be changed or INTEGER must be excluded for simple_value. I therefore count (above) on being able to discriminate INTEGER valued qualifiers from others. Similarly, simple_value-> location-> feature_name-> feature_label-> SYMBOL, so SYMBOL cannot be part of legal value. I count on being able to lexically distinguish the the "controlled vocabularies" which are legal values for particular qualifiers from "symbols". */ simple_value: location | reference_number | '"' TEXT_STRING '"' | CONTROL ; value_list: value | value_list '.' value ; tagged_value_list: tagged_value | SYMBOL ':' '(' value_list ')' | SYMBOL ':' '(' tagged_value_list ')' | tagged_value_list ',' tagged_value ; /* This rule produces no reduce/reduce errors. We might use it if we wanted to allow INTEGERS as tagged values. tagged_value: tag ':' '(' value_list ')' | tag ':' '(' tagged_value_list ')' | tag ':' INTEGER | tag ':' '"' TEXT_STRING '"' | tag ':' SYMBOL | tag ':' reference_number ; */ tagged_value: tag ':' value ; tag: SYMBOL ; reference_number: '[' INTEGER ']' ; %% yyerror (s) char *s; { printf ("%s\n", s); }