Path: utzoo!mnetor!uunet!mcvax!enea!liuida!gorry
From: gorry@senilix.liu.se (Goran Rydquist)
Newsgroups: comp.databases
Subject: Re: Database implementation/theory issues?
Message-ID: <733@senilix.liu.se>
Date: 4 Mar 88 02:41:16 GMT
References: <33671UH2@PSUVM> <232@cullsj.UUCP> <725@smidefix.liu.se> <415@white.gcm>
Organization: CIS Dept, Univ of Linkoping, Sweden
Lines: 98

I wrote this some time ago

>)The greatest evil in today database theory is the unquestioned (?) assumption
>)that the data model must be based on records.

and got answers like

>In twenty-five pages the author must have had some arguments so ...

I'll give you some of my own arguments, much inspired by the original
article of course.

Let's start with a definition [also by W. Kent].

    "By record we mean here a fixed sequence of field values, conforming
    to a static description usually contained in catalogs and/or in
    programs. The description consists mainly of name, length and data
    type for each field."

The static description in the definition is usually referred to as the
schema. The phrase "conforming to" implies that the schema is
*extracted* - that is, the information needed to interpret (or at least
process) the data is stored separately from the record itself. The major
idea of the record is that the schema is extracted. The motivation is
that we save space by avoiding the repetition of the same information.

The opposite view is the distributed schema. By a distributed schema I
mean that the information needed to interpret a data instance is stored
explicitly together with that instance. At a casual glance, the schema
of a record data model appears to be distributed.

The extraction is a computational, machine-oriented way of handling
large amounts of data. The space saved by extracting the schema easily
becomes illusory. The resulting rigid system does not handle variation
well, and a user is confronted with the unnatural requirement of
predicting the worst case.
This estimate is then allocated in every instance, resulting in much
waste.

A person is a good example of an entity in the real world. What
attributes would be needed to model a person? Consider name, address,
social security number, length, age, sex, maiden name etc. Not all of
these attributes are needed for every person instance - some people
haven't got a social security number, only women have maiden names etc.

    Person 1                      Person 2
    ----------------------        ----------------------
    name     "Stan Smith"         name         "Ann Smith"
    address  "Park Avenue 32"     address      "Main street 1"
    length   6'                   length       5'
    age      24                   age          22
                                  maiden-name  "Jones"

To accommodate the variations, we could:

- Define the record format to include the union of all relevant fields,
  where not all the fields are expected to have values in every record.
  These null values naturally lead to storage overhead, a user or
  application programmer is forced to predict every possible field that
  may appear in a person record, and there is no restriction on which
  fields should have values when.

- Allow the same field to have different meanings in different records.
  The meaning of the field would then be determined by an extra type
  field added to the record. Unfortunately, the interpretation of this
  record will only be known by the application that conceived it. The
  database and independent applications treat the two conceptually
  associated fields as separate chunks of data, with no known
  restrictions. Further, space will be wasted if not all the data in
  the union happens to be of equal size.

- Define a new record type for every combination of fields. This
  approach eliminates the storage space overhead, but if the data varies
  too much, the system will be littered with record types. The desired
  correspondence between entity and record disappears completely, and
  no restrictions exist that prevent two records from modeling the same
  entity at the same time.

Suppose we have a bank account record type.
An account can be allocated either to a corporation or to a person.
This relationship is naturally modeled by having an owned-by field in
the account record. The problem arises because persons are identified
by social security number, while corporations are identified by name
(string). The record modeling problems and the possible solutions are
similar to the ones described in the previous example. We could
possibly use a generic pointer type or something like that, but what we
really want is for the value of the owned-by field to be able to assume
more than one type.

A record is far from self-describing. Consider the problem of coding a
generic procedure that prints records in a common format. Such a
procedure must minimally know what data it is going to print, and the
format of this data. Programming languages typically use compilation to
hard-wire the schema into the code, which leaves no possibility of
querying a record instance about its composition.

Yeah man!

- gry
---
Goran Rydqvist          gorry@majestix.liu.se
---------
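
P.S. To make the generic print procedure problem concrete, here is a
minimal sketch in Python of what a distributed schema could look like.
It is an illustration under assumptions, not a proposal from the
original article: the names make_instance and print_instance, the
account numbers and the owner values are all hypothetical. Each value
is stored together with its own field name and type tag, so a printer
can ask the instance about its composition instead of relying on a
compiled-in schema.

```python
# Hypothetical sketch; make_instance and print_instance are illustrative
# names, not anything taken from the post or from W. Kent's article.

def make_instance(**fields):
    """Store each value together with its own field name and type tag."""
    return [(name, type(value).__name__, value) for name, value in fields.items()]


def print_instance(instance):
    """Generic printer: it reads names and types from the instance itself."""
    return "\n".join(
        f"{name:<12} {type_tag:<4} {value!r}" for name, type_tag, value in instance
    )


# Only the attributes that exist are stored: no nulls, no worst-case layout.
stan = make_instance(name="Stan Smith", address="Park Avenue 32", age=24)
ann = make_instance(name="Ann Smith", address="Main street 1", age=22,
                    maiden_name="Jones")

# owned_by may hold a number (person) or a string (corporation).
# Both values are made up for the example.
account_a = make_instance(number=1001, owned_by=123456789)
account_b = make_instance(number=1002, owned_by="Acme Corp")

for instance in (stan, ann, account_a, account_b):
    print(print_instance(instance))
    print()
```

The price is the one the extracted schema was meant to avoid: field
names and type tags are repeated in every instance.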