Path: utzoo!mnetor!uunet!mcvax!enea!liuida!gorry
From: gorry@senilix.liu.se (Goran Rydquist)
Newsgroups: comp.databases
Subject: Re: Database implementation/theory issues?
Message-ID: <733@senilix.liu.se>
Date: 4 Mar 88 02:41:16 GMT
References: <33671UH2@PSUVM> <232@cullsj.UUCP> <725@smidefix.liu.se> <415@white.gcm>
Organization: CIS Dept, Univ of Linkoping, Sweden
Lines: 98

I wrote this some time ago

>)The greatest evil in today database theory is the unquestioned (?) assumption
>)that the data model must be based on records.

and got answers like

>In twenty-five pages the author must have had some arguments

so ... I'll give you some of my own arguments, much inspired by the original
article of course.

Let's start with a definition [also by W. Kent].

"By record we mean here a fixed sequence of field values, conforming to a
static description usually contained in catalogs and/or in programs. The
description consists mainly of name, length and data type for each field."

The static description in the definition is usually referred to as the schema.
The phrase "conforming to" implies that the schema is *extracted* - that is,
the information needed to interpret (or at least process) the data is stored
separately from the record itself.

The central idea of the record is that the schema is extracted. The motivation
is that we save space by not repeating the same information in every instance.
The opposite view is the distributed schema. By a distributed schema I mean
that the information needed to interpret a data instance is stored explicitly
together with that instance. At a casual glance, the schema in a record data
model appears to be distributed. The extraction is a computational,
machine-oriented way of handling large amounts of data.
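
The difference can be sketched in a few lines of Python. This is only an
illustration: the schema PERSON_SCHEMA, the field widths, and the function
names are assumptions made for the sketch and come from nowhere else.

```python
import struct

# Extracted schema: field names, lengths, and types live apart from the data.
# PERSON_SCHEMA and its field widths are hypothetical, chosen for this sketch.
PERSON_SCHEMA = [("name", "20s"), ("address", "20s"), ("age", "B")]
PERSON_FORMAT = "<" + "".join(code for _, code in PERSON_SCHEMA)

def pack_person(name, address, age):
    """Store only the values; the bytes mean nothing without the schema."""
    return struct.pack(PERSON_FORMAT, name.encode(), address.encode(), age)

def unpack_person(raw):
    """Interpret the bytes by consulting the separately stored schema."""
    values = struct.unpack(PERSON_FORMAT, raw)
    result = {}
    for (field, code), value in zip(PERSON_SCHEMA, values):
        if code.endswith("s"):
            # Fixed-width strings are padded with NUL bytes; strip them.
            value = value.rstrip(b"\0").decode()
        result[field] = value
    return result

# Distributed schema: every instance carries its own field names.
distributed = {"name": "Ann Smith", "address": "Main street 1",
               "age": 22, "maiden-name": "Jones"}

raw = pack_person("Stan Smith", "Park Avenue 32", 24)
print(len(raw))             # same size for every person, whatever the values
print(unpack_person(raw))   # readable only because the schema was supplied
print(sorted(distributed))  # the instance itself can list what it contains
```

Note that the packed record has no room for a maiden name at all unless the
schema is changed for every instance, while the distributed instance simply
carries the extra field.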

The space saved by extracting the schema easily becomes illusory. The
resulting rigid system does not handle variation well, and a user is confronted
with the unnatural requirement of predicting the worst case. This worst-case
estimate is then allocated in every instance, resulting in much waste.
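
A rough feel for the waste, as a small Python calculation. The worst-case
widths in FIELD_WIDTHS are invented for the illustration, and one byte per
character is assumed.

```python
# Hypothetical worst-case widths picked in advance by the schema designer.
FIELD_WIDTHS = {"name": 30, "address": 40, "maiden-name": 30}

people = [
    {"name": "Stan Smith", "address": "Park Avenue 32"},
    {"name": "Ann Smith", "address": "Main street 1", "maiden-name": "Jones"},
]

def allocated_bytes(records):
    """Every record reserves the full worst-case width of every field."""
    return len(records) * sum(FIELD_WIDTHS.values())

def used_bytes(records):
    """Bytes actually occupied by values (one byte per character assumed)."""
    return sum(len(value) for record in records for value in record.values())

print(allocated_bytes(people), used_bytes(people))
```

With these made-up widths, most of the reserved space holds nothing but
padding.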

A person is a good example of an entity in the real world. What attributes
would be needed to model a person? Consider name, address, social security
number, height, age, sex, maiden name etc. Not all of these attributes are
needed for every person instance - some people haven't got a social security
number, only girls have maiden names etc.

Person 1                         Person 2
----------------------           ----------------------
name        "Stan Smith"         name        "Ann Smith"
address     "Park Avenue 32"     address     "Main street 1"
height      6'                   height      5'
age         24                   age         22
                                 maiden-name "Jones"

To accommodate the variations, we could:

	- Define the record format to include the union of all relevant
	fields, where not all the fields are expected to have values in
	every record. The resulting null values naturally lead to storage
	overhead. A user or application programmer is forced to predict
	every possible field that may appear in a person record, and there
	is no restriction on which fields should have values when.

	- Allow the same field to have different meanings in different
	records.  The meaning of the field would then be indicated by an
	extra type field added to the record. Unfortunately the
	interpretation of this record will only be known by the application
	that conceived it. The database and independent applications treat
	the two conceptually associated fields as separate chunks of data,
	with no known restrictions. Further, space will be wasted unless all
	the data in the union happens to be of equal size.

	- Define a new record type for every combination of fields. This
	approach eliminates the storage space overhead, but if the data
	varies too much, the system will be littered with record types. The
	desired correspondence between entity and record disappears
	completely, and no restrictions exist that prevent two records from
	modeling the same entity at the same time.
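
The three alternatives might look as follows when written down as Python
dataclasses. This is a sketch only; all class names, field names, and tag
strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# 1. Union of all fields: values that do not apply become nulls (None).
@dataclass
class PersonUnion:
    name: str
    address: str
    age: int
    maiden_name: Optional[str] = None
    social_security_number: Optional[str] = None

# 2. One field with several meanings, selected by an extra type field.
#    Only the application that wrote the record knows the convention.
@dataclass
class PersonTagged:
    name: str
    address: str
    age: int
    extra_kind: str  # hypothetical tags such as "maiden-name" or "ssn"
    extra: str

# 3. One record type per combination of fields.
@dataclass
class Person:
    name: str
    address: str
    age: int

@dataclass
class PersonWithMaidenName:
    name: str
    address: str
    age: int
    maiden_name: str

stan = PersonUnion("Stan Smith", "Park Avenue 32", 24)
ann = PersonTagged("Ann Smith", "Main street 1", 22, "maiden-name", "Jones")

# Nothing prevents both of these from modeling the same Ann at the same time.
ann_a = Person("Ann Smith", "Main street 1", 22)
ann_b = PersonWithMaidenName("Ann Smith", "Main street 1", 22, "Jones")
```

In the first variant every Stan carries two empty slots; in the second,
nothing ties extra_kind to extra except programmer discipline; in the third,
each new optional attribute doubles the number of possible record types.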

Suppose we have a bank account record type. An account can belong to either
a corporation or a person. This relationship is naturally modeled by having
an owned-by field in the account record.

The problem arises because persons are identified by social security number,
while corporations are identified by name (string). The record modeling
problems and the possible solutions are similar to the ones that were
described in the previous example. We could possibly use a generic pointer
type or something like that, but what we really want is for the owned-by
field to be able to hold values of more than one type.
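
In a language with union types this can at least be written down directly.
A minimal Python sketch, in which the class names, the Owner alias, and the
sample values are all made up for the illustration:

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class PersonId:
    social_security_number: str  # persons are identified by number

@dataclass(frozen=True)
class CorporationId:
    name: str  # corporations are identified by name (string)

# The owned-by field may hold a value of either type.
Owner = Union[PersonId, CorporationId]

@dataclass
class Account:
    number: int
    owned_by: Owner

def describe_owner(account):
    """Ask the value itself which kind of owner it is."""
    owner = account.owned_by
    if isinstance(owner, PersonId):
        return "person " + owner.social_security_number
    return "corporation " + owner.name

accounts = [
    Account(1, PersonId("000-00-0000")),      # placeholder number
    Account(2, CorporationId("Example AB")),  # placeholder name
]
for account in accounts:
    print(account.number, describe_owner(account))
```

The point is that the value carries its own type, so no separate type field
or application-private convention is needed to interpret it.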

A record is far from self-describing. Consider the problem of coding a generic
procedure that prints records in a common format. Such a procedure must
minimally know what data it is going to print, and the format of this data.
Programming languages typically use compilation to hard-wire the schema into
the code, which leaves no possibility of querying a record instance about its
composition.
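
By contrast, when each instance carries its own description, a generic
print procedure is short. The sketch below assumes instances are Python
dictionaries mapping field names to values; format_instance is a hypothetical
name.

```python
def format_instance(instance):
    """Format any self-describing instance; no schema argument is needed."""
    # Ask the instance which fields it has and align the names in a column.
    width = max((len(field) for field in instance), default=0)
    return "\n".join(f"{field:<{width}} {value!r}"
                     for field, value in instance.items())

stan = {"name": "Stan Smith", "address": "Park Avenue 32", "age": 24}
ann = {"name": "Ann Smith", "address": "Main street 1", "age": 22,
       "maiden-name": "Jones"}

for person in (stan, ann):
    print(format_instance(person))
    print()
```

The same procedure handles both persons although they have different sets
of fields, because it queries each instance for its composition instead of
relying on a compiled-in layout.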

Yeah man!
		- gry
---
Goran Rydqvist			gorry@majestix.liu.se
---------