Path: utzoo!attcan!uunet!odi!jack
From: jack@odi.com (Jack Orenstein)
Newsgroups: comp.databases
Subject: Re: Extended RDB vs OODB
Message-ID: <1989Aug17.211534.28345@odi.com>
Date: 17 Aug 89 21:15:34 GMT
References: <3560052@wdl1.UUCP> <408@odi.ODI.COM> <3324@rtech.rtech.com> <1989Aug11.143036.24703@odi.com> <1765@ethz.UUCP> <5259@wiley.UUCP>
Reply-To: jack@odi.com (Jack Orenstein)
Organization: Object Design Inc., Burlington, MA
Lines: 172

In this first round of the comp.databases relational vs. OO DBMS wars,
Jon Krueger has asked some very reasonable questions that go to the
heart of the issues that Dennis Moore (RTI), Dan Weinreb and I (Object
Design), and Bruce Speyer (MCC) have been discussing.  I think the
essential statements are the following:

   ... I'd like to divide the question: efficient implementation of
   a data model versus inherently bad performance of some models for some
   operations.  Recent traffic has confused the two issues without
   addressing either.  It tells us very little that a current DBMS
   performs poorly.  References to applications without specifying their
   operations or describing their design tell us nothing.
   
   For instance, Bruce alludes to operations like "netlist a circuit" and
   "package the electronics".  It would be wonderful indeed to understand
   the electronics that underlies all the computing we do, but I'll settle
   for characterizing some operations that engineers need.  Can you
   specify these operations in some terms we can understand?  Or simpler
   ones?  How might one implement them with a relational data model?  Are
   there data models that can be shown inherently better for some of these
   operations? 

The starting point has to be the second statement, which addresses
user requirements.  One of the major conclusions of our (Object
Design's) requirements analysis was, as Dan and I have stated in
earlier postings, that "object fetching" - finding an object given its
id - must be as fast as possible.  I will therefore try to answer Mr.
Krueger's questions by focussing on this one operation.

These statements ask (correct me if I'm wrong) whether something in
the design of a given model leads to, or precludes, the implementation
techniques needed to make important operations efficient in the
application areas being considered (object fetching, for now).

Consider an application that has to access some persistent data and
manipulate it, possibly updating it, using subroutines written in a
programming language. This is an extremely common scenario among the CAx
developers that Object Design has talked to.

Writing such an application on top of a relational DBMS requires the use
of two languages, the host programming language and the query language
of the RDBMS. The programmer therefore has to deal with two type
systems, and except for the simplest types such as integer and maybe
string, conversions are required between the DBMS representation of a
type and that of the host language. 

This problem is most severe for object ids.  On the host language
side, object ids are simply addresses, or pointers, and object
fetching involves following the pointer (e.g.  thing* p; ...; widget*
w = p->frammis; in C or C++). On the RDBMS side, object fetching
involves at least a selection starting from a specific key value, or a
join. Each retrieval from the RDBMS will load some data that can be
accessed through the host language, but when the "boundaries" of the
retrieved data are reached, it is time to submit another query.
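
To make the contrast concrete, here is a minimal sketch of the two
kinds of object fetching. Everything in it beyond thing, widget and
frammis is made up for illustration: toy_db, widget_row, thing_row and
select_widget are hypothetical names, and the std::map only stands in
for an indexed table. A real RDBMS would add query parsing,
communication with the server and type conversion on top of the
lookup.

```cpp
#include <cassert>
#include <map>
#include <string>

// Host-language side: hypothetical types named after the snippet above.
struct widget { std::string name; };
struct thing  { widget* frammis; };

// The object id is an address, so a fetch is a single dereference.
inline widget* fetch_by_pointer(const thing* p) {
    assert(p != nullptr);
    return p->frammis;
}

// RDBMS side, simulated: the object id is a key value stored in a row.
struct widget_row { int id; std::string name; };
struct thing_row  { int id; int frammis_id; };

struct toy_db {
    std::map<int, widget_row> widgets;  // stands in for an indexed table
    int queries = 0;                    // how many selections were issued

    // Roughly: SELECT * FROM widgets WHERE id = :key
    const widget_row* select_widget(int key) {
        ++queries;
        auto it = widgets.find(key);
        return it == widgets.end() ? nullptr : &it->second;
    }
};
```

Following t.frammis costs nothing beyond the dereference, whereas
reaching the same widget from a thing_row means calling
db.select_widget(row.frammis_id), i.e. one more query each time the
program crosses from one object to the next.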

Earlier postings have identified three patterns of "interleaving" of
host language and RDBMS actions:

        1. Retrieve only the object required at the moment by passing
           its key to the DBMS.

        2. Retrieve all the objects that will be required for some
           part of the application. This can be done by grouping
           objects (e.g. using a view), and retrieving the group members
           by providing the key of the group.

        3. Same as 2, but the objects to be manipulated are not
           stored individually in the database. Instead, the data
           is organized as a "blob" or "long field", which is
           requested by its key, i.e. groups are replaced by blobs.
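
The three patterns can be sketched against a toy in-memory stand-in
for the DBMS. This is only an illustration: toy_dbms, row and the
fetch_* functions are hypothetical names, not any vendor's interface,
and the query counter is the only "cost" being modelled.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct row { int key; int group; std::string payload; };

class toy_dbms {
public:
    void store(const row& r) { rows_[r.key] = r; }
    void store_blob(int key, const std::string& bytes) { blobs_[key] = bytes; }

    // Pattern 1: one query per object, selected by its key.
    row fetch_one(int key) {
        ++queries_;
        auto it = rows_.find(key);
        assert(it != rows_.end());
        return it->second;
    }

    // Pattern 2: one query returns every member of a group,
    // whether or not the caller needs all of them.
    std::vector<row> fetch_group(int group) {
        ++queries_;
        std::vector<row> out;
        for (const auto& kv : rows_)
            if (kv.second.group == group) out.push_back(kv.second);
        return out;
    }

    // Pattern 3: one query returns an uninterpreted blob; the DBMS
    // knows nothing about the objects packed inside it.
    std::string fetch_blob(int key) {
        ++queries_;
        auto it = blobs_.find(key);
        assert(it != blobs_.end());
        return it->second;
    }

    int queries() const { return queries_; }

private:
    std::map<int, row> rows_;
    std::map<int, std::string> blobs_;
    int queries_ = 0;
};
```

With this stand-in, touching three objects under pattern 1 costs three
queries, while pattern 2 costs one query but hands back the whole
group, and pattern 3 costs one query but returns bytes that only the
application can interpret.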
        
It sounds like Mr. Speyer used approach 1 in his application:

   About 3 years ago I tried putting an electronic information model on
   top of a relational system.  It took about 30-40 times longer to
   netlist a circuit then it did using a fairly inefficient internally
   developed memory-based database system. An operation such as packaging
   the electronics is much worse since it must transverse much more of
   the electronic information model and be constantly refering to the
   library portion of the model which was distributed to another database
   (making the join operation much more expensive).

Mr. Moore suggested that he should have used approach 2 or 3
instead (the description is not specific enough to say which):

   Let me posit a different architecture for your electronic information
   model.  Could you have read in all the data into memory from an RDBMS
   and performed the same manipulations in-core that you did in your
   system?  The advantage to this architecture is that you can lock the
   records while you are manipulating them (with THREE WORDS ("FOR DIRECT
   UPDATE"), as opposed to many lines of code), you get all the
   transaction processing capabilities of the DBMS (i.e. rollback,
   savepoints, commit), you get all the utilities of the DBMS, etc.  To
   put it in a few words, YOU GET THE *MS* FROM THE DBMS, and you do your
   own processing.

Elsewhere, he specifically suggests approach 3:

   For instance, you could store a CASE diagram as a BLOB in real-time,
   and fire off an asynch database procedure which invokes a method which
   does all kinds of stuff, including storing the thing in a normalized
   fashion (for reports etc.), and potentially invoking a compiler to
   create a new whole version, etc.  Would this not be good enough?
   There will be a tradeoff between disk space and access and storage
   times, though, regardless of OO or R/OO.

#1 is too slow to be practical, as suggested by Mr. Speyer and by our
discussions with our potential customers. A query per (small) object
is too expensive.

#2 has the drawback that all members of a group must be retrieved in
order to gain access to any one member. This is wasteful if only a
handful of objects are actually needed.  Furthermore, there may be a
large number of joins and selections necessary to extract the required
data, and the data then has to be converted to host-language
structures.  This is pure overhead due to the use of two languages.

#3 fails to capture any relationships internal to the long field (the
"blob") unless the programmer explicitly asks for the information to
be captured and sent back to the DBMS (as pointed out by Mr. Moore).
Again, this is overhead due to the use of two languages.

Going back to Mr. Krueger's question: are these problems inherent in
the relational model? No, they are due to the two-language paradigm
supported by all RDBMS vendors. In fact, the relational model doesn't
address the issue of how to interact with a more powerful
general-purpose programming language. Languages like Pascal/R, RIGEL,
and Aldat show that a smoother integration is possible. (See Atkinson
and Buneman's extremely thorough review of DB programming languages in
ACM Computing Surveys, June 1987.) It is extremely unlikely that any
of these languages will see widespread use, since they are
non-standard (i.e. non-C and non-SQL) replacements for existing query
languages AND programming languages.

An OO DBMS does not present users with the two-language problem
characteristic of RDBMSs.  Or at least this is true of the system we're
building at Object Design.  Instead, there is a single type system, and
a type may have both transient and persistent instances.  Once a
persistent object has been created, it can be accessed and manipulated
in the same way as any other object. Our system will be C++-based, so we
have adopted the C++ type system.

The programmer does not have to fire off a query to a DBMS in a second
language in order to access persistent data.  A pointer can be
followed in the usual way (e.g.  *p, or p->field), even if the target
is persistent.  If the object happens to be in working memory, then
nothing out of the ordinary happens, and the speed of the access is
the same as for access to a transient object (and the same as what a
C++ programmer is used to). Otherwise, the requested object, along
with some objects stored nearby, is brought in from the database
automatically.
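
The behaviour described above can be imitated in a few lines. This is
a rough sketch only, and not a description of how our system is
implemented: toy_store, persistent_ptr, part and the fixed cluster size
are all hypothetical, the "database" is just a vector, and writing
changes back is not modelled. The handle overloads -> and * so that
the access still reads like ordinary C++.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

struct part { int id; int weight; };

// Simulated database: objects stored in order, fetched a cluster at a time.
class toy_store {
public:
    toy_store(std::vector<part> disk, std::size_t cluster)
        : disk_(std::move(disk)), cluster_(cluster) { assert(cluster_ > 0); }

    // Return the working-memory copy of object `slot`. If it is not yet
    // resident, bring it in together with the objects stored near it.
    part* resolve(std::size_t slot) {
        assert(slot < disk_.size());
        auto it = cache_.find(slot);
        if (it == cache_.end()) {
            ++faults_;
            std::size_t first = slot - slot % cluster_;
            for (std::size_t i = first; i < first + cluster_ && i < disk_.size(); ++i)
                cache_.emplace(i, disk_[i]);
            it = cache_.find(slot);
        }
        return &it->second;
    }

    int faults() const { return faults_; }

private:
    std::vector<part> disk_;
    std::size_t cluster_;
    std::unordered_map<std::size_t, part> cache_;  // objects in working memory
    int faults_ = 0;                               // trips to the "database"
};

// Handle that is followed like a pointer: p->weight or (*p).weight.
class persistent_ptr {
public:
    persistent_ptr(toy_store& s, std::size_t slot) : store_(&s), slot_(slot) {}
    part* operator->() const { return store_->resolve(slot_); }
    part& operator*()  const { return *store_->resolve(slot_); }
private:
    toy_store* store_;
    std::size_t slot_;
};
```

The first access to an object in a cluster pays for one trip to the
store; later accesses to it, or to its neighbours in the same cluster,
find the object already in working memory and issue no query at all.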

Concurrency control and recovery are present, as with any DBMS.  There
is certainly nothing inherent in the relational model or lacking from
any OO model that limits these features to relational DBMSs.


CONCLUSION

The two-language paradigm of RDBMSs complicates the writing of
applications, and has performance consequences as well.  This is a
problem with implementations of the relational model, and not inherent
in the relational model. OO DBMSs avoid these problems by offering a
single language in which to write applications, pushing the
responsibility for database access into the system, away from the
user.


Jack Orenstein
Object Design, Inc.