Path: utzoo!attcan!uunet!odi!jack
From: jack@odi.com (Jack Orenstein)
Newsgroups: comp.databases
Subject: Re: Extended RDB vs OODB
Message-ID: <1989Aug17.211534.28345@odi.com>
Date: 17 Aug 89 21:15:34 GMT
References: <3560052@wdl1.UUCP> <408@odi.ODI.COM> <3324@rtech.rtech.com> <1989Aug11.143036.24703@odi.com> <1765@ethz.UUCP> <5259@wiley.UUCP>
Reply-To: jack@odi.com (Jack Orenstein)
Organization: Object Design Inc., Burlington, MA
Lines: 172

In this first round of the comp.databases relational vs. OO DBMS wars, Jon Krueger has asked some very reasonable questions that go to the heart of the issues that Dennis Moore (RTI), Dan Weinreb and I (Object Design), and Bruce Speyer (MCC) have been discussing. I think the essential statements are the following:

    ...

    I'd like to divide the question: efficient implementation of a data model versus inherently bad performance of some models for some operations. Recent traffic has confused the two issues without addressing either. It tells us very little that a current DBMS performs poorly. References to applications without specifying their operations or describing their design tell us nothing. For instance, Bruce alludes to operations like "netlist a circuit" and "package the electronics". It would be wonderful indeed to understand the electronics that underlies all the computing we do, but I'll settle for characterizing some operations that engineers need. Can you specify these operations in some terms we can understand? Or simpler ones? How might one implement them with a relational data model? Are there data models that can be shown inherently better for some of these operations?

The starting point has to be the second statement, which addresses user requirements. One of the major conclusions of our (Object Design's) requirements analysis was, as Dan and I have stated in earlier postings, that "object fetching" - finding an object given its id - must be as fast as possible. I will therefore try to answer Mr.
Krueger's questions by focussing on this one operation.

These statements are asking (correct me if I'm wrong) whether there is something in the design of a given model that leads to or precludes certain implementation techniques required for efficient implementation of operations that are important in the application areas being considered (object fetching, for now).

Consider an application that has to access some persistent data and manipulate it, possibly updating it, using subroutines written in a programming language. This is an extremely common scenario among the CAx developers that Object Design has talked to.

Writing such an application on top of a relational DBMS requires the use of two languages, the host programming language and the query language of the RDBMS. The programmer therefore has to deal with two type systems, and except for the simplest types such as integer and maybe string, conversions are required between the DBMS representation of a type and that of the host language.

This problem is most severe for object ids. On the host language side, object ids are simply addresses, or pointers, and object fetching involves following the pointer (e.g. thing* p; ...; widget* w = p->frammis; in C or C++). On the RDBMS side, object fetching involves at least a selection starting from a specific key value, or a join.

Each retrieval from the RDBMS will load some data that can be accessed through the host language, but when the "boundaries" of the retrieved data are reached, it is time to submit another query. Earlier postings have identified three patterns of "interleaving" of host language and RDBMS actions:

1. Retrieve only the object required at the moment by passing its key to the DBMS.

2. Retrieve all the objects that will be required for some part of the application. This can be done by grouping objects (e.g. using a view), and retrieving the group members by providing the key of the group.

3. Same as 2, but the objects to be manipulated are not stored individually in the database. Instead, the data is organized as a "blob" or "long field", which is requested by its key, i.e. groups are replaced by blobs.

It sounds like Mr. Speyer used approach 1 in his application:

    About 3 years ago I tried putting an electronic information model on top of a relational system. It took about 30-40 times longer to netlist a circuit than it did using a fairly inefficient internally developed memory-based database system. An operation such as packaging the electronics is much worse since it must traverse much more of the electronic information model and be constantly referring to the library portion of the model which was distributed to another database (making the join operation much more expensive).

Mr. Moore suggested that he should have used approach 2 or 3 instead (the description is not specific enough to say which):

    Let me posit a different architecture for your electronic information model. Could you have read in all the data into memory from an RDBMS and performed the same manipulations in-core that you did in your system? The advantage to this architecture is that you can lock the records while you are manipulating them (with THREE WORDS ("FOR DIRECT UPDATE"), as opposed to many lines of code), you get all the transaction processing capabilities of the DBMS (i.e. rollback, savepoints, commit), you get all the utilities of the DBMS, etc. To put it in a few words, YOU GET THE *MS* FROM THE DBMS, and you do your own processing.

Elsewhere, he is specifically suggesting approach 3:

    For instance, you could store a CASE diagram as a BLOB in real-time, and fire off an asynch database procedure which invokes a method which does all kinds of stuff, including storing the thing in a normalized fashion (for reports etc.), and potentially invoking a compiler to create a new whole version, etc. Would this not be good enough?
    There will be a tradeoff between disk space and access and storage times, though, regardless of OO or R/OO.

#1 is too slow to be practical, as suggested by Mr. Speyer and by our discussions with our potential customers. A query per (small) object is too expensive.

#2 has the drawback that all members of a group must be retrieved in order to gain access to any of them. This is wasteful if only a handful of objects are actually needed. Furthermore, a large number of joins and selections may be necessary to extract the required data, and the data then has to be converted to host-language structures. This is pure overhead due to the use of two languages.

#3 fails to capture any relationships internal to the long field (the "blob") unless the programmer explicitly asks for the information to be captured and sent back to the DBMS (as pointed out by Mr. Moore). Again, this is overhead due to the use of two languages.

Going back to Mr. Krueger's question: are these problems inherent in the relational model? No, they are due to the two-language paradigm supported by all RDBMS vendors. In fact, the relational model doesn't address the issue of how to interact with a more powerful general-purpose programming language. Languages like Pascal/R, RIGEL, and Aldat show that a smoother integration is possible. (See Atkinson and Buneman's extremely thorough review of DB programming languages in ACM Computing Surveys, June 1987.) It is extremely unlikely that any of these languages will see widespread use, since they are non-standard (i.e. non-C and non-SQL) replacements of existing query languages AND programming languages.

An OO DBMS does not present users with the two-language problem characteristic of RDBMSs. Or at least this is true of the system we're building at Object Design. Instead, there is a single type system, and a type may have both transient and persistent instances.
Once a persistent object has been created, it can be accessed and manipulated in the same way as any other object. Our system will be C++-based, so we have adopted the C++ type system.

The programmer does not have to fire off a query to a DBMS in a second language in order to access persistent data. A pointer can be followed in the usual way (e.g. *p, or p->field), even if the target is persistent. If the object happens to be in working memory, then nothing out of the ordinary happens, and the speed of the access is the same as for access to a transient object (and the same as what a C++ programmer is used to). Otherwise, the requested object, along with some objects stored nearby, is brought in from the database automatically.

Concurrency control and recovery are present, as with any DBMS. There is certainly nothing inherent in the relational model or lacking from any OO model that limits these features to relational DBMSs.

CONCLUSION

The two-language paradigm of RDBMSs complicates the writing of applications, and has performance consequences as well. This is a problem with implementations of the relational model, and not inherent in the relational model. OO DBMSs avoid these problems by offering a single language in which to write applications, pushing the responsibility for database access into the system, away from the user.

Jack Orenstein
Object Design, Inc.
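
The single-language access pattern described above (follow a pointer; if the target is not in working memory, it is brought in automatically) can be illustrated with a small C++ sketch. This is only an illustration under stated assumptions, not a description of Object Design's system: Database, StoredWidget, Ref, Cache, and chain_length are hypothetical names, the "database" is an in-memory map, and the sketch loads one object per fault rather than also bringing in objects stored nearby. Concurrency control and recovery are left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the database: each row is the stored form of
// a widget, holding a name and the key of the next widget in a chain.
struct StoredWidget {
    std::string name;
    int next_key;  // 0 means "no next widget"
};

struct Database {
    std::map<int, StoredWidget> rows;
    std::size_t fetches = 0;  // counts simulated round trips

    const StoredWidget& fetch(int key) {
        ++fetches;
        return rows.at(key);
    }
};

struct Widget;
using Cache = std::unordered_map<int, std::unique_ptr<Widget>>;

// Hypothetical persistent reference: it holds a key and loads the target
// into the cache the first time it is dereferenced.
class Ref {
public:
    Ref() = default;
    Ref(Database* db, Cache* cache, int key)
        : db_(db), cache_(cache), key_(key) {}
    explicit operator bool() const { return key_ != 0; }
    Widget* operator->() const;

private:
    Database* db_ = nullptr;
    Cache* cache_ = nullptr;
    int key_ = 0;
};

struct Widget {
    std::string name;
    Ref next;
};

Widget* Ref::operator->() const {
    auto it = cache_->find(key_);
    if (it == cache_->end()) {
        // Not in working memory: fetch the row and build the object.
        const StoredWidget& row = db_->fetch(key_);
        auto w = std::make_unique<Widget>();
        w->name = row.name;
        w->next = Ref(db_, cache_, row.next_key);
        it = cache_->emplace(key_, std::move(w)).first;
    }
    return it->second.get();
}

// Application code: walks the chain with ordinary pointer syntax and
// never issues a query of its own.
std::size_t chain_length(Ref r) {
    std::size_t n = 0;
    while (r) {
        ++n;
        r = r->next;
    }
    return n;
}
```

Given a Database holding a chain of three rows, a first call to chain_length on a Ref to the head costs three simulated fetches, and a second call costs none, because every object is already in the cache. The application-side loop is the same in both cases, which is the point of the single-language approach: the decision to go to the database is made by the system, not written out by the programmer.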