Path: utzoo!mnetor!tmsoft!torsqnt!news-server.csri.toronto.edu!bonnie.concordia.ca!thunder.mcrcim.mcgill.edu!snorkelwacker.mit.edu!think.com!zaphod.mps.ohio-state.edu!rpi!crdgw1!barnett
From: barnett@grymoire.crd.ge.com (Bruce Barnett)
Newsgroups: comp.databases
Subject: Summary: Literature on two-phase commit, distributed databases, etc.
Message-ID:
Date: 31 Jan 91 14:19:23 GMT
References: <87821@tut.cis.ohio-state.edu>
Sender: news@crdgw1.crd.ge.com
Reply-To: barnett@crdgw1.ge.com
Organization: GE Corp. R & D, Schenectady, NY
Lines: 111
In-reply-to: doug@bear.cis.ohio-state.edu's message of 30 Jan 91 16:00:26 GMT

Here is the summary I got. Thanks guys!

From: Tore.Saeter@elab-runit.sintef.no

The standard book I read as a general introduction to distributed
databases was "Distributed Database Principles and Systems",
Stefano Ceri, Giuseppe Pelagatti, McGraw-Hill Book Comp., 1985.

-----------------
From: telxon!craign@uunet.uu.net (Craig Nirosky)

Regarding your request for information about two-phase commit, etc.,
I would recommend the "classic":

    An Introduction to Database Systems
    Volume II
    C. J. Date
    Addison-Wesley Publishing Company

----------------------
From: Richard Bielak

Ha, ha. I guess these guys never heard of Murphy's Law.

Anyway, there was a decent article about TP systems in a recent issue
of "Communications of the ACM". It was the November 1990 issue. The
article was titled "Transaction Processing Monitors".

Another reference would be any database textbook. I personally like
the one by Ullman (sp?) called "Database Systems".

------------
From: sanjay@cs.wisc.edu (Sanjay Krishnamurthi)

Unless you want to get hold of old publications, the best place to look
is in any modern text that has a chapter on distributed databases. The
book by Korth would be adequate. If you want more detail I suggest you
look at Ceri & Pelagatti. It has an extensive bibliography as well.

It is
Database System Concepts: Henry Korth & A. Silberschatz.
ISBN 0-07-044752-7

> Re: [book by Ceri & Pelagatti]

An excellent text but a little too detailed if you only want to point
out the potential problems.

----------------------------------
From: weems@evax.uta.edu (Bob Weems)

I have taught a graduate-level distributed DB course for the last five
years. I would recommend the book by Ceri and Pelagatti as a starting
point, even though it is dated. It is published by McGraw-Hill. The
recent text by Tamer Ozsu, published by Prentice-Hall, is essentially a
superset of C&P and is current.

As far as hacking together a DDB, it is probably doable within a
relatively small organization, but it will sacrifice so many features
that the "distributed" adjective becomes inappropriate.

A few thoughts.

1. Will the data be fully replicated at all sites? This trivializes
   query processing and simplifies transaction management if you were
   already assuming that some replication was needed.

2. Is full transparency to be supported? Assuming that a relational
   model is being used, a relation may be partitioned by tuples or by
   columns into units called fragments. Each fragment is accompanied by
   an expression called a qualification that describes its contents.
   Most vendors/application developers believe that fragmenting into
   sets of tuples is more important than fragmenting by columns.
   Nonetheless, doing this in a general sense requires that you
   correctly manipulate the qualifications to minimize the amount of
   data being operated on. Simple solutions exist, but these are about
   as elegant as scanning an entire DB when one tuple is needed.

3. (Related to 2) After fragmenting your relations according to
   anticipated queries and transactions, you must now assign copies to
   appropriate sites. In general this can be a monstrous operations
   research problem, but if the DB software is expected to get
   substantial use, the application developers must be able to know how
   a particular assignment of copies to sites might be beneficial.
   The big boys cannot afford to move around copies frequently just to
   "get it right".

4. Query processing. If all transactions and queries operate on a
   handful of tuples this may not be an issue. If large sets of tuples
   are to be manipulated (as might occur even in SQL), it may be
   CRITICAL to use semijoin reduction techniques. Even though a lot of
   theory exists on this, at least some heuristic application of the
   idea is needed. They should look at all the options that the IBM R*
   prototype evaluates. It's time-consuming, but worthwhile.

5. Transaction management. Are they even doing the centralized recovery
   in a 1990's kind of way? All of the big boys use a variation of an
   undo/redo technique that does not need a write-through of the
   massive cache they are using with the disks. If they are using a
   distributed two-phase commit, how are they handling deadlocks? By
   building a path-pushing cycle detector? By timeouts? It will not
   take care of itself, but for ordinary systems there should be a
   straightforward solution (it may not "bite them on the rear").

   I don't recall if you mentioned it, but are they at least starting
   from the source code of somebody's centralized DBMS? Building that
   is likely to be a 1-2 year project by itself.

6. Are they using a relational model? If they are using a navigational
   model (Codasyl or IMS-like), they will get clobbered by network
   latency if they are going any distance for long transactions. These
   are workable on a lightly-loaded LAN, but long-haul speed is just
   not that of disk-to-cpu.

------------------
--
Bruce G. Barnett        barnett@crd.ge.com        uunet!crdgw1!barnett
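
For readers who have not met the semijoin reduction mentioned in
point 4 above, here is a minimal sketch of the idea in Python. It is
not taken from any of the replies or from R*. The two "sites", the
relation names, and the data are all hypothetical, and the network is
simulated by ordinary function calls. The point is only that site A
ships its distinct join keys to site B, and B ships back just the
matching tuples instead of its whole relation.

```python
# Hypothetical two-site example; names and data are illustrative only.

def project_join_keys(rows, key):
    """Distinct join-key values of a relation (the small message shipped A -> B)."""
    return {row[key] for row in rows}

def semijoin(rows, key, keys_from_other_site):
    """Keep only the tuples whose join key also appears at the other site."""
    return [row for row in rows if row[key] in keys_from_other_site]

def join(left, right, key):
    """Plain hash join, run at one site once the reduced relation has arrived."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Site A holds a small relation, site B a large one.
orders_at_a = [{"cust": 1, "item": "disk"}, {"cust": 3, "item": "tape"}]
customers_at_b = [{"cust": c, "city": "city%d" % c} for c in range(1, 1001)]

keys = project_join_keys(orders_at_a, "cust")     # shipped from A to B
reduced = semijoin(customers_at_b, "cust", keys)  # shipped from B back to A
result = join(orders_at_a, reduced, "cust")       # final join at site A

print(len(customers_at_b), "tuples at B,", len(reduced), "shipped after reduction")
```

Whether the extra round trip pays off depends on how selective the
join is and on what the link costs, which is the trade-off a
distributed query optimizer has to weigh.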