Xref: utzoo comp.sys.mac:19865 comp.databases:1329 comp.sys.mac.programmer:2257 Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!nrl-cmf!cmcl2!phri!cooper!dasys1!alexis From: alexis@dasys1.UUCP (Alexis Rosen) Newsgroups: comp.sys.mac,comp.databases,comp.sys.mac.programmer Subject: Databases: distributed vs. monolithic file structure (was Re: FoxBase) Message-ID: <6178@dasys1.UUCP> Date: 31 Aug 88 08:57:41 GMT Lines: 139 There has been some discussion recently in comp.sys.mac concerning the relative merits of two different methods of storing database files: the monolithic (all-in-one-file) way and the distributed (files-all-over-the-place) method. Since this issue is very important (to those of us who use databases), and apparently not too well understood, I will try to explain why the distributed method is far superior to the monolithic method. The benefits of monolithic structure are few. First, you don't have to look at all those files filling up your disk. This is especially important to some consultants who want to present a neat appearance to their clients (no joke- I've seen people choose not to buy a product for this reason alone). Also, there is less chance of losing one file and not noticing it. I do not know of any other reasons for using the monolithic structure. Both of the reasons listed above are purely aesthetic. As far as I'm concerned, just put everything in one folder (subdirectory) and there's no more problem. Some of the monolithic-structure databases are: Mac- Omnis 3+, Helix, Fourth Dimension (4D) PC- R:Base The benefits to a distributed structure are far more concrete and important. First of all, in today's imperfect (i.e., non-fault-tolerant) world, keeping all your data files (relations, tables) and indexes in separate files isolates any corrupt structures. If you lose power while creating an index on file A, that's what you have to fix: the index, and only the index. With the Monolithic structure, it is possible (or guaranteed, depending on your particular choice of DBMS) that you will corrupt some other portion of your database. Good luck finding out what got damaged- it may take you weeks to find it, by which time all your backups may have the damage as well. In rare cases you could trash your entire database. Fortunately, most DBMSs today are more robust than that. The danger, however, never goes away entirely. This reason alone should convince most people. The other overriding reason to use a distributed structure is performance. If your DBMS has to go through its own file-management code as well as the OS's, it will always be slower than if it only needed to go through the OS. In the first case, to write file A the DBMS must first determine where in the logical data file (table) the data goes. Then it must find the location of the table in the physical database file. Then it can tell the OS to find that sector on the disk and update it. With a distributed structure, the middle step doesn't exist. This time savings is not immense if the DBMS is very well-written, but the vast majority are not (at least when it comes to speed optimization). There is a much bigger performance gain for distributed structures in a multiple-machine or multiple-hard-disk environment. While this is a very complex subject, a few examples will make my point. All of these techniques are used in systems I have built and they can produce gains of one hundred to ten-thousand percent! Case 1: There is a very large file with several indices associated with it. Many people use this file, and some of those use the indices. Even with a very fast disk, access is slow, and the file is too large for a cache to provide significant help. The solution is to put the indices on a separate hard disk (same server). This results in immense speed improvements. The reasons are as follows: 1) The CPU is inherently faster than the disk. The bottleneck is in getting data off of the disk. By pulling information from two disks at once, data can be retrieved twice as fast. 2) Even more significant is the fact that the disk with the data file is doing far fewer seeks than before. Seeking is a disproportionately slow operation for computers. The result is a massive performance gain for those using the indices. Those who are not using the indices (and are generally going through the datafile sequentially) will gain even more. There are some situations where you can do even better that that. If you have a group of files which are generally used for look-up information, and usually not written to, you can put them all, along with their indices, on a RAMdisk. In one extreme case this resulted in a program performing *100 TIMES* better than it had before. There are all sorts of creative ways to distribute your files for maximum gain. Because every situation is different, it is impossible to tell just how much faster things will be, but I've seen improvements from a minimum of 100%, to a more average 300%, up to 10,000% (!) in extreme cases. Case 2: There are many nodes on a LAN all accessing a database, doing fundamentally different jobs with the same data. If all of the files (tables) in the database are left on the server, you have an especially bad bottleneck problem if any node other than the server wants to do some large-scale data manipulation. In particular, on slower networks like AppleTalk or StarLAN, or (even worse) on leased-line dial-in access, some tasks become impossible, either because they will bring the network to its knees or because they cannot be completed before it's time to start them again (i.e. a task which must be performed every hour taking 90 minutes to complete). With a monolithic database you are S.O.L., but with a distributed structure you can just send the relevant fields (columns) of the relevant records (rows) to a hard disk local to the node which needs to manipulate the data. When the task is finished, the data is reloaded into the main file(s). Of course there are important logistics to consider, such as how to lock out access to data which is temporarily invalid because it is being updated privately by that node, but often this is not a problem at all. Even when it is, it's better that not being able to do the job at all. This technique, which is actually a form of coarse-grained distributed processing, can make possible tasks that are simply beyond the powers of a monolithic system (in one case I had to deal with), and can speed up other jobs by tens or hundreds of times. Just as importantly, it can lighten the load on your network considerably. There is one other very important reason to use a distributed structure that comes to mind. Any monolithic structure will impose arbitrary restrictions on the number of data files or fields (tables and columns) allowed in the database. Sometimes these restrictions are very severe. Furthermore, if you have a very large database, it may not fit on one physical disk, and with the monolithic structure you are limited (generally) to one device. With the distributed structure, these limitations just go away. Some of the distributed-structure databases are: Mac- FoxBase+/Mac, McMax PC- dBase III+, PowerBase, FoxBase, Clipper, dbXL, etc. UNIX- Most systems, I think (it's been a long time...) ----- There are many more things to consider, but I've tried to hit the most important ones. There are also exceptions. For example, one of the big DBMSs that runs under UNIX (RTI Ingres? Oracle?) bypasses the file system to write directly to the disk (let's skip the technical details...). While this is like the monolithic system in some ways, it still allows for some of the benefits of the distributed structure. UNIX wizards may have a lot to say about this... If you think I've missed any major points here, or that I've short-changed the monolithic-style DBMSs, please let me know why, and give an example. PS to Mac people- this is another good reason to like FoxBase... The only other Mac DMBS that I'm aware of which uses the distributed file structure is McMax, which is just a failed FoxBase. ---- Alexis Rosen {allegra,philabs,cmcl2}!phri\ Writing from {harpo,cmcl2}!cucard!dasys1!alexis The Big Electric Cat {portal,well,sun}!hoptoad/ Public UNIX if mail fails: ...cmcl2!cucard!cunixc!abr1 Best path: uunet!dasys1!alexis