Xref: utzoo comp.sys.mac:19865 comp.databases:1329 comp.sys.mac.programmer:2257
Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!nrl-cmf!cmcl2!phri!cooper!dasys1!alexis
From: alexis@dasys1.UUCP (Alexis Rosen)
Newsgroups: comp.sys.mac,comp.databases,comp.sys.mac.programmer
Subject: Databases: distributed vs. monolithic file structure (was Re: FoxBase)
Message-ID: <6178@dasys1.UUCP>
Date: 31 Aug 88 08:57:41 GMT
Lines: 139


There has been some discussion recently in comp.sys.mac concerning the relative 
merits of two different methods of storing database files: the monolithic 
(all-in-one-file) way and the distributed (files-all-over-the-place) method. 
Since this issue is very important (to those of us who use databases), and 
apparently not too well understood, I will try to explain why the distributed 
method is far superior to the monolithic method.

The benefits of monolithic structure are few. First, you don't have to look at 
all those files filling up your disk. This is especially important to some 
consultants who want to present a neat appearance to their clients (no joke- 
I've seen people choose not to buy a product for this reason alone). Also, 
there is less chance of losing one file and not noticing it. I do not know of 
any other reasons for using the monolithic structure. Both of the reasons 
listed above are purely aesthetic. As far as I'm concerned, just put everything 
in one folder (subdirectory) and there's no more problem.

Some of the monolithic-structure databases are:
Mac- Omnis 3+, Helix, Fourth Dimension (4D)
PC- R:Base

The benefits to a distributed structure are far more concrete and important. 
First of all, in today's imperfect (i.e., non-fault-tolerant) world, keeping 
all your data files (relations, tables) and indexes in separate files isolates 
any corrupt structures. If you lose power while creating an index on file A, 
that's what you have to fix: the index, and only the index. With the Monolithic 
structure, it is possible (or guaranteed, depending on your particular choice 
of DBMS) that you will corrupt some other portion of your database. Good luck 
finding out what got damaged- it may take you weeks to find it, by which time 
all your backups may have the damage as well. In rare cases you could trash 
your entire database. Fortunately, most DBMSs today are more robust than that. 
The danger, however, never goes away entirely. This reason alone should 
convince most people.

The other overriding reason to use a distributed structure is performance. If 
your DBMS has to go through its own file-management code as well as the OS's, 
it will always be slower than if it only needed to go through the OS. In the 
first case, to write file A the DBMS must first determine where in the logical 
data file (table) the data goes. Then it must find the location of the table in 
the physical database file. Then it can tell the OS to find that sector on the 
disk and update it. With a distributed structure, the middle step doesn't 
exist. This time savings is not immense if the DBMS is very well-written, but 
the vast majority are not (at least when it comes to speed optimization).

There is a much bigger performance gain for distributed structures in a 
multiple-machine or multiple-hard-disk environment. While this is a very 
complex subject, a few examples will make my point. All of these techniques are 
used in systems I have built and they can produce gains of one hundred to 
ten-thousand percent!

Case 1:
There is a very large file with several indices associated with it. Many people 
use this file, and some of those use the indices. Even with a very fast disk, 
access is slow, and the file is too large for a cache to provide significant 
help. The solution is to put the indices on a separate hard disk (same server). 
This results in immense speed improvements. The reasons are as follows:
 1) The CPU is inherently faster than the disk. The bottleneck is in getting
    data off of the disk. By pulling information from two disks at once, data
    can be retrieved twice as fast.
 2) Even more significant is the fact that the disk with the data file is
    doing far fewer seeks than before. Seeking is a disproportionately slow
    operation for computers. The result is a massive performance gain for
    those using the indices. Those who are not using the indices (and are
    generally going through the datafile sequentially) will gain even more.

There are some situations where you can do even better that that. If you have a 
group of files which are generally used for look-up information, and usually 
not written to, you can put them all, along with their indices, on a RAMdisk. 
In one extreme case this resulted in a program performing *100 TIMES* better 
than it had before.

There are all sorts of creative ways to distribute your files for maximum gain. 
Because every situation is different, it is impossible to tell just how much 
faster things will be, but I've seen improvements from a minimum of 100%, to a 
more average 300%, up to 10,000% (!) in extreme cases.


Case 2:
There are many nodes on a LAN all accessing a database, doing fundamentally 
different jobs with the same data. If all of the files (tables) in the database 
are left on the server, you have an especially bad bottleneck problem if any 
node other than the server wants to do some large-scale data manipulation. In 
particular, on slower networks like AppleTalk or StarLAN, or (even worse) on 
leased-line dial-in access, some tasks become impossible, either because they 
will bring the network to its knees or because they cannot be completed before 
it's time to start them again (i.e. a task which must be performed every hour 
taking 90 minutes to complete).

With a monolithic database you are S.O.L., but with a distributed structure you 
can just send the relevant fields (columns) of the relevant records (rows) to a 
hard disk local to the node which needs to manipulate the data. When the task 
is finished, the data is reloaded into the main file(s). Of course there are 
important logistics to consider, such as how to lock out access to data which 
is temporarily invalid because it is being updated privately by that node, but 
often this is not a problem at all. Even when it is, it's better that not being 
able to do the job at all.

This technique, which is actually a form of coarse-grained distributed 
processing, can make possible tasks that are simply beyond the powers of a 
monolithic system (in one case I had to deal with), and can speed up other jobs 
by tens or hundreds of times. Just as importantly, it can lighten the load on 
your network considerably.


There is one other very important reason to use a distributed structure that 
comes to mind. Any monolithic structure will impose arbitrary restrictions on 
the number of data files or fields (tables and columns) allowed in the 
database. Sometimes these restrictions are very severe. Furthermore, if you 
have a very large database, it may not fit on one physical disk, and with the 
monolithic structure you are limited (generally) to one device. With the 
distributed structure, these limitations just go away.

Some of the distributed-structure databases are:
Mac- FoxBase+/Mac, McMax
PC- dBase III+, PowerBase, FoxBase, Clipper, dbXL, etc.
UNIX- Most systems, I think (it's been a long time...)

-----

There are many more things to consider, but I've tried to hit the most 
important ones. There are also exceptions. For example, one of the big DBMSs 
that runs under UNIX (RTI Ingres? Oracle?) bypasses the file system to write 
directly to the disk (let's skip the technical details...). While this is like 
the monolithic system in some ways, it still allows for some of the benefits of 
the distributed structure. UNIX wizards may have a lot to say about this...

If you think I've missed any major points here, or that I've short-changed the 
monolithic-style DBMSs, please let me know why, and give an example.

PS to Mac people- this is another good reason to like FoxBase... The only other 
Mac DMBS that I'm aware of which uses the distributed file structure is McMax, 
which is just a failed FoxBase.

----
Alexis Rosen                       {allegra,philabs,cmcl2}!phri\
Writing from                                {harpo,cmcl2}!cucard!dasys1!alexis
The Big Electric Cat                  {portal,well,sun}!hoptoad/
Public UNIX                         if mail fails: ...cmcl2!cucard!cunixc!abr1
Best path: uunet!dasys1!alexis