Xref: utzoo comp.sys.mac:19900 comp.databases:1338 comp.sys.mac.programmer:2265
Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!ames!lll-tis!daitc!jkrueger
From: jkrueger@daitc.daitc.mil (Jonathan Krueger)
Newsgroups: comp.sys.mac,comp.databases,comp.sys.mac.programmer
Subject: Re: Databases: distributed vs. monolithic file structure (was Re: FoxBase)
Message-ID: <172@daitc.daitc.mil>
Date: 1 Sep 88 21:44:15 GMT
References: <6178@dasys1.UUCP>
Reply-To: jkrueger@daitc.daitc.mil.UUCP (Jonathan Krueger)
Organization: Defense Applied Information Technology Center, Alexandria VA
Lines: 124

In article <6178@dasys1.UUCP> alexis@dasys1.UUCP (Alexis Rosen) writes:

>If your DBMS has to go through its own file-management code as well as
>the OS's, it will always be slower than if it only needed to go
>through the OS.

This is incorrect.  The cost of operations like (create, destroy,
find, append, delete, replace) is a function of the data structure
used, not where the code resides.  If you implement a relational
database by putting each table in a file, you use the data structure
defined by the filesystem.  If you put all tables into one
large file, you define your own data structure.  Either way, the data
structure will optimize for certain operations and against others.


>In the first case, to write file A the DBMS must first determine where
>in the logical data file (table) the data goes. Then it must find the
>location of the table in the physical database file. Then it can tell
>the OS to find that sector on the disk and update it. With a
>distributed structure, the middle step doesn't exist.

This is also incorrect.

		One Large File		One File per Table
		==============		==================
find table	seek through file	open file
find row	seek through file	seek through file
update row	write location		write location
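
To make the left-hand column concrete, here is a minimal sketch of a
one-large-file layout.  Everything in it is hypothetical: the class name
MonolithicDB, the fixed-width row format, and the in-memory catalog that
maps a table name to its offset.  The catalog is my assumption about how
a DBMS might avoid scanning; it plays the same role the directory plays
when the OS opens a file.  It is not a description of any particular
product.

```python
import io
import struct

# Hypothetical fixed-width row: a 4-byte id and a 16-byte name field.
RECORD = struct.Struct("<i16s")


class MonolithicDB:
    """All tables share one byte stream; a catalog maps name -> (offset, rows)."""

    def __init__(self):
        self.store = io.BytesIO()   # stands in for the single large file
        self.catalog = {}           # table name -> (byte offset, row count)

    def create_table(self, name, rows):
        offset = self.store.seek(0, io.SEEK_END)   # append at end of file
        for row_id, text in rows:
            self.store.write(RECORD.pack(row_id, text.encode()))
        self.catalog[name] = (offset, len(rows))

    def _seek_row(self, table, index):
        offset, count = self.catalog[table]        # find table: one lookup
        if not 0 <= index < count:
            raise IndexError(index)
        self.store.seek(offset + index * RECORD.size)  # find row: one seek

    def update_row(self, table, index, row_id, text):
        self._seek_row(table, index)
        self.store.write(RECORD.pack(row_id, text.encode()))  # write location

    def read_row(self, table, index):
        self._seek_row(table, index)
        row_id, raw = RECORD.unpack(self.store.read(RECORD.size))
        return row_id, raw.rstrip(b"\0").decode()
```

The point of the sketch is only that "find table" costs one lookup in
either layout; whether the DBMS or the filesystem performs it does not
change the number of steps.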

>There is a much bigger performance gain for distributed structures in
>a multiple-machine or multiple-hard-disk environment.  There is a very
>large file with several indices associated with it. Many people use
>this file, and some of those use the indices. Even with a very fast
>disk, access is slow, and the file is too large for a cache to provide
>significant help. The solution is to put the indices on a separate
>hard disk (same server).  This results in immense speed improvements.

The assumption seems to be that you can't do this with a single big
file.  People do this all the time: it's called disk striping.
Breaking data into smaller files has little to do with spreading data
across multiple spindles.
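
As a rough illustration of striping, the function below maps a byte
offset in one logical file to a disk and an offset on that disk, using
simple round-robin placement.  The name stripe_location and its
parameters are made up for this sketch; real striping implementations
differ in detail.

```python
def stripe_location(offset, stripe_size, num_disks):
    """Map a logical byte offset to (disk index, byte offset on that disk).

    Assumes round-robin striping: stripe 0 goes to disk 0, stripe 1 to
    disk 1, and so on, wrapping around after num_disks stripes.
    """
    stripe = offset // stripe_size             # which stripe holds the byte
    disk = stripe % num_disks                  # which spindle holds that stripe
    disk_offset = (stripe // num_disks) * stripe_size + offset % stripe_size
    return disk, disk_offset
```

Nothing in the mapping cares whether the logical file holds one table or
all of them.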

>If you have a group of files which are generally used for look-up
>information, and usually not written to, you can put them all, along
>with their indices, on a RAMdisk.

Or you can use your RAM and your time more profitably: disk caching,
virtual memory, code profiling, and dare we say it, database design.

>There are many nodes on a LAN all accessing a database, doing
>fundamentally different jobs with the same data. If all of the files
>(tables) in the database are left on the server, you have an
>especially bad bottleneck problem if any node other than the server
>wants to do some large-scale data manipulation. In particular, on
>slower [paths] With a monolithic database you are S.O.L.

This is incorrect.  Again, you can distribute pieces of a file to
multiple nodes just as you can spread it across multiple disks.

>but with a distributed structure you can just send the relevant fields
>(columns) of the relevant records (rows) to a hard disk local to the
>node which needs to manipulate the data. When the task is finished,
>the data is reloaded into the main file(s).

If you can't update tables without regard to physical location, it's
not a distributed structure.  If you trade transparent access for
application-specific speed, you're behind on the deal.  For instance:

>Of course there are important logistics to consider, such as how to
>lock out access to data which is temporarily invalid because it is
>being updated privately by that node

If you have to do your own record locking, hardcode data location
into applications, and retune performance for every new disk, why use
a database manager at all?

>but often this is not a problem at all. Even when it is, it's better
>that not being able to do the job at all.

Wrong twice.  It's better to get a slow answer than a wrong one.  And
in no sense is one prevented from doing the job: adequate tools exist.

>This technique...can lighten the load on your network considerably.

This depends entirely on how well you predict which pieces of data
will be needed where.  If you guess poorly, it will increase the load.
Again, this has nothing to do with whether you partition your data by
table or other unit.  For instance, you could keep one table per file
and split each file among multiple nodes.

>There is one other very important reason to use a distributed
>structure that comes to mind. Any monolithic structure will impose
>arbitrary restrictions on the number of data files or fields (tables
>and columns) allowed in the database.

This is incorrect.  Again, the data structure determines whether you
can implement fixed or flexible field sizes, field or row width
limits, restrictions on number of fields or tables.  Consider a
monolithic tree.
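
One way to picture such a structure: a single ordered store keyed by
(table, row, field), so that adding a table or a field is just adding
keys.  The sketch below is an assumption-laden stand-in, not a real
tree: it uses a sorted list with binary search where a real system
would more likely use something like a B-tree, and the class name
OrderedStore is invented for the example.

```python
import bisect


class OrderedStore:
    """One ordered key space for every table, row, and field."""

    def __init__(self):
        self.keys = []     # sorted (table, row, field) tuples
        self.values = []   # values[i] belongs to keys[i]

    def put(self, table, row, field, value):
        key = (table, row, field)
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value            # replace an existing field
        else:
            self.keys.insert(i, key)          # new field, row, or table
            self.values.insert(i, value)

    def row(self, table, row):
        """Collect all fields stored for one row as a dict."""
        i = bisect.bisect_left(self.keys, (table, row))
        result = {}
        while i < len(self.keys) and self.keys[i][:2] == (table, row):
            result[self.keys[i][2]] = self.values[i]
            i += 1
        return result

    def tables(self):
        """List the distinct table names currently present."""
        return sorted({key[0] for key in self.keys})
```

No table count or field count is fixed anywhere in the structure; the
only limit is the space available to the store.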

>If you have a very large database, it may not fit on one physical
>disk, and with the monolithic structure you are limited (generally) to
>one device. With the distributed structure, these limitations just go
>away.

This is incorrect.  Several commercially available operating systems
support disk striping, bound volume sets, and the like.  For those
that don't, the limitations don't "just go away": what happens when a
single table must grow larger than the disk?

No, the reason why it's convenient to put each table into a file is
that we have a lot of tools that act on files.  It's good software
engineering to use them on tables.  For instance, the directory
listing program usually provides file size; in the monolithic
structure, that functionality has to be provided elsewhere.  For
another instance, the backup/restore utility knows how to restore
files.  With the monolithic structure, the utility sees only one large
file, which complicates recovery from disasters.

-- Jon
-- 
Jonathan Krueger  uunet!daitc!jkrueger  jkrueger@daitc.arpa  (703) 998-4777

Inspected by: No. 15