Xref: utzoo comp.sys.mac:19900 comp.databases:1338 comp.sys.mac.programmer:2265
Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!ames!lll-tis!daitc!jkrueger
From: jkrueger@daitc.daitc.mil (Jonathan Krueger)
Newsgroups: comp.sys.mac,comp.databases,comp.sys.mac.programmer
Subject: Re: Databases: distributed vs. monolithic file structure (was Re: FoxBase)
Message-ID: <172@daitc.daitc.mil>
Date: 1 Sep 88 21:44:15 GMT
References: <6178@dasys1.UUCP>
Reply-To: jkrueger@daitc.daitc.mil.UUCP (Jonathan Krueger)
Organization: Defense Applied Information Technology Center, Alexandria VA
Lines: 124

In article <6178@dasys1.UUCP> alexis@dasys1.UUCP (Alexis Rosen) writes:
>If your DBMS has to go through its own file-management code as well as
>the OS's, it will always be slower than if it only needed to go
>through the OS.

This is incorrect.  The cost of operations like (create, destroy,
find, append, delete, replace) is a function of the data structure
used, not where the code resides.  If you implement a relational
database by putting each table in a file, you use the data structure
defined by the filesystem.  If you put all tables into one large
file, you define your own data structure.  Either way, the data
structure will optimize for certain operations and against others.

>In the first case, to write file A the DBMS must first determine where
>in the logical data file (table) the data goes. Then it must find the
>location of the table in the physical database file. Then it can tell
>the OS to find that sector on the disk and update it. With a
>distributed structure, the middle step doesn't exist.

This is also incorrect.

                One Large File       One File per Table
                ==============       ==================
find table      seek through file    open file
find row        seek through file    seek through file
update row      write location       write location

>There is a much bigger performance gain for distributed structures in
>a multiple-machine or multiple-hard-disk environment.
>There is a very
>large file with several indices associated with it. Many people use
>this file, and some of those use the indices. Even with a very fast
>disk, access is slow, and the file is too large for a cache to provide
>significant help. The solution is to put the indices on a separate
>hard disk (same server). This results in immense speed improvements.

The assumption seems to be that you can't do this with a single big
file.  People do this all the time: it's called disk striping.
Breaking data into smaller files has little to do with spreading data
across multiple spindles.

>If you have a group of files which are generally used for look-up
>information, and usually not written to, you can put them all, along
>with their indices, on a RAMdisk.

Or you can use your RAM and your time more profitably: disk caching,
virtual memory, code profiling, and dare we say it, database design.

>There are many nodes on a LAN all accessing a database, doing
>fundamentally different jobs with the same data. If all of the files
>(tables) in the database are left on the server, you have an
>especially bad bottleneck problem if any node other than the server
>wants to do some large-scale data manipulation. In particular, on
>slower [paths] With a monolithic database you are S.O.L.

This is incorrect.  Again, you can distribute pieces of a file to
multiple nodes just as you can spread it across multiple disks.

>but with a distributed structure you can just send the relevant fields
>(columns) of the relevant records (rows) to a hard disk local to the
>node which needs to manipulate the data. When the task is finished,
>the data is reloaded into the main file(s).

If you can't update tables without regard to physical location, it's
not a distributed structure.  If you trade transparent access for
application-specific speed, you're behind on the deal.
For instance:

>Of course there are important logistics to consider, such as how to
>lock out access to data which is temporarily invalid because it is
>being updated privately by that node

If you have to do your own record locking, hardcode data location into
applications, and retune performance for every new disk, why use a
database manager at all?

>but often this is not a problem at all. Even when it is, it's better
>that not being able to do the job at all.

Wrong twice.  It's better to get a slow answer than a wrong one.  And
in no sense is one prevented from doing the job: adequate tools exist.

>This technique...can lighten the load on your network considerably.

This depends entirely on how well you predict which pieces of data
will be needed where.  If you guess poorly, it will increase the load.
Again, this has nothing to do with whether you partition your data by
table or by some other unit.  For instance, you could keep one table
per file and split each file among multiple nodes.

>There is one other very important reason to use a distributed
>structure that comes to mind. Any monolithic structure will impose
>arbitrary restrictions on the number of data files or fields (tables
>and columns) allowed in the database.

This is incorrect.  Again, the data structure determines whether you
can implement fixed or flexible field sizes, field or row width
limits, and restrictions on the number of fields or tables.  Consider
a monolithic tree.

>If you have a very large database, it may not fit on one physical
>disk, and with the monolithic structure you are limited (generally) to
>one device. With the distributed structure, these limitations just go
>away.

This is incorrect.  Several commercially available operating systems
support disk striping, bound volume sets, and the like.  For those
that don't, the limitations don't "just go away": what happens when a
single table must grow larger than the disk?
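To make the striping point concrete, here is a minimal sketch of how a
byte offset in one large logical file might map onto several disks.
It assumes simple round-robin striping; the function name `locate` and
its parameters are hypothetical, invented for illustration, and do not
describe any particular operating system's implementation.

```python
def locate(logical_offset, stripe_size, num_disks):
    """Map a byte offset in one logical file to (disk, offset on that disk).

    Assumes round-robin striping: stripe 0 goes to disk 0, stripe 1 to
    disk 1, and so on, wrapping around after the last disk.
    """
    # Which stripe the byte falls in, and how far into that stripe it is.
    stripe_index, within_stripe = divmod(logical_offset, stripe_size)
    # Stripes are dealt out to the disks in turn.
    disk = stripe_index % num_disks
    # Each disk stores every num_disks-th stripe back to back.
    offset_on_disk = (stripe_index // num_disks) * stripe_size + within_stripe
    return disk, offset_on_disk


if __name__ == "__main__":
    # Walk a few offsets of a logical file striped over two disks.
    for offset in (0, 4096, 8202):
        print(offset, locate(offset, stripe_size=4096, num_disks=2))
```

Nothing in this mapping depends on whether the logical file holds one
table or fifty, and the logical file can be as large as the disks
combined.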
No, the reason why it's convenient to put each table into a file is
that we have a lot of tools that act on files.  It's good software
engineering to use them on tables.  For instance, the directory
listing program usually provides file size; in the monolithic
structure, that functionality has to be provided elsewhere.  For
another instance, the backup/restore utility knows how to restore
files.  For the monolithic structure, that complicates its ability to
recover from disasters.

-- Jon
--
Jonathan Krueger  uunet!daitc!jkrueger  jkrueger@daitc.arpa  (703) 998-4777
Inspected by: No. 15