Path: utzoo!attcan!uunet!dasys1!alexis
From: alexis@dasys1.UUCP (Alexis Rosen)
Newsgroups: comp.databases
Subject: Re: Databases: separate-file vs. monolithic file structure
Message-ID: <6299@dasys1.UUCP>
Date: 7 Sep 88 09:16:34 GMT
Lines: 159


Recently, Tim Wood wrote a response to my article on database file structures
that was, I feel, much more on-target than the other response by Jon Krueger.
He missed the same thing as Jon, though- that I was really addressing smaller
machines. At any rate, I specifically excluded DBMSs which did raw I/O from my
analysis.

BTW, the raw I/O stuff looks like distributed structure to me. In fact, the 
real breakdown, I guess, is between products that layer two levels of file 
access on top of each other vs. the ones that only have one layer (i.e. they 
either use the native file system for each of their logical files, or do raw 
I/O and use their own file system).

In article <861@sybase.sybase.com>, Tim Wood (tim@linus.sybase.com) writes:
>In article <6178@dasys1.UUCP> mtxinu!uunet!dasys1!alexis writes:
>>There has been some discussion recently in comp.sys.mac concerning the
>>relative merits of two different methods of storing database files: the
>>monolithic(all-in-one-file) way and the distributed 
>(files-all-over-the-place) >method. [etc.]
>>The benefits of monolithic structure are few.
>
>It's dangerous to lead with your conclusion.  Your position is not
>100% false or true.  Mono vs. separate file performance is very
>dependent on the underlying software platform (i.e. OS) and
>its implementation.

Very true. But note my previous comments- for the general PC market, which is 
(right now) MS-DOS and MacOS, my original comclusions are true...

>                    I object to the use of "distributed" also;
>this term as used with DBMS's refers to a database that is
>physically fragmented across > 1 machine, but allows the
>user transparent access to it, i.e. access not requiring knowlege of
>or even showing the physical dispersion.

Also true. I was aware of this use of the term, which is why I was careful to 
describe exactly what I meant. Still, if you've got a clearer term, I'd be glad 
to adopt it. I'm not sure that just 'separate' fits the bill...

>I should mention I am from the POV of a transaction-processing
>relational DBMS vendor.  Many users, much traffic, much multi-user
>contention.  Issues are different from the PC/Mac DBMS ones.

I wish they weren't... I'd give an arm and a leg for a good fast transaction 
processing DBMS for the Mac. FoxBase has some promise, but it ain't there yet, 
not by a long shot. So when will Sybase run under Mac's OS, Tim?

> [Makes a good point about invulnerability of large-system DBMSs which has,
>  unfortunately for me, little to do with PCs]
>
>>Good luck
>>finding out what got damaged- it may take you weeks to find it,
>
>Unless your system is properly instrumented, with automatic checkpointng,
>and properly managed with regular backups-

Again, I just wish this kind of thing were available on smaller systems. I have 
a Mac II with 5 MB of RAM and 300 MB of disk. This is considerably more 
horsepower than a VAX 11/750 which I used to use. Why aren't such tools 
available?  The same could be said for the 25MHz '386 systems out there.

>>The other overriding reason to use a distributed structure is performance. If
>>your DBMS has to go through its own file-management code as well as the OS's,
>>it will always be slower than if it only needed to go through the OS.
>
>Unless you don't go thru the OS code.  Know what a "raw disk" (in UNIX)
>or a "foreign mounted" disk (in VMS) is/are?
> [followed by fairly good discussion of why raw I/O is Good.]

I specifically mentioned that raw I/O was excepted from my analysis.

>>There is a much bigger performance gain for distributed structures in a
>>multiple-machine or multiple-hard-disk environment. ...
>>	[ discussion of multiple-disk interleaving, RAM-disks,
>>	  arm-waving around the problem of multi-user access management:
>>...Of course there are
>>important logistics to consider, such as how to lock out access to data which
>>is temporarily invalid because it is being updated privately by that node,
>>but often this is not a problem at all. Even when it is, it's better that not
>>being able to do the job at all. ]
>
>Unless you don't mind inconsistent transactions and corrupt data.

Since Jon misunderstood this remark also, I guess I was very unclear here. 
What I was saying was that in many cases, with canned applications, you have 
knowledge that certain tables will never be written to except by one user 
(exceept on weekends, on Feb. 29, and during a Solar eclipse. Whatever...)
In this case, you need not establish a lock and can just do whateer you like 
without worries. This is what I meant by it being 'not a problem'.

>As for the disk strategies, they're fine.  I should mention that a
>mono organization should not prohibit you from using other disk
>devices (or files).  [etc.]

Well, it's easy enough to do disk striping, but I'm not convinced that the DBMS 
is always going to be smarter than I am. What about prioritization? I want 
certain users of table A, who don't need indices I, J, K, and L, to have very 
fast response time. Other users of A, who need I J K and L, can go hang because 
they aren't upper management. :-) In this case I want to put A on one spindle 
and everything else on another. I am not saying that it's impossible to write a 
DBMS which can handle things like this. I am saying that that would involve a 
fair amount of AI, and I haven't seen anything like it yet.

>>There is one other very important reason to use a distributed structure that
>>comes to mind. Any monolithic structure will impose arbitrary restrictions on
>>the number of data files or fields [etc.]
>
>Continued simplistic tone.  OS's are software.  DBMS's are software.
>Why can't a DBMS be written with the same liberal (or practically nonexistent)
>limits that the magical OS managing the separate-file structure has?

Very true. I got carried away bitching about the current state of the art in
PCs. This is, of course, not necessarily the case with all systems. My mistake.

I wonder, though, how many large DMBSs do have (virtually) unbounded file size?
Surely Sybase does (;-) but what of the others? Years ago I used a DBMS on a
VAX which had precisely this problem. Don't remember which one it was, though.

>>....For example, one of the big DBMSs
>>that runs under UNIX (RTI Ingres? Oracle?) bypasses the file system to write
>>directly to the disk (let's skip the technical details...). While this is
>>like the monolithic system in some ways, it still allows for some of the
>>benefits of the distributed structure. UNIX wizards may have a lot to say 
>>about this...
>
>That's us (at least).  [Toots own horn, has probably earned that right]
>Really the issue here is OS independence, not how many files you use.
>Our experience is that we know best how to deploy system resources
>to run our software, not the writers of the host operating system,
>fine as that OS may be for other purposes.
>
>-TW

Yes. As I wrote at the beginning of this article, raw I/O fits my 'distributed' 
model more closely than it does the 'monolithic' model. Excluding this, though, 
does anyone think that mono structures have any big advantages?

Anyway I am glad that you have taken the time to write the perfect DBMS 
back-end. Now all you need to do is sell it for Macs and PCs (not under unix, 
either) and I'll be very happy.

---
Whew. Now, let me ask a question without making any sweeping statements:

One point we all missed is the possibility of combining multiple tightly- 
related tables on the same physical portion of the disk. The speed advantage 
here might be worth a mono structure (but I would guess not). Some SQLs allow 
you to create "clusters". Now, what exactly is the DBMS writing to the disk? 
The only way I can think of for doing this is to actually store a join of the 
two files. This could chew up an enourmous amount of disk space. What is it 
doing?


----
Alexis Rosen                       {allegra,philabs,cmcl2}!phri\
Writing from                                {harpo,cmcl2}!cucard!dasys1!alexis
The Big Electric Cat                  {portal,well,sun}!hoptoad/
Public UNIX                         if mail fails: ...cmcl2!cucard!cunixc!abr1
Best path: uunet!dasys1!alexis