Path: utzoo!utgpu!attcan!uunet!pyrdc!pyrnj!rutgers!att!ihlpb!nevin1
From: nevin1@ihlpb.ATT.COM (Liber)
Newsgroups: comp.lang.misc
Subject: Re: Text or data files?
Summary: there is some relevance to this newsgroup
Message-ID: <8853@ihlpb.ATT.COM>
Date: 5 Oct 88 00:47:26 GMT
References: <3967@enea.se>
Reply-To: nevin1@ihlpb.UUCP (55528-Liber,N.J.)
Organization: AT&T Bell Laboratories - Naperville, Illinois
Lines: 198

In article <3967@enea.se> sommar@enea.se (Erland Sommarskog) writes:
>I had the example:
>>>    Data_record = RECORD
>>>                     Date : PACKED ARRAY(.1..8.) OF char;
>>>                     Time : PACKED ARRAY(.1..8.) OF char;
>>>                     Incident       : Incident_type;   (* Enumerated *)
>>>                     No_of_warnings : integer;
>>>                     Alarmed        : boolan;
>>>                     Username       : PACKED ARRAY(.1..12.) OF char;
>>>                  END;
>>>
>>>The simplest way to read and write this is to through a FILE OF Data_record,
>>>if no other programs is to read it.

>Marc W. Mengel (mmengel@cuuxb.UUCP) wrote:
>>Two major problems with this idea.  The first is that most of the time
>>other programs will need to read the data sooner or later.  

>If we have data that are to be read by more than one program, two 
>programs can import the declaration of the data record from a common 
>source, and thus they do not need to be rewritten if the format is 
>changed.

This assumes that all the programs are not only run on the same type of
machine and operating system, but that they are written in the same
language using the same compiler (the layout of things like packed
arrays is not only *machine* dependent and *operating system* dependent,
it is *language* dependent, *compiler* dependent, and in some cases even
*optimization* dependent).  This is unnecessarily restrictive, and
typically not practical in commercial environments.

Regardless of whether I use text files or binary files, I would
rather write my own read/write routines (even if they only call the
standard ones) than be dependent on my compiler.
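
As an illustration of what "my own read/write routines" might look like
for a binary file, here is a minimal sketch in C (assuming a C99 compiler
for <stdint.h>).  The names encode_u32_be and decode_u32_be are made up
for this example.  The idea is to fix the byte order in the file format
itself, so the file no longer depends on the machine or compiler that
wrote it.

```c
#include <stdint.h>

/* Store a 32-bit value as four bytes, most significant byte first,
   whatever the byte order of the host machine is. */
void encode_u32_be(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Rebuild the value from four bytes stored in that same fixed order. */
uint32_t decode_u32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

The four bytes would then go through fwrite() and fread() as plain
characters, which the standard routines move without reinterpreting them.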

>Also now we have the
>problem that for one change we have to edit three in places: the read 
>and write routines and the data definition, introducing a source of 
>error.)

But you have gained an interface layer (is it time to throw the
'object-oriented' buzzword around yet? :-))!  Except for the read/write
routines, the rest of the program is independent of the way the data is
stored on disk.  This is a big advantage!  (Note:  this advantage
comes from having the interface layer, not from the type of data file
used.)
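
A rough sketch of such an interface layer in C, modeled on the
Data_record quoted above.  The one-line text layout, the function names,
and the assumption that no field contains whitespace are all choices
made for this example, not anything from the original program.  It
assumes a C99 library for snprintf().

```c
#include <stdio.h>

/* In-memory record, modeled on the Data_record quoted above. */
struct data_record {
    char date[9];          /* 8 characters plus terminating NUL */
    char time[9];          /* 8 characters plus terminating NUL */
    int  incident;         /* enumerated type, stored as its ordinal */
    int  no_of_warnings;
    int  alarmed;          /* 0 or 1 */
    char username[13];     /* 12 characters plus terminating NUL */
};

/* Format one record as a text line into buf.
   Returns 0 on success, -1 if the line did not fit. */
int format_record(char *buf, size_t size, const struct data_record *r)
{
    int n = snprintf(buf, size, "%s %s %d %d %d %s\n",
                     r->date, r->time, r->incident,
                     r->no_of_warnings, r->alarmed, r->username);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Parse one text line back into a record.
   Returns 0 on success, -1 if fewer than six fields were converted. */
int parse_record(const char *line, struct data_record *r)
{
    int n = sscanf(line, "%8s %8s %d %d %d %12s",
                   r->date, r->time, &r->incident,
                   &r->no_of_warnings, &r->alarmed, r->username);
    return n == 6 ? 0 : -1;
}
```

The rest of the program sees only struct data_record.  Switching to a
binary file would mean rewriting these two routines and nothing else.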

Suppose you decide to delete one of the fields stored on disk (because
it can be calculated, for instance), but you want the field available
for the rest of the program.  If you didn't bother to put the interface
layer in, this is a maintenance nightmare.
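
To make that concrete, here is a small sketch in C.  The rule that
"alarmed" can be derived from the warning count, the threshold value,
and the names used are all hypothetical, invented only to show the
shape of the change.

```c
#include <stdio.h>

#define ALARM_THRESHOLD 3   /* hypothetical rule, for illustration only */

struct incident {
    int no_of_warnings;
    int alarmed;            /* still visible to callers, no longer on disk */
};

/* The on-disk line now holds only the warning count.  The alarmed flag
   is recomputed here, so callers keep using r->alarmed as before.
   Returns 0 on success, -1 if the line does not start with a number. */
int parse_incident(const char *line, struct incident *r)
{
    if (sscanf(line, "%d", &r->no_of_warnings) != 1)
        return -1;
    r->alarmed = (r->no_of_warnings >= ALARM_THRESHOLD);
    return 0;
}
```

Only the read routine changed; every other procedure that looks at the
alarmed field is untouched.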

>  If you have many programs that are to read the same data, you are 
>likely to get a database system, and I don't think they store data 
>in a text-file format...

You wouldn't necessarily want a prepackaged DBMS.  There is usually a
lot of overhead associated with a DBMS, and you have to decide
whether it is worth it.

>  The only case when I can see that this argument is valid is when 
>"the other program" is standard a text-oriented utility.   

Well, if you're on a Un*x (sorry about the '*' in place of the 'i',
but Legal is talking about trademark protection again) system, this may
be very desirable.  You can use all your familiar tools (like grep,
sed, etc.) to do many of your manipulations.

>>Second, when
>>files are written in a binary format like this, the same program cannot
>>read the data when run on a different machine with a different byte
>>ordering, so after you have built up a list of 2000 incidents, and have
>>to move to a new machine, you lose big time.  

>A valid point. However, text files are not necessarily compatible either. 
>Imagine that the data record above has a message field, 80 characters 
>long. Assume that the program started its life on VMS and that one of 
>the messages contains a CR-LF. Now we move to a Unix system... And I 
>have seen Pascal systems that gladly read 123 from the line "123ABD", 
           ^^^^^^ need I say more? :-)
>and those who chokes, saying "inavlid integer".

Yes, but this isn't a deficiency of the file format; it is a deficiency
of the implementation of the programming language (I knew this
discussion was somehow relevant to this group :-)).  So far, your only
valid argument for using binary files instead of text files is that it
is cumbersome to do text manipulation with languages such as Pascal,
Modula-2, etc.

>>You have a data file with packed records in it, and you (the programmer) 
>>have *no idea* how the data is actually formatted.

>Isn't this a point? I always thought that a high level of abstraction  
>as possible was a good thing. You don't need to know the actaul disk
>format until you really have a need to move the file.

But some of us don't plan on using the same machine forever (or even
for one year).  I would hate to have to write conversion programs every
time I needed to port something.  The problem with abstraction is that
if the model wasn't designed just right, you typically have to find a way
around it.  I would much rather be able to design the model for
abstraction from the ground up than be forced into using what
someone else thought would be good enough.  Standard Pascal does
not give me these primitives when it comes to files; other languages
do.  All you have done here is point out another problem with the
language, not the data format.

>>It's true, you have to parse some of the data file (the numbers), but
>>even Pascal gives you a means of writing and reading integers of a
>>fixed width.  

>The problem is that you often have little use for these standard 
>routines, unless you can accept that the program crashes because there 
>was a letter where you expected a number.

Again, a deficiency of the programming language, not of the data format.
In C, people use the standard routines with no problems; they report
errors back to the caller instead of ungracefully crashing the way the
I/O routines in Wirth-type languages do.
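
For example, a strict integer reader can be built on the standard
strtol() routine.  This is only a sketch; the name parse_int_strict and
the choice to reject trailing characters are assumptions made for this
example.  The point is that a line like "123ABD" comes back as an error
code the caller can handle, and nothing crashes.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a string as a decimal int.  Leading white space is skipped by
   strtol().  Returns 0 and stores the value on success; returns -1 for
   input with no digits, trailing characters, or an out-of-range value. */
int parse_int_strict(const char *s, int *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(s, &end, 10);
    if (end == s)                 /* nothing was converted */
        return -1;
    if (*end != '\0')             /* trailing characters, e.g. "123ABD" */
        return -1;
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -1;
    *out = (int)v;
    return 0;
}
```

Whether to accept "123" out of "123ABD" or to reject the whole line is
then a decision made in one place, by the program, not by the compiler
vendor.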

>Storing data in text files 
>gives you a bigger problem with data integrity, than with binary
>files.

Actually, the opposite is true.  Since the data is packed more densely
in binary formats (if this weren't true, there would be nothing that
would distinguish text formats from binary formats), there is less
redundancy to check against, so it is more likely that a data error
will go by unnoticed.
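
A small sketch in C of why the redundancy helps, assuming ASCII and a
fixed-width decimal field (the name field_is_digits is made up for this
example).  Only 10 of the 256 possible byte values are digits, so most
corrupted bytes in a numeric text field are caught by a trivial check.
In a binary integer, every byte value is legal and there is nothing to
check.

```c
#include <stddef.h>

/* Return 1 if the n-byte field consists only of ASCII digits, else 0.
   An empty field is treated as invalid. */
int field_is_digits(const unsigned char *field, size_t n)
{
    size_t i;

    if (n == 0)
        return 0;
    for (i = 0; i < n; i++)
        if (field[i] < '0' || field[i] > '9')
            return 0;
    return 1;
}
```

This does not catch everything: a '4' that turns into a '5' still looks
valid.  It only shows that the text form gives the reader something to
test, where the packed form gives it nothing.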

>>you *can* add records with a text editor, 

>A plus, but applying the text editor is clearly a violence on data 
>integrity.

What makes a text editor a 'violence on data integrity'
any more than someone hacking together a program to modify the data?
The latter is probably worse, since it is much harder to check the
integrity within a program than by just looking at it through an
editor.

Besides, whenever I need a binary format for data, I use a hex-oriented
file editor.  It is an essential debugging tool (especially if the data
gets corrupted).  The existence of this editor has no bearing on my
data integrity.  I take other precautions regardless of the data
format (e.g., setgid to a special group for Un*x-based systems where the
data is to be shared).

>>you can debug your code much more easily, 

>Since I have less code, binary files win here, as long as I have
>good debugger around.

It had better not let you modify your data file!  As you said, that
would be a 'violence on data integrity'.

Also, it is much easier to find an error in a text file than it is in a
binary data file.  (Why do you use a good debugger in the first place?
Usually one of the reasons is so that you can see a symbolic
representation of your program.  In other words, you need to look at a
text format.)  An error you find in the data can be traced back to the
program; this is a useful debugging technique.  You only have less code
by half a dozen procedures, and your code is much more interdependent
than mine.  I would think that your method would lead to more errors
than mine.

>You don't have to think so much about 
>integreity checks,

Strike one!  The integrity checks are more complicated.

>you have less problem changing the format during development,

Strike two!  If one of the data elements changes storage formats, the
rest of the program has to be checked for dependencies.

>maintenance benefits from the reduced code volume.

Strike three!  Although there is less NCSL (non-commented source
lines), the code is more interdependent and hence more complex,
resulting in higher maintenance costs.

>  What to use is a decision the programmer has to make based on the
>requirements on portability (+ for text), performance (+ for binary),

I agree (finally :-)) with these two.


Most of the points that you brought up came about because it is much
harder to do rigorous text manipulation in the language you were using.
If you are stuck using a restrictive language, then this is a valid
point.  (BTW, I'm not trying to start a C vs Pascal debate.  Different
languages have different strong points and different weaknesses.  Due to
other factors, we can't always use the language best suited for the
task.)  But don't use this to say that text is worse than binary when
the real problem is with the language, not the format.
-- 
 _ __		NEVIN J. LIBER  ..!att!ihlpb!nevin1  (312) 979-4751  IH 4F-410
' )  )  "I catch him with a left hook. He eels over. It was a fluke, but there
 /  / _ , __o  ____  he was, lying on the deck, flat as a mackerel - kelpless!"
/  (_</_\/ <__/ / <_	As far as I know, these are NOT the opinions of AT&T.