Path: utzoo!utgpu!attcan!uunet!pyrdc!pyrnj!rutgers!att!ihlpb!nevin1
From: nevin1@ihlpb.ATT.COM (Liber)
Newsgroups: comp.lang.misc
Subject: Re: Text or data files?
Summary: there is some relevance to this newsgroup
Message-ID: <8853@ihlpb.ATT.COM>
Date: 5 Oct 88 00:47:26 GMT
References: <3967@enea.se>
Reply-To: nevin1@ihlpb.UUCP (55528-Liber,N.J.)
Organization: AT&T Bell Laboratories - Naperville, Illinois
Lines: 198

In article <3967@enea.se> sommar@enea.se (Erland Sommarskog) writes:
>I had the example:
>>>   Data_record = RECORD
>>>      Date           : PACKED ARRAY(.1..8.) OF char;
>>>      Time           : PACKED ARRAY(.1..8.) OF char;
>>>      Incident       : Incident_type;  (* Enumerated *)
>>>      No_of_warnings : integer;
>>>      Alarmed        : boolean;
>>>      Username       : PACKED ARRAY(.1..12.) OF char;
>>>   END;
>>>
>>>The simplest way to read and write this is to through a FILE OF Data_record,
>>>if no other programs is to read it.

>Marc W. Mengel (mmengel@cuuxb.UUCP) wrote:
>>Two major problems with this idea. The first is that most of the time
>>other programs will need to read the data sooner or later.

>If we have data that are to be read by more than one program, two
>programs can import the declaration of the data record from a common
>source, and thus they do not need to be rewritten if the format is
>changed.

This assumes that all the programs are not only run on the same type of
machine and operating system, but that they are written in the same
language using the same compiler (stuff like packed arrays is not only
*machine* dependent and *operating system* dependent, it is *language*
dependent, *compiler* dependent, and in some cases even *optimization*
dependent). This is unnecessarily restrictive, and typically not
practical in commercial environments. Regardless of whether I use text
files or binary files, I would rather write my own read/write routines
(even if they only call the standard ones) than be dependent on my
compiler.
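For illustration, here is a minimal sketch in C of what such hand-written
read/write routines might look like for the record above, using a text
file with one record per line. The enumerator names, the function names
write_record and read_record, the whitespace-separated layout, and the
assumption that no field is empty or contains whitespace are all invented
for this example; none of them come from the original program.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical enumerators; the quoted record never lists them. */
enum incident_type { INC_LOGIN_FAILURE, INC_DISK_FULL, INC_COUNT };

struct data_record {
    char date[9];               /* 8 characters plus terminating NUL */
    char time[9];               /* 8 characters plus terminating NUL */
    enum incident_type incident;
    int  no_of_warnings;
    int  alarmed;               /* 0 or 1 */
    char username[13];          /* 12 characters plus terminating NUL */
};

/* Write one record as a single line of text. Returns 0 on success. */
int write_record(FILE *fp, const struct data_record *r)
{
    assert(fp != NULL && r != NULL);
    return fprintf(fp, "%s %s %d %d %d %s\n", r->date, r->time,
                   (int)r->incident, r->no_of_warnings, r->alarmed,
                   r->username) < 0 ? -1 : 0;
}

/* Read one record.
   Returns 1 on success, 0 at end of file, -1 on a malformed line. */
int read_record(FILE *fp, struct data_record *r)
{
    char line[128];
    char extra;
    int incident, alarmed;

    assert(fp != NULL && r != NULL);
    if (fgets(line, sizeof line, fp) == NULL)
        return 0;
    /* Exactly six fields must convert; a seventh means trailing junk. */
    if (sscanf(line, "%8s %8s %d %d %d %12s %c", r->date, r->time,
               &incident, &r->no_of_warnings, &alarmed,
               r->username, &extra) != 6)
        return -1;
    if (incident < 0 || incident >= INC_COUNT ||
        (alarmed != 0 && alarmed != 1))
        return -1;
    r->incident = (enum incident_type)incident;
    r->alarmed = alarmed;
    return 1;
}
```

Only these two routines know the on-disk layout; the rest of a program
would deal with struct data_record alone.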
>Also now we have the
>problem that for one change we have to edit three in places: the read
>and write routines and the data definition, introducing a source of
>error.)

But you have gained an interface layer (is it time to throw the
'object-oriented' buzzword around yet? :-))! Except for the read/write
routines, the rest of the program is independent of the way the data is
stored on disk. This is a big advantage! (Note: this advantage comes
from the argument, not from the type of data file used.) Suppose you
decide to delete one of the fields stored on disk (because it can be
calculated, for instance), but you want the field available for the rest
of the program. If you didn't bother to put the interface layer in,
this is a maintenance nightmare.

> If you have many programs that are to read the same data, you are
>likely to get a database system, and I don't think they store data
>in a text-file format...

You wouldn't necessarily want a prepackaged DBMS. There is usually a
lot of overhead associated with a DBMS, and you have to decide whether
it is worth it.

> The only case when I can see that this argument is valid is when
>"the other program" is standard a text-oriented utility.

Well, if you're on a Un*x (sorry about the '*' in place of the 'i', but
Legal is talking about trademark protection again) system, this may be
very desirable. You can use all your familiar tools (like grep, sed,
etc.) to do many of your manipulations.

>>Second, when
>>files are written in a binary format like this, the same program cannot
>>read the data when run on a different machine with a different byte
>>ordering, so after you have built up a list of 2000 incidents, and have
>>to move to a new machine, you lose big time.

>A valid point. However, text files are not necessarily compatible either.
>Imagine that the data record above has a message field, 80 characters
>long. Assume that the program started its life on VMS and that one of
>the messages contains a CR-LF.
>Now we move to a Unix system... And I
>have seen Pascal systems that gladly read 123 from the line "123ABD",
^^^^^^ need I say more? :-)
>and those who chokes, saying "invalid integer".

Yes, but this isn't a deficiency of the file format; it is a deficiency
of the implementation of the programming language (I knew this
discussion was somehow relevant to this group :-)). So far, your only
valid argument for using binary files instead of text files is that it
is cumbersome to do text manipulation with languages such as Pascal,
Modula-2, etc.

>>You have a data file with packed records in it, and you (the programmer)
>>have *no idea* how the data is actually formatted.

>Isn't this a point? I always thought that a high level of abstraction
>as possible was a good thing. You don't need to know the actual disk
>format until you really have a need to move the file.

But some of us don't plan on using the same machine forever (or even for
one year). I would hate to have to write conversion programs every time
I needed to port something.

The problem with abstraction is that if the model wasn't designed just
right, you typically have to find a way around it. I would much rather
be able to design the model for abstraction from the ground up than be
forced into using what someone else thought would be good enough.
Standard Pascal does not give me these primitives when it comes to
files; other languages do. All you have done here is point out another
problem with the language, not the data format.

>>It's true, you have to parse some of the data file (the numbers), but
>>even Pascal gives you a means of writing and reading integers of a
>>fixed width.

>The problem is that you often have little use for these standard
>routines, unless you can accept that the program crashes because there
>was a letter where you expected a number.

Again, a deficiency of the programming language, not of the data format.
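As a rough sketch of what graceful handling can look like, the C
fragment below parses a whole field as an integer with the standard
strtol() and simply reports failure for input such as "123ABD", rather
than silently reading 123 or aborting. The helper name parse_int and its
return convention are assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse an entire string as a decimal int.
   Returns 1 and stores the value on success; returns 0 and leaves
   *out untouched on any error. */
int parse_int(const char *text, int *out)
{
    char *end;
    long value;

    assert(text != NULL && out != NULL);
    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text)            /* no digits at all, e.g. "ABD" or "" */
        return 0;
    if (*end != '\0')           /* trailing garbage, e.g. "123ABD" */
        return 0;
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
        return 0;               /* does not fit in an int */
    *out = (int)value;
    return 1;
}
```

The caller decides what to do with a bad field (skip the record, log it,
ask again); nothing in the routine forces the program to stop.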
In C, people use the standard routines with no problems; they don't
ungracefully crash when an error occurs like Wirth-type languages do.

>Storing data in text files
>gives you a bigger problem with data integrity, than with binary
>files.

Actually, the opposite is true. Since the effective data is more
compressed in binary formats (if this weren't true, there would be
nothing to distinguish text formats from binary formats), it is more
likely that a data error will go by unnoticed.

>>you *can* add records with a text editor,

>A plus, but applying the text editor is clearly a violence on data
>integrity.

What makes a text editor a 'violence on data integrity' any more than
someone hacking together a program to modify the data? The latter is
probably worse, since it is much harder to check the integrity within a
program than by just looking at it through an editor.

Besides, whenever I need a binary format for data, I use a hex-oriented
file editor. It is an essential debugging tool (especially if the data
gets corrupted). The existence of this editor has no bearing on my data
integrity. I take other precautions regardless of the data format
(e.g., setgid to a special group for Un*x-based systems where the data
is to be shared).

>>you can debug your code much more easily,

>Since I have less code, binary files win here, as long as I have
>good debugger around.

It had better not let you modify your data file! As you said, that
would be a 'violence on data integrity'. Also, it is much easier to
find an error in a text file than it is in a data file (why do you use a
good debugger in the first place? So that you can see a symbolic
representation of your program is usually one of the reasons. In other
words, you need to look at a text format). The error you find in the
data can be traced back to the program. It is a useful debugging
technique.

You only have less code by half a dozen procedures and your code is much
more interdependent than mine.
I would think that your method would lead to more errors than mine.

>You don't have to think so much about
>integrity checks,

Strike one! The integrity checks are more complicated.

>you have less problem changing the format during development,

Strike two! If one of the data elements changes storage formats, the
rest of the program has to be checked for dependencies.

>maintenance benefits from the reduced code volume.

Strike three! Although there is less NCSL (non-commented source lines),
the code is more interdependent and hence more complex, resulting in
higher maintenance costs.

> What to use is a decision the programmer has to make based on the
>requirements on portability (+ for text), performance (+ for binary),

I agree (finally :-)) with these two.

Most of the points that you brought up came about because it is much
harder to do rigorous text manipulation in the language you were using.
If you are stuck using a restrictive language, then this is a valid
point. (BTW, I'm not trying to start a C vs Pascal debate. Different
languages have different strong points and different weaknesses. Due to
other factors, we can't always use the language best suited for the
task.) But don't use this to say that text is worse than binary when
the real problem is with the language, not the format.
--
 _ __        NEVIN J. LIBER  ..!att!ihlpb!nevin1  (312) 979-4751  IH 4F-410
' )  )  "I catch him with a left hook. He eels over. It was a fluke, but there
 /  / _ , __o  ____  he was, lying on the deck, flat as a mackerel - kelpless!"
/  (_