Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!ames!ncar!tank!uxc!uxc.cso.uiuc.edu!a.cs.uiuc.edu!m.cs.uiuc.edu!wsmith From: wsmith@m.cs.uiuc.edu Newsgroups: comp.lang.misc Subject: Re: Text or data files? Message-ID: <5200029@m.cs.uiuc.edu> Date: 12 Oct 88 12:28:00 GMT References: <3958@enea.se> Lines: 87 Nf-ID: #R:enea.se:3958:m.cs.uiuc.edu:5200029:000:4723 Nf-From: m.cs.uiuc.edu!wsmith Oct 12 07:28:00 1988 >>Uh? If you have a text file and change the format you have to rewrite >>the parsing and the writing-to file parts. With a fixed format you >>change the declaration, and that's all. (Well, you may have to write >>a simple program to convert old files, but you have do to that for >>text files too.) > >What's so tough about fixed format text parsing? Every language known >to mankind can read and write integers, etc. from a file; The format >is easily extensible, you *can* add records with a text editor, you >can debug your code much more easily, you can write programs in other >languages or on other machines that can read your data files, you can >use Unix utilities like grep and sed and awk to make useful reports... >The list goes on and on. I would like to address an orthogonal issue. It can be implemented as either text files or as binary files equally well. "How do you portably save a data structure with pointers?" If the data structure has a canonical order, it may be traversed via some implicit spanning tree and all of the pointers could be stored in the data structure implicitly. If the data structure is more complex, I use some overhead in each record to hold an integer mark that I use while traversing the data structure. First, I set each mark to a negative number (using the same marking algorithm as I use to traverse the data structure). Once all the marks are negative, I traverse the data structure again. This time, I output all of the data plus an integer for each pointer. If the mark for the pointer is positive, I output the mark. If the mark is negative, I reset the mark to the next positive available mark, output that number and add the number to a queue of unprocessed records. At the end the queue will be empty. This algorithm is described in "Marking Algorithms", Lars-Erik Thorelli, BIT 12 (1972) pp. 555-68. To read the data structure back, I create an array with one null pointer for each record in the data structure. (In the first pass that sets the marks to a negative number, I also keep of count of the number of records. The first integer output is the number of records so that I may allocate the array when I start reading in the data.) When a pointer is read in, this array is consulted. If the pointer in non-null, the record has been read in and that pointer may be used as a reference to the actual record. If the pointer is null, the record has not been read in yet. Save the index in a queue along with a reference to the pointer that needs to be fixed when the record is read in. After a record is read in, the next element of the queue is read. If the pointer has already been filled in, just fixup the reference with the value from the array, otherwise, read it in as a new record. Continue until the queue is empty. > >Binary data saves you a whole 20 minutes to an hour when writing >the program, makes your data files unportable between machines, >makes you have to use the same language you started with on the >same machine as long as you want to use the same data files... >This list also goes on and on. > If you do more work on your binary data format you can make it portable across architectures. The hard porting problems are ASCII vs. EBCDIC and a computer with non-8 bit bytes and will cause some portability headaches, but they are not impossible. The trick, I think, is to define a standard byte stream format for each data type (including floating point, if you use it). Then, when you want to write a given data type into the file, you call a subroutine for that specific data type. The subroutine could write in binary, ASCII or some home-brew bcd format because that is merely an implementation detail. Pointers will map onto the integer type as described above. In addition, by representing the high-level file data format at the beginning of a file, as part of the file format definition, you can write a data editor that provides an interface to the data in the program. Once a data structure is complicated enough, for example, one equivalent to an arbitrary multigraph, editing a text file will be a daunting task to begin with. If your application is complex enough to need a bigger and better file format, it will be worthwhile to put the effort in to allow for some form of data translation from version 1.1 to 1.2 of your software. With the data format inside the file, you are safe from the program and data becoming inconsistent because the program will tell you when that happens and may even automatically compensate. >-- > Marc Mengel > mmengel@cuuxb.att.com > attmail!mmengel > {lll-crg|mtune|att}!cuuxb!mmengel Bill Smith wsmith@cs.uiuc.edu uiucdcs!wsmith