Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!ncar!husc6!bloom-beacon!adam.pika.mit.edu!scs
From: scs@adam.pika.mit.edu (Steve Summit)
Newsgroups: comp.lang.c
Subject: Re: binary data files
Message-ID: <11021@bloom-beacon.MIT.EDU>
Date: 2 May 89 06:36:25 GMT
References: <10946@bloom-beacon.MIT.EDU> <12546@ut-emx.UUCP> <8758@csli.Stanford.EDU>
Sender: daemon@bloom-beacon.MIT.EDU
Reply-To: scs@adam.pika.mit.edu (Steve Summit)
Lines: 69

In article <8758@csli.Stanford.EDU> poser@csli.stanford.edu (Bill Poser) writes:
>I agree that in many cases it is desirable to use ASCII data files,
>but in some situations binary is better. One such situation is when
>you need to know how many items are in the file before you read it
>(say to allocate storage). If the data is binary you just
>stat the file and divide by the item size.

Actually, this illustrates another thing worth shying away from
if you can.  The assumption that you can determine exactly how
many characters a file contains, without actually reading it,
can get you into trouble, although of course it's a perfectly
valid assumption on Unix systems.  Not so on VMS and MS-DOS and
doubtless other lesser systems -- stat() or the equivalent may
only give you an approximation.

A prime example is Unix tar format: a tar file consists of a file
header, followed by a file, followed by a file header, etc.  The
file header contains the (following) file's size; the size must
be exact because the program reading the tar file must use it to
determine where the file ends and the next header begins.  It's
trivial to write the header correctly on Unix: just stat the
file.  If you're trying to create tar files on other systems (a
reasonable thing to do, since tar is an interchange format) you
typically have to read each file twice: once to count the
characters in it, and a second time to copy it to the tar output
file.
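
For what it's worth, the counting pass can be sketched along the
following lines.  This is only an illustration, not code from any
real tar: count_chars and copy_counted are names made up here, and
a real tar writer would emit its header between the two passes.
On systems that distinguish text files from binary files, the
count depends on the mode the file was opened in, so both passes
should use the same stream (or at least the same mode).

```c
#include <stdio.h>

/* Count the characters in a stream by reading it to end-of-file.
   Returns the count, or -1 on a read error. */
long count_chars(FILE *fp)
{
	long n = 0;
	int c;

	while ((c = getc(fp)) != EOF)
		n++;

	return ferror(fp) ? -1L : n;
}

/* Hypothetical two-pass copy: pass one counts, pass two copies
   exactly that many characters.  Returns the count, or -1 on error.
   Assumes `in` is seekable, so that rewind() works. */
long copy_counted(FILE *in, FILE *out)
{
	long size = count_chars(in);	/* pass 1: count */
	long i;
	int c;

	if (size < 0)
		return -1L;

	rewind(in);			/* back to the start for pass 2 */

	/* a tar-like writer would write its header, containing size, here */

	for (i = 0; i < size; i++) {	/* pass 2: copy */
		if ((c = getc(in)) == EOF || putc(c, out) == EOF)
			return -1L;
	}
	return size;
}
```

The point of the exercise is that the size written into the header
is the number of characters the copying pass will actually see,
whatever the operating system's idea of the file's size may be.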

The moral is that if you're writing a program that might be
ported to a non-Unix system, don't depend on the ability to find
a file's size, "in advance," without explicitly reading it.

Getting back to data files, it's not necessary to know how big
they are while reading them.  Just use code like the following:

	int nels = 0;		/* number of items read so far */
	int nallocated = 0;	/* number of items p has room for */
	struct whatever *p = NULL;

	while(there's another item) {
		if(nels >= nallocated) {
			/* out of room: grow the array by 10 items */
			nallocated += 10;
			if(p == NULL)
				p = (struct whatever *)malloc(
					nallocated * sizeof(struct whatever));
			else	p = (struct whatever *)realloc((char *)p,
					nallocated * sizeof(struct whatever));

			if(p == NULL)
				complain;	/* and give up */
		}

		read item into p[nels];

		nels++;
	}

If realloc can handle a NULL first argument (treating it like a
call to malloc), you can dispense with the initial test and call
to malloc, and always call realloc (which is why I'm always
ranting in favor of this realloc functionality, which ANSI C
incidentally requires).
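
Here is a minimal sketch of the realloc-only version, assuming an
ANSI realloc that accepts NULL.  The function name read_ints, the
CHUNK constant, and the choice of whitespace-separated integers as
the "items" are all assumptions made for illustration.  It also
assigns realloc's result to a temporary, so that the old block can
still be freed if the allocation fails.

```c
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 10	/* grow by this many elements at a time */

/* Read whitespace-separated integers from fp into a growing array.
   On success, stores the array in *out (NULL if nothing was read)
   and returns the number of elements.  On allocation failure,
   frees what was read so far and returns -1. */
long read_ints(FILE *fp, int **out)
{
	long nels = 0;
	long nallocated = 0;
	int *p = NULL;
	int item;

	while (fscanf(fp, "%d", &item) == 1) {
		if (nels >= nallocated) {
			int *newp;

			nallocated += CHUNK;
			/* realloc(NULL, n) acts like malloc(n) in ANSI C */
			newp = realloc(p, (size_t)nallocated * sizeof *newp);
			if (newp == NULL) {
				free(p);	/* old block is still valid */
				return -1L;
			}
			p = newp;
		}
		p[nels++] = item;
	}

	*out = p;
	return nels;
}
```

The caller never needs to know the item count in advance; it gets
the count back along with the array, and frees the array when done.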

The on-the-fly reallocation may look inefficient, but "it doesn't
matter much in practice."  (At least for me.  When I'm really
unconcerned with efficiency, I even skip the nallocated += 10
chunking jazz and call realloc for each item read, and that has
never caused problems either.  Your mileage may vary.)

                                            Steve Summit
                                            scs@adam.pika.mit.edu