Path: utzoo!attcan!uunet!tank!ncar!mailrus!tut.cis.ohio-state.edu!bloom-beacon!athena.mit.edu!scs
From: scs@athena.mit.edu (Steve Summit)
Newsgroups: comp.lang.c
Subject: Re: Portability across architectures..
Keywords: Portability, common data, files
Message-ID: <7038@bloom-beacon.MIT.EDU>
Date: 13 Sep 88 01:29:30 GMT
References: <103@simsdevl.UUCP>
Sender: daemon@bloom-beacon.MIT.EDU
Reply-To: scs@adam.pika.mit.edu (Steve Summit)
Distribution: all
Lines: 91

In article <103@simsdevl.UUCP> dandc@simsdevl.UUCP (Dan DeClerck) writes:
> I've run across a need to have data files in various forms of UN*X
> be portable to each other.
> I could write data out to files in ASCII, but this is cumbersome,
> slow and may hamper the products' marketability.

Please strongly consider using ASCII after all.  The advantages
are many; the disadvantages are comparatively minor.

     1.	ASCII is well-nigh universal; portability is virtually
	assured.  Even if you ever want to go to an EBCDIC
	machine, conversion utilities are bound to be readily
	available (and conversion may indeed happen implicitly
	when transferring a text file to such a machine).

     2.	It's usually not nearly as inefficient as you'd think.
	Ironically, even sophisticated computer programmers
	commonly ignore the fact that computers are just
	blisteringly fast and can usually complete a seemingly
	inefficient ASCII parse in far less time than it takes
	to think about it.  (I am aware that there are high-
	bandwidth, high-performance systems which cannot afford
	the luxury of an ASCII parse, and are well-advised to use
	binary transfer methods.  I maintain that surprisingly
	many real applications do not fall into this category,
	and can use ASCII without paying a performance penalty.)

     3.	Reading and writing ASCII formats isn't really that
	cumbersome; in fact I'd argue that binary formats, when
	properly designed to account for word ordering and other
	difficulties which ASCII formats easily overcome, are
	more cumbersome in the long run.

     4.	Don't overlook debugging.  ASCII formats can be
	inspected with cat, piped through grep and sed and other
	familiar utilities, patched with ordinary text editors,
	etc., etc.  The first program you write for your binary
	format is usually not the application you were trying to
	write, but the disassembler you find you need for
	debugging; getting the disassembler working is often a
	prerequisite for getting the end application working.

     5.	ASCII formats can make good, backwards-compatible
	version number schemes easy to implement.  Data formats
	inevitably require revision to accommodate new features.
	Fixed binary formats, especially those that simply write
	structures out as bytes, are usually not amenable to such
	changes, unless you did a lot of work to make them
	extensible (which is another aspect that makes binary
	formats more, not less, cumbersome than ASCII).
	Introducing a "version 2" format then requires a host of
	extra translation utilities, and nasty incompatibility
	problems when programs try to read files of the wrong
	format.  (These compatibility problems can be successfully
	worked around, but only if all files contain a version
	number, a need which is usually not recognized, let alone
	implemented, until version 1 is in place and version 2 is
	being contemplated, by which time it's too late.)

	Suppose, on the other hand, that your ASCII format
	consists of arbitrary lines of text, with a keyword at
	the beginning of each line indicating what kind of data
	(e.g. what field of a structure) that line contains.  If
	programs ignore unrecognizable lines (a good practice),
	"version 1" programs can read "version 2" files without
	modification, if the version 2 keywords are a superset of
	version 1's.  Version 1 filters and editors can even
	modify version 2 files, without losing version-2-specific
	information, by saving, and echoing to the output,
	unrecognized lines without interpretation.

	(It's true that a binary format employing variable-length
	records with a type field in a consistent place would
	also enjoy these advantages.  Such records are in fact
	common in network protocols.)
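
Here is a rough sketch of the kind of keyword-per-line reader
described in point 5.  The record layout, the keywords "name" and
"count", and the function read_record are all invented for
illustration; a real format would define its own.  Lines longer
than the buffer are not handled specially.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record; field names and keywords are illustrative. */
struct record {
    char name[64];
    int  count;
};

/*
 * Read keyword-prefixed lines from in until EOF, filling *r from the
 * keywords this reader knows.  Any other line is copied verbatim to
 * passthru (if non-NULL) so it can be echoed to the output later.
 * Returns the number of lines that were not recognized.
 */
int read_record(FILE *in, struct record *r, FILE *passthru)
{
    char line[256], key[32];
    int off, unknown = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        off = 0;
        /* Split off the leading keyword; %n records where the value starts. */
        if (sscanf(line, "%31s %n", key, &off) == 1) {
            if (strcmp(key, "name") == 0 &&
                sscanf(line + off, "%63s", r->name) == 1)
                continue;
            if (strcmp(key, "count") == 0 &&
                sscanf(line + off, "%d", &r->count) == 1)
                continue;
        }
        /* Unknown keyword or malformed value: keep it, don't interpret it. */
        unknown++;
        if (passthru != NULL)
            fputs(line, passthru);
    }
    return unknown;
}
```

A "version 1" filter built this way would write out its own fields
and then copy the saved lines after them, so a version 2 line such
as "color blue" passes through untouched.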

The only real problem I've ever had with ASCII data interchange
formats is that you tend to lose a bit of precision when reading
and writing doubles, but you can minimize this by printfing
things with %.ne, for n sufficiently large.  If the precision
inherent in the data is less than that of a double, you're only
"losing" something you didn't have in the first place.

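As a sketch of what "n sufficiently large" means in practice:
assuming IEEE 754 doubles and a C library whose printf and strtod
round correctly (neither is guaranteed everywhere), 17 significant
digits, i.e. %.16e, are enough to get back the same value.  The
function name roundtrips is made up for this example.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Print x as text using %.<digits>e, read it back with strtod, and
 * report whether the value survived unchanged (1) or not (0).
 * Assumes IEEE 754 doubles and correctly rounding conversions.
 */
int roundtrips(double x, int digits)
{
    char buf[64];

    snprintf(buf, sizeof buf, "%.*e", digits, x);
    return strtod(buf, NULL) == x;
}
```

With these assumptions, roundtrips(1.0 / 3.0, 16) succeeds, while
roundtrips(1.0 / 3.0, 6) does not, since 3.333333e-01 is no longer
the same double.
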
I'm not sure how using ASCII data formats could "hamper the
products' marketability."  If not an efficiency concern, it's
probably some attempt to keep information hidden in a cryptic
binary format rather than having it in plain text that anyone
could read.

                                            Steve Summit
                                            scs@adam.pika.mit.edu