Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!snorkelwacker.mit.edu!thunder.mcrcim.mcgill.edu!mouse
From: mouse@thunder.mcrcim.mcgill.edu (der Mouse)
Newsgroups: comp.unix.programmer
Subject: Re: Unix binary/text files: is there a difference?
Message-ID: <1991Mar26.070724.1135@thunder.mcrcim.mcgill.edu>
Date: 26 Mar 91 07:07:24 GMT
References: <77384@bu.edu.bu.edu>
Distribution: na
Organization: McGill Research Centre for Intelligent Machines
Lines: 51

In article <77384@bu.edu.bu.edu>, jdubb@bucsf.bu.edu (jay dubb) writes:

> I've looked in a bunch of C and Unix books, and can't seem to find a
> good explanation of this - maybe someone can help...  Is there a way
> to tell (from a C program) whether a given file contains text or
> data?

No.  It's not a well-defined distinction, for one thing.  Many files
are both text and data - any file interpreted by a program can be
considered data....

> The reason I'd like to know, is that I've noticed that if you have a
> file into which you have done something like
> write(fid,&an_int,sizeof(int)) and then you take this file to another
> machine via FTP (in binary mode), and try to read() the int back, it
> doesn't work (because of byte-order differences, I assume).

Possibly size differences as well; sometimes an int is only 16 bits.

> So, what I'd like to know is, is there a difference (in terms of
> something stat() could tell me, for example) between straight text
> files and files which contain raw numbers (without searching through
> the whole file to check, hopefully)?

No.  The only distinction is the contents.  (It's true that executable
binaries typically have their execute bits turned on, but so do shell
scripts, and many binary files don't.)  UNIX is not a system like VMS,
with lots and lots of structure imposed on file contents by the
filesystem.

> the 'file' command seems to be able to do this - I've tried it on a
> text file, and on a file with raw ints and floats, and it says "text"
> and "data" respectively.  Does it really know, or is it making a
> guess?

It is making a guess based on reading some small portion of the file
(typically the first 1K or 4K or so) and applying various heuristics.
Often there is a file which describes various identifiable patterns,
such as the 0x1f 0x9d in the first two bytes of a compressed file, but
that's a frill for the purposes under discussion.

You were also lucky.  If your int happened to have the value 0x0a6f6f66
(175075174 in decimal) on a little-endian machine, a data file
containing just that int will look like a text file with just one line
reading "foo".  Of course, the chance of this goes down sharply with
the number of "raw" numbers being written, and other factors, but you
get the idea.

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu
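
For anyone who wants to poke at the points above, here is a minimal sketch in C. It is illustrative only, not code from the original exchange. The function names (put_u32_be, get_u32_be, put_u32_le, looks_like_text) are made up for this example, and it assumes a compiler that provides <stdint.h>. It shows one way to store a 32-bit value in a fixed byte order so that neither the host's byte order nor sizeof(int) matters, how the value 0x0a6f6f66 lays out in little-endian order, and a very crude "is it text?" guess that is far simpler than whatever heuristics any real file(1) applies.

```c
#include <stddef.h>
#include <stdint.h>

/* Store v as four bytes, most significant byte first.  The result is
 * the same on every host, whatever its byte order or sizeof(int). */
void put_u32_be(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)((v >> 24) & 0xff);
    out[1] = (unsigned char)((v >> 16) & 0xff);
    out[2] = (unsigned char)((v >> 8) & 0xff);
    out[3] = (unsigned char)(v & 0xff);
}

/* Inverse of put_u32_be: rebuild the value from four bytes. */
uint32_t get_u32_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

/* Store v least significant byte first.  This is the layout that a
 * raw write() of a 32-bit int produces on a little-endian machine. */
void put_u32_le(unsigned char out[4], uint32_t v)
{
    out[0] = (unsigned char)(v & 0xff);
    out[1] = (unsigned char)((v >> 8) & 0xff);
    out[2] = (unsigned char)((v >> 16) & 0xff);
    out[3] = (unsigned char)((v >> 24) & 0xff);
}

/* Crude guess (an assumption for this sketch, not what file(1) does):
 * return 1 if every byte is printable ASCII, tab, CR or newline,
 * and 0 otherwise. */
int looks_like_text(const unsigned char *buf, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned char c = buf[i];

        if (c == '\n' || c == '\t' || c == '\r')
            continue;
        if (c < 0x20 || c > 0x7e)
            return 0;
    }
    return 1;
}
```

Calling put_u32_le() on 0x0a6f6f66 yields the bytes 'f', 'o', 'o', '\n', which looks_like_text() accepts as text, while a buffer beginning 0x1f 0x9d is rejected because 0x9d is outside the printable range.  A file written with put_u32_be() and read back with get_u32_be() round-trips between machines regardless of their native byte order.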