Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!ut-sally!std-unix
From: std-unix@ut-sally.UUCP (Moderator, John Quarterman)
Newsgroups: comp.std.unix
Subject: tar vs. cpio
Message-ID: <8188@ut-sally.UUCP>
Date: Mon, 1-Jun-87 22:25:08 EDT
Article-I.D.: ut-sally.8188
Posted: Mon Jun 1 22:25:08 1987
Date-Received: Wed, 3-Jun-87 04:11:49 EDT
Reply-To: std-unix@sally.utexas.edu
Lines: 309
Approved: jsq@sally.utexas.edu (Moderator, John Quarterman)

Included below is a draft proposal for IEEE P1003.1 regarding the
recently raised issue of Archive/Data Interchange Format. I will
deliver a proposal resembling it to P1003.1 at their next meeting,
which is three weeks from today, in Seattle.

Note two things: this is a proposal for P1003.1, not P1003.2 or any
other group; and if you disagree with my conclusions, you can submit
your own proposal (the address is below). If you agree with my
approach but think it needs adjusting, you can send me mail or submit
articles. If you disagree, you can also do those things.

tar vs. cpio

IEEE P1003.1 N.___

1 June 1987

John S. Quarterman
Institutional Representative from USENIX
usenix!jsq

Secretary, IEEE Standards Board
Attention: P1003 Working Group
345 East 47th St.
New York, NY 10017

In both the Trial Use Standard and Draft 10, POSIX §10.1 describes a
data interchange format based on the tar program. That section has
appeared in every draft of IEEE 1003.1 in some form and has always
been based on tar format.

The P1003.1 Working Group has recently received two related proposals
regarding that section: one to add cpio format (including old-style,
non-ASCII (non c option) format); the other to replace the existing
tar-based format with cpio format. Some clarifications to the former
were also received. It was also proposed verbally in the latest
Working Group meeting to drop §10.1 altogether and let P1003.2 handle
the issue.

The present note is a response to those proposals.
Much of the detail in it is derived from articles posted in the USENET
newsgroup comp.std.unix. Those articles are referenced with this
format: which gives the volume (11) and number of the article, and the
name of the submitter. If no submitter name is given, the posting was
by the moderator, John S. Quarterman. Thanks to those who submitted
articles. However, the content of this note is solely the
responsibility of the author.

There are a number of problems with both cpio formats. First, those
related to the non-ASCII format:

1. Numerous parameters, including inode numbers, mode bits, and user
   and group IDs, are kept in two-byte binary integers. This has
   historically produced serious byte-order problems when data is
   moved among systems with different byte orders.

2. The byte-swapping and word-swapping options to the cpio program
   are inadequate patches; with an ASCII format the problem would not
   be present. The options are not consistent across versions of the
   program: in System III, data blocks and file names are byte
   swapped; in System V, only data blocks are byte swapped.

3. The two-byte integer format limits the range of inode numbers to
   1..65535. Many current file systems are bigger than that.

Non-ASCII cpio format is clearly not portable and should not even be
considered for standardization.

There are several problems that occur even with the ASCII cpio format:

1. Many implementations of cpio only look at the lower 16 (or even 15)
   bits of the inode number, even in ASCII format. This is because
   the variable that is used to contain the value is declared to be
   unsigned short, just as in binary format. Thus, even though ASCII
   cpio format does not constrain this number, it is still less than
   portable.

2. The proposed cpio ASCII format, as specified, is not portable
   because the proposal assumes that sizeof(int) == sizeof(long).

3. The file type is written in a numerical format, making it
   UNIX-specific rather than POSIX-specific, since POSIX (and tar)
   specifies symbolic, rather than numerical, values for file types.

4. Hard links are not handled well, since cpio format does not record
   that two files are linked. If two files that are linked are
   written in cpio format, two copies will be written. There is an
   option to the cpio program to detect duplicate files by matching
   pairs of (h_dev, h_ino) and producing links, but that is done
   after the fact. (There is a program, afio, that handles cpio
   format more efficiently in this and other cases than the licensed
   versions of the program.)

5. Symbolic links are not handled at all, and no type value is
   reserved for them. This makes cpio useless on a large class of
   historical implementations (those based on 4.2BSD or its file
   system) for one of the main purposes of POSIX §10.1: archiving
   files for later retrieval and use on the same system.

6. The cpio format is less common than tar format: there are few
   historical implementations from Version 7 on that do not have tar;
   there are many that do not have cpio. It is true that cpio
   (non-ASCII format) was invented before tar, apparently in PWB
   System 1.0. However, cpio was not available outside AT&T before
   the release of System III, while tar was in wide use with Version 7
   and is still much more common. Also, it appears that the cpio
   format of PWB was not the same as that of System III. Although
   System III and perhaps early releases of System V did not include
   tar, current releases of System V do.

7. It is very late in the process to propose that P1003.1 adopt cpio
   format now, especially considering that it was originally proposed
   to and rejected by the /usr/group committee before P1003.1 was
   even formed.

There are several advantages to the current tar-based format as
specified in §10.1:

1. There are no byte- or word-swapping issues caused by the format,
   since all the header values are ASCII byte streams.

2. There are no inode numbers recorded, and file types are kept in
   symbolic form, so the format is less implementation-specific than
   cpio format.

3. Historical tar format is the most widely used, as discussed in
   item 6 above, despite apparent assertions to the contrary.

4. The format specified in §10.1 is upward-compatible with tar
   format. Old tar archives can be extracted by a program that
   implements §10.1. Archives using some of the extensions of §10.1
   can be extracted with old (Version 7) tar programs, although
   symbolic links will not be extracted and contiguous files will not
   be handled properly (cpio does not handle these capabilities at
   all). Files with very long names will not be handled properly
   (cpio does no better at this). All tar implementations are
   compatible to this extent.

5. The /usr/group working group and P1003.1 have already done the
   work required to add optional extensions (such as symbolic links,
   contiguous files, and long file names) that are needed on many
   historical implementations and that cpio format lacks.

6. The format is extensible for future facilities.

7. There is a public domain implementation of the format of §10.1.
   That implementation provided feedback which led to improvements in
   the current specification, and has been in use for years in
   transferring data with licensed tar implementations.

8. Many people prefer the user interface of the cpio program to that
   of the tar program, because the former can accept a list of
   pathnames to archive on standard input while the latter takes them
   as arguments, limiting the length of the list. However, the
   above-mentioned public domain implementation of tar accepts
   pathnames on standard input. Diffs to standard tar to add an
   option to accept pathnames on standard input when creating an
   archive have also been posted to USENET.
   The user interface is, in any case, irrelevant to P1003.1.

There are some problems that neither tar nor cpio handles well.

1. An option to prevent crossing mount points would be useful for
   backups. However, this appears to be more of an implementation
   issue than a format issue, especially considering that there are
   options to the find program in 4.2BSD, SunOS 3.2, and System V
   Release 3.0 that take care of this.

2. The default block size in many tar implementations is too large
   for some tape controllers to read (the 3B20 has this problem).
   This is not a problem with the interchange format, however.

There is nothing that the proposed cpio can handle that the tar-based
format already in POSIX §10.1 cannot handle; in fact, the former is
less capable. If cpio format were augmented to handle missing
capabilities, it would be subject to the same objection now aimed at
the format given in §10.1: that it is not identical with an existing
format.

There is no advantage in replacing the current tar-based format of
§10.1 with cpio format. There is also no advantage in adding cpio
format, because two standards are not as good as a single standard.

Some have recommended removing §10.1 from POSIX altogether, perhaps
with a recommendation for P1003.2 to pick up the idea. While I
believe that that would be preferable to adding cpio format, whether
or not tar format remains, I recommend leaving §10.1 as it is, because

o  The inclusion of an archive/interchange file format is in
   agreement with the purpose of POSIX to promote portability of
   application programs across interface implementations. Some
   format will be used. It is to the advantage of the users of the
   standard for there to be a standard format.

o  The de facto standard is tar format. The current §10.1
   standardizes that, and provides upward-compatible extensions in
   areas that were previously lacking.

The Archive/Interchange File Format should be left as it is.

Thank you,
John S. Quarterman

Volume-Number: Volume 11, Number 41
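
The byte-order and inode-width objections raised above against the
cpio formats can be illustrated with a minimal sketch in Python. It is
not taken from any cpio or tar source: the function names, the
six-digit octal field width, and the sample inode numbers are
assumptions chosen for illustration only. It shows a two-byte binary
field changing value when read with the opposite byte order, the same
value surviving as ASCII octal digits, and two distinct inode numbers
colliding once held in a 16-bit unsigned short.

```python
import struct


def write_binary_field(value, byteorder):
    """Pack a value into a two-byte binary integer in the writer's byte order."""
    fmt = "<H" if byteorder == "little" else ">H"
    return struct.pack(fmt, value & 0xFFFF)


def read_binary_field(raw, byteorder):
    """Unpack a two-byte binary integer in the reader's byte order."""
    fmt = "<H" if byteorder == "little" else ">H"
    return struct.unpack(fmt, raw)[0]


def write_ascii_field(value, width=6):
    """Format a value as zero-padded octal ASCII digits (width is assumed)."""
    return format(value, "o").rjust(width, "0").encode("ascii")


def read_ascii_field(raw):
    """Parse octal ASCII digits; no byte order is involved."""
    return int(raw.decode("ascii"), 8)


def as_unsigned_short(inode):
    """Mimic holding an inode number in a 16-bit unsigned short."""
    return inode & 0xFFFF


mode = 0o100644  # regular file, permissions rw-r--r--

# Binary field written on a little-endian machine, read on a big-endian one.
raw = write_binary_field(mode, "little")
swapped = read_binary_field(raw, "big")

# ASCII field: the same bytes mean the same thing on either machine.
ascii_raw = write_ascii_field(mode)
round_trip = read_ascii_field(ascii_raw)

# Two distinct inode numbers that become equal after 16-bit truncation,
# so a (device, inode) match would wrongly treat the files as linked.
collide = as_unsigned_short(70000) == as_unsigned_short(4464)
```

In this sketch the binary field read with the wrong byte order no
longer equals the mode that was written, while the ASCII field round
trips unchanged, which is the property claimed for ASCII header values
in the list of tar advantages.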