Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!ut-sally!std-unix
From: std-unix@ut-sally.UUCP (Moderator, John Quarterman)
Newsgroups: comp.std.unix
Subject: tar vs. cpio
Message-ID: <8280@ut-sally.UUCP>
Date: Wed, 17-Jun-87 10:22:34 EDT
Article-I.D.: ut-sally.8280
Posted: Wed Jun 17 10:22:34 1987
Date-Received: Sun, 21-Jun-87 09:29:50 EDT
Reply-To: std-unix@sally.utexas.edu
Lines: 361
Approved: jsq@sally.utexas.edu (Moderator, John Quarterman)

Yesterday, 16 June, was the deadline I had set for collecting tar and
cpio comments. Included below is the revised note for P1003.1,
incorporating those comments. I will deliver it to P1003.1 in Seattle
on Monday.

tar vs. cpio

IEEE P1003.1 N.___

17 June 1987

John S. Quarterman
Institutional Representative from USENIX
usenix!jsq

Secretary, IEEE Standards Board
Attention: P1003 Working Group
345 East 47th St.
New York, NY 10017

In both the Trial Use Standard and the current Draft 10, POSIX §10.1
describes a data interchange format based on the tar program. That
section has appeared in every draft of IEEE 1003.1 in some form and has
always been based on tar format.

The P1003.1 Working Group has recently received two related proposals
regarding that section: one to add cpio format (including old-style,
non-ASCII (non c option) format); the other to replace the existing
tar-based format with cpio format. Some clarifications to the former
were also received. It was also proposed verbally in the latest Working
Group meeting to drop §10.1 altogether and let P1003.2 handle the
issue.

The present note is a response to those proposals. Much of the detail
in it is derived from articles posted in the USENET newsgroup
comp.std.unix. Those articles are referenced in a format that gives the
volume (always 11) and number of the article, and the name of the
submitter. If no submitter name is given, the posting was by the
moderator, John S. Quarterman. Thanks to those who submitted articles.
However, the content of this note is solely the responsibility of the
author.

This note is addressed to P1003.1, and is concerned with data
interchange formats. Although user interface issues may be of interest
to P1003.2, they are not addressed here.

There are a number of problems with both cpio formats. First, those
related to the non-ASCII format:

1.  Numerous parameters, including inode numbers, mode bits, and user
    and group IDs, are kept in two-byte binary integers. This has
    historically produced serious byte-order problems when data is
    moved among systems with different byte orders.

2.  The byte-swapping and word-swapping options to the cpio program are
    inadequate patches; with an ASCII format the problem would not be
    present. The options are not consistent across versions of the
    program: in System III, data blocks and file names are byte
    swapped; in System V, only data blocks are byte swapped.

3.  The two-byte integer format limits the range of inode numbers to
    0..65535. Many current file systems are bigger than that.

Non-ASCII cpio format is clearly not portable and should not even be
considered for standardization.

There are several problems that occur even with the ASCII cpio format:

1.  Many implementations of cpio only look at the lower 16 (or even 15)
    bits of the inode number, even in ASCII format. This is because the
    variable that is used to contain the value is declared to be
    unsigned short, just as in binary format. Thus, even though ASCII
    cpio format only constrains this number to the range 0..262143, the
    format is still less than portable.

2.  The proposed cpio ASCII format as specified is not portable because
    the proposal assumes that sizeof(int) == sizeof(long).

3.  The file type is written in a numerical format, making it UNIX
    specific rather than POSIX specific, since POSIX (and tar)
    specifies symbolic, rather than numerical, values for file types.

4.  Hard links are not handled well, since cpio format does not
    directly record that two files are linked. If two files that are
    linked are written in cpio format, two copies will be written. The
    cpio program detects duplicate files by matching pairs of (h_dev,
    h_ino) and producing links, but that is done after the fact. (There
    is a program, afio, that handles cpio format more efficiently in
    this and other cases than the licensed versions of the program.)

5.  Symbolic links are not handled at all, and no type value is
    reserved for them. This makes cpio useless on a large class of
    historical implementations (those based on 4.2BSD or its file
    system) for one of the main purposes of POSIX §10.1: archiving
    files for later retrieval and use on the same system. Although it
    is possible to extend cpio to handle symbolic links, and at least
    one vendor has done this, the format proposed to P1003.1 is the
    format in the SVID, and does not handle symbolic links.

6.  The cpio format is less common than tar format: there are few
    historical implementations from Version 7 on that do not have tar;
    there are many that do not have cpio. It is true that cpio
    (non-ASCII format) was invented before tar, apparently in PWB
    System 1.0. The cpio program was first available outside AT&T with
    PWB/UNIX 1.0, and later with System III. However, in the interim,
    Version 7, which did not include cpio but did include tar, became
    the most influential system. There was a V7 addendum tape, but it
    also did not include cpio (according to its README file); the
    addendum tape was in tar format. Also, it appears that the cpio
    format of PWB was not the same as that of System III. And System
    III and all releases of System V include tar.

7.  It is very late in the process to propose that P1003.1 adopt cpio
    format now, especially considering that it was originally proposed
    to and rejected by the /usr/group committee before P1003.1 was even
    formed.
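
The byte-order and inode-range problems listed above can be illustrated
with a short sketch. It is not taken from any cpio implementation: the
function names are hypothetical, and the six-digit octal field width is
an assumption inferred from the range 0..262143 quoted above. The sketch
shows a two-byte binary field being misread on a host of the opposite
byte order, an ASCII field surviving the same transfer, and a 16-bit
truncation making two unrelated inode numbers look like a hard link
when (h_dev, h_ino) pairs are matched.

```python
import struct

ASCII_FIELD_WIDTH = 6  # assumed: six octal digits, i.e. values 0..262143


def read_binary_field(raw, byte_order):
    """Decode a two-byte binary field; byte_order is '<' or '>'."""
    return struct.unpack(byte_order + "H", raw)[0]


def write_ascii_field(value):
    """Encode a value as fixed-width, zero-padded octal ASCII digits."""
    text = format(value, "o").zfill(ASCII_FIELD_WIDTH)
    if value < 0 or len(text) > ASCII_FIELD_WIDTH:
        raise ValueError("value does not fit in the field")
    return text.encode("ascii")


def read_ascii_field(raw):
    """Decode an octal ASCII field; the result does not depend on the host."""
    return int(raw.decode("ascii"), 8)


def find_links(entries, truncate):
    """Pair each (dev, ino, path) entry with an earlier path sharing its key.

    With truncate=True the inode number is first cut to 16 bits, as
    happens when it is held in an unsigned short.
    """
    seen = {}
    links = []
    for dev, ino, path in entries:
        key = (dev, ino & 0xFFFF if truncate else ino)
        if key in seen:
            links.append((path, seen[key]))
        else:
            seen[key] = path
    return links


mode = 0o100644
written_little_endian = struct.pack("<H", mode)
misread = read_binary_field(written_little_endian, ">")
roundtrip = read_ascii_field(write_ascii_field(mode))

entries = [(1, 4464, "a"), (1, 70000, "b")]  # two unrelated files
false_links = find_links(entries, truncate=True)
true_links = find_links(entries, truncate=False)
```

Here misread is 0xA481 rather than the 0x81A4 that was written, while
roundtrip equals the original mode. With truncation, "b" is reported as
a link to "a" although the inode numbers differ; without it, no link is
reported.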

Advantages of cpio format include:

1.  Both X/OPEN and the SVID use it, although they evidently define it
    somewhat differently.

2.  Archives made in cpio format are often smaller than ones in tar
    format. But this is only because of the headers, and thus the
    effect diminishes with larger files.

3.  On a local (non-networked) system, cpio is more efficient at
    copying directory trees than tar. However, this is really an
    implementation issue.

There are several advantages to the current tar-based format as
specified in §10.1:

1.  There are no byte- or word-swapping issues caused by the format,
    since all the header values are ASCII byte streams.

2.  There are no inode numbers recorded, and file types are kept in
    symbolic form, so the format is less implementation-specific than
    cpio format.

3.  Historical tar format is the most widely used, as discussed in
    item 6 above, despite apparent assertions to the contrary.

4.  The format specified in §10.1 is upward-compatible with tar format.
    Old tar archives can be extracted by a program that implements
    §10.1. Archives using some of the extensions of §10.1 can be
    extracted with old (Version 7) tar programs, although symbolic
    links will not be extracted and contiguous files will not be
    handled properly (cpio does not handle these capabilities at all).
    Files with very long names will not be handled properly (cpio does
    no better at this). All tar implementations are compatible to this
    extent.

5.  The /usr/group working group and P1003.1 have already done the work
    required to add optional extensions (such as symbolic links, long
    file names, and contiguous files) that are needed on many
    historical implementations and that cpio format lacks.

6.  The format is extensible for future facilities.

7.  There is a public domain implementation of the format of §10.1.
    That implementation provided feedback which led to improvements in
    the current specification, and has been in use for years in
    transferring data with licensed tar implementations.

8.  Many people prefer the user interface of the cpio program to that
    of the tar program, because the former can accept a list of
    pathnames to archive on standard input while the latter takes them
    as arguments, limiting the length of the list. However, the
    above-mentioned public domain implementation of tar accepts
    pathnames on standard input, and at least one vendor sells a
    version of tar that can do this. Diffs to standard tar to add an
    option to accept pathnames on standard input when creating an
    archive have also been posted to USENET. The user interface is, in
    any case, irrelevant to P1003.1.

Disadvantages of tar format:

1.  If an attempt is made to extract only the second of a pair of
    hard-linked files, the tar program will attempt to link the second
    file to the nonexistent first file, and nothing will be extracted.
    Although a sufficiently clever implementation could avoid this, the
    problem can be considered to be in the archive format.

There are some problems that neither tar nor cpio handles well.

1.  It would be preferable to allow file names longer than PATH_MAX (at
    least 255), the limit in the POSIX format, which already exceeds
    the 128 that cpio permits and the 100 that historical tar allows.
    The POSIX limit is nonetheless adequate for most cases.

2.  An option to prevent crossing mount points would be useful for
    backups. However, this appears to be more of an implementation
    issue than a format issue, especially considering that there are
    options to the find program in 4.2BSD, SunOS 3.2, and System V
    Release 3.0 that take care of this.

3.  The default block size in many tar implementations is too large for
    some tape controllers to read (the 3B20 has this problem). This is
    not a problem with the interchange format, however.

There is nothing that the proposed cpio can handle that the tar-based
format already in POSIX §10.1 cannot handle; in fact, the former is
less capable. If cpio format were augmented to handle missing
capabilities, it would be subject to the same objection now aimed at
the format given in §10.1: that it is not identical to an existing
format.

There is no advantage in replacing the current tar-based format of
§10.1 with cpio format. There is also no advantage in adding cpio
format, because two standards are not as good as a single standard.

Some have recommended removing §10.1 from POSIX altogether, perhaps
with a recommendation for P1003.2 to pick up the idea. While I believe
that that would be preferable to adding cpio format, whether or not tar
format remains, I recommend leaving §10.1 as it is, because

o   The inclusion of an archive/interchange file format is in agreement
    with the purpose of POSIX to promote portability of application
    programs across interface implementations. Some format will be
    used. It is to the advantage of the users of the standard for there
    to be a standard format.

o   The de facto standard is tar format. The current §10.1 standardizes
    that, and provides upward-compatible extensions in areas that were
    previously lacking.

The Archive/Interchange File Format should be left as it is.

Thank you,

John S. Quarterman

Volume-Number: Volume 11, Number 67