Path: utzoo!utgpu!water!watmath!clyde!rutgers!mit-eddie!bloom-beacon!think!ames!oliveb!sun!gorodish!guy
From: guy@gorodish.Sun.COM (Guy Harris)
Newsgroups: comp.arch
Subject: Re: RISC data alignment
Message-ID: <39815@sun.uucp>
Date: 24 Jan 88 01:04:00 GMT
References: <2635@calmasd.GE.COM> <3246@psuvax1.psu.edu>
Sender: news@sun.uucp
Lines: 130

> >If this is true, then it would seem to also be true that a C structure
> >could have different lengths, depending on whether it was compiled 
> >on a RISC or non-RISC machine.

True, but not necessarily for reasons having to do with RISC vs. non-RISC:

	1) I know of one CISC that requires 4-byte alignment of 4-byte
	   quantities, and 2-byte alignment of 2-byte quantities: the WE32100.

	2) While the VAX does not impose any alignment restrictions, I think
	   most, if not all, VAX implementations run faster if 4-byte
	   quantities are aligned on 4-byte boundaries and 2-byte quantities
	   are aligned on 2-byte boundaries.

As such, both the VAX UNIX C compiler and the WE32K C compiler, and probably
the VAX/VMS C compiler, align 4-byte quantities in structures on 4-byte
boundaries and 2-byte quantities in structures on 2-byte boundaries.  The
structure as a whole is aligned on the boundary required by its most strictly
aligned member.  These are the same rules used by the SPARC C compiler;
however, on the SPARC *8*-byte quantities (e.g., double-precision floating
point numbers) must be aligned on *8*-byte boundaries.  Neither the WE32K nor
the VAX imposes this restriction, so their compilers align 8-byte quantities
only on 4-byte boundaries.
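
To see what a particular compiler actually does, you can ask it.  The
following is only a sketch: "struct sample" and the function names are made up
for illustration, and the numbers returned are whatever the compiler at hand
chooses, which is exactly the point.

```c
#include <stddef.h>

/* Hypothetical structure, invented for illustration only. */
struct sample {
    char   tag;
    int    count;
    short  flags;
    double value;
};

/* Each function reports what *this* compiler chose; another compiler
   may legitimately return different numbers for the same source. */
size_t sample_size(void)         { return sizeof(struct sample); }
size_t sample_count_offset(void) { return offsetof(struct sample, count); }
size_t sample_flags_offset(void) { return offsetof(struct sample, flags); }
size_t sample_value_offset(void) { return offsetof(struct sample, value); }
```

Assuming a 4-byte "int" and an 8-byte "double", the SPARC rules described
above would put "value" at offset 16 and make the structure 24 bytes long; the
VAX and WE32K rules would put it at offset 12 and make the structure 20 bytes
long.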

However, there are machines with different alignment restrictions, and C
compilers with different alignment rules:

	1) The MC68010 requires 2-byte quantities to be aligned on 2-byte
	   boundaries, but does not require 4-byte quantities to be aligned on
	   4-byte boundaries.  Most of the C compilers for UNIX 68K
	   implementations put 4-byte quantities only on 2-byte boundaries, and
	   always align structures on 2-byte boundaries even if no member
	   requires this alignment.  These rules are often propagated to the
	   68020, which imposes no alignment restrictions.

	2) The CCI Power 6/32 C compiler, last time I dealt with it, always
	   aligns structures on at least 4-byte boundaries.
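
The rule sets above can be modelled with a few lines of C.  This is a
simplified sketch, not any real compiler's algorithm: it assumes every member
is a scalar whose size is a power of two, that a member's alignment is the
smaller of its size and a per-compiler cap, and that the structure is padded
out to its own alignment.  "struct layout_rules", "round_up", and
"struct_size" are names I made up.

```c
#include <stddef.h>

/* One hypothetical compiler's rules. */
struct layout_rules {
    size_t max_align;         /* strictest alignment ever applied to a member */
    size_t min_struct_align;  /* alignment forced on every structure */
};

/* Round n up to the next multiple of align (align must be nonzero). */
size_t round_up(size_t n, size_t align)
{
    return (n + align - 1) / align * align;
}

/* Compute member offsets and the total size of a structure whose members
   have the given sizes, under the given rules.  offsets[] receives one
   entry per member. */
size_t struct_size(const size_t *member_sizes, size_t n,
                   struct layout_rules rules, size_t *offsets)
{
    size_t offset = 0;
    size_t struct_align = rules.min_struct_align;

    for (size_t i = 0; i < n; i++) {
        size_t align = member_sizes[i] < rules.max_align
                     ? member_sizes[i] : rules.max_align;
        offset = round_up(offset, align);   /* insert padding if needed */
        offsets[i] = offset;
        offset += member_sizes[i];
        if (align > struct_align)
            struct_align = align;
    }
    return round_up(offset, struct_align);  /* trailing padding */
}
```

For members of sizes 1, 4, 2, and 8 bytes, rules of {2, 2} (the usual 68K
rules) give a 16-byte structure, {4, 1} (VAX and WE32K) give 20 bytes, and
{8, 1} (SPARC) give 24 bytes.  A structure holding a single "char" comes out
as 2 bytes under the 68K rules and 1 byte under the others.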

> >Further, it would seem that if that C structure were written out to a file,
> >it could only be read properly by a machine of the same type as that which
> >wrote it.
> 
> This is exactly correct.

And not only that, it would still be true even if all C implementations imposed
the exact same alignment rules!  VAXes, National Semiconductor 32Ks, and Intel
80*86es address the bytes within a 2-byte or 4-byte quantity from bottom to
top; the least significant byte is byte 0.  These architectures are called
"little-endian".  IBM 360/370s, Motorola 68Ks, AT&T WE32Ks (except for the
WE32000), SPARCs, and CCI Power 6/32s address them from top to bottom; the
*most* significant byte is byte 0.  These architectures are called
"big-endian".  The WE32000, and, if I remember correctly, the MIPS chips, can
select which byte order to use, although I think all WE32000 implementations
use the "big-endian" byte order.
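
A program can find out which kind of machine it is running on by looking at
byte 0 of a known integer.  A minimal sketch, assuming 8-bit bytes and a
compiler that provides "uint32_t"; the function name is my own:

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if byte 0 of a 4-byte integer is its least significant
   byte (little-endian), 0 otherwise. */
int host_is_little_endian(void)
{
    uint32_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);  /* copy the byte at the lowest address */
    return first == 1;
}
```

Code that needs this test in order to read a file is usually a sign that the
file format was never pinned down; it is better to define the byte order of
the data than to ask about the byte order of the machine.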

Tapes, disks, and networks are usually byte-serial.  They generally do not
record (in the case of tapes and disks) or transmit (in the case of networks)
2-byte or 4-byte quantities in parallel.

This means that a sequence of *bytes* will, when copied via tape or disk or
transmitted over a network, from a big-endian to a little-endian machine,
appear the same.  If you put the character string "hi mom" on the tape, disk,
or wire, and send it to a machine with the opposite byte sex, that machine will
see "hi mom" (assuming, of course, that the hardware and/or software on both
ends uses the same character set).

However, if you put the number 127 on the tape, disk, or wire as a 4-byte
integer, and send it between two machines with different byte sexes, the number
will appear to be 2130706432 on the other machine.  A machine will generally
write a 4-byte integer on tape or disk or send it over the wire by putting the
byte with address 0, then the byte with address 1, then 2, then 3.  This means
that a little-endian machine will put out a byte with the value 127, and then 3
bytes with the value 0.  A big-endian machine will put out 3 bytes with the
value 0 and then a byte with the value 127.  Either way, the receiving machine
with the opposite byte sex assembles the bytes the other way around: a
big-endian machine reading the little-endian bytes will put the 127 in the
*most*-significant byte of the integer and put the zeroes in the lower three
bytes.
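
The arithmetic can be checked with a short sketch.  The two functions below
(names mine) assemble the same four bytes off the tape, disk, or wire the way
a little-endian and a big-endian machine would; 8-bit bytes are assumed.

```c
#include <stdint.h>

/* Byte 0 is the least significant byte. */
uint32_t read_as_little_endian(const unsigned char b[4])
{
    return  (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Byte 0 is the most significant byte. */
uint32_t read_as_big_endian(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24)
         | ((uint32_t)b[1] << 16)
         | ((uint32_t)b[2] << 8)
         |  (uint32_t)b[3];
}
```

Given the bytes 127, 0, 0, 0 (what a little-endian machine writes for 127),
read_as_little_endian() returns 127 and read_as_big_endian() returns
127 * 2^24 = 2130706432.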

Furthermore, floating-point formats differ in ways other than their byte order.
Most of the architectures listed above use the IEEE floating-point format
(either directly or in their floating-point coprocessors); however, neither the
IBM 360/370 nor the VAX does, and I don't think the Power 6/32 does either.

And, on top of that, the sizes of the C data types are not guaranteed to be the
same.  "int" is generally 4 bytes on the 360/370, VAX, the NS32K, WE32K, SPARC,
and MIPS architectures.  It may be 2 or 4 bytes on the 80*86 and Motorola 68K
architectures, depending on the implementation.  It may be *8* bytes on a
supercomputer.  It may be *3* bytes on a 24-bit machine.  On top of this,
there's not even a guarantee that a byte is 8 bits, or that an "int" is 16 or
32 bits; there exist at least two C implementations on 36-bit machines, one of
which even runs UNIX.
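
None of these sizes need to be guessed; the implementation will tell you.  A
small sketch using <limits.h> (the function names are mine):

```c
#include <limits.h>
#include <stddef.h>

/* Number of bits in a "char" on this implementation; at least 8,
   but not necessarily exactly 8. */
size_t bits_per_byte(void)
{
    return CHAR_BIT;
}

/* Number of bits of storage occupied by an "int", counting any
   padding bits the implementation may have. */
size_t int_storage_bits(void)
{
    return sizeof(int) * CHAR_BIT;
}
```

Code that writes "sizeof(int)" bytes to a file and expects some other machine
to read back "sizeof(int)" bytes is relying on both of these values being the
same at both ends.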

In short, the statement made by Scott Schwartz in the summary line:

	you had better use XDR or something similar

is 10,000% true, as is the statement in the original article:

	Further, it would seem that if that C structure were written out to a
	file, it could only be read properly by a machine of the same type as
	that which wrote it.

There are exceptions to this statement: a structure written out on an Intel
386-based machine *might* be readable directly on a NS32K-based machine, for
instance - although I don't know that their alignment rules or floating-point
formats are the same (both are, I think, IEEE, but I don't know that the byte
order in *floating*-point numbers is the same).

These exceptions are rare, and as indicated I don't even know which of them
really exist.  If you want to write data to a file or put it out on the network
so that some other machine of a different type can read it, *don't* just dump a
raw structure; use the Sun XDR library, or roll your own routines that put
things out in a standard byte order with a standard floating point format,
standard alignment, standard data sizes, etc.
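
As an illustration of the roll-your-own approach, here is a minimal sketch
that writes a record one field at a time, most significant byte first, with
fixed field sizes and no padding.  It is *not* the XDR library's interface or
its encoding; "struct record", the 6-byte layout, and all the function names
are invented for the example, and 8-bit bytes are assumed.

```c
#include <stdint.h>

/* Hypothetical in-memory record; its layout may differ per compiler. */
struct record {
    uint16_t id;
    uint32_t count;
};

/* External form: 2 bytes of id, then 4 bytes of count, big-endian. */
#define RECORD_WIRE_SIZE 6

static void put_u16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v >> 8);
    p[1] = (unsigned char)(v & 0xff);
}

static void put_u32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)((v >> 16) & 0xff);
    p[2] = (unsigned char)((v >> 8) & 0xff);
    p[3] = (unsigned char)(v & 0xff);
}

static uint16_t get_u16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t get_u32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Convert between the in-memory record and its external form. */
void record_encode(const struct record *r, unsigned char out[RECORD_WIRE_SIZE])
{
    put_u16(out, r->id);
    put_u32(out + 2, r->count);
}

void record_decode(struct record *r, const unsigned char in[RECORD_WIRE_SIZE])
{
    r->id = get_u16(in);
    r->count = get_u32(in + 2);
}
```

Because the bytes are produced by shifting and masking rather than by copying
memory, the output is the same regardless of the byte order or the structure
padding of the machine that runs the code.  Floating-point fields would need
a similar agreed-upon external format.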

And as for the particular question:

> >Does such incompatibilty truly exist?  If I create a file on a Sun/4
> >will I be able to read it on a Sun/3?

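As Mr. Schwartz has already pointed out, the answer to the first question is
"yes": a raw structure written on a Sun-4 will not, in general, read back
correctly on a Sun-3.  The Sun-3 uses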
the MC68020 chip, and uses the alignment rules that most 68K UNIX C
implementations use:  structures are always aligned on at least a 2-byte
boundary, and most quantities are only aligned on 2-byte boundaries.  The Sun-4
uses the SPARC chip, and uses the rules listed above for that chip: structures
may be aligned on 1-byte boundaries if they contain nothing requiring a
stricter alignment, 4-byte quantities are aligned on 4-byte boundaries, and
8-byte quantities are aligned on 8-byte boundaries.
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com