Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/17/84; site think.ARPA
Path: utzoo!watmath!clyde!burl!ulysses!allegra!mit-eddie!think!rose
From: rose@think.ARPA (John Rose)
Newsgroups: net.lang.c
Subject: A last word on arrays? Hah! (LONG)
Message-ID: <4992@think.ARPA>
Date: Wed, 16-Apr-86 13:10:42 EST
Article-I.D.: think.4992
Posted: Wed Apr 16 13:10:42 1986
Date-Received: Fri, 18-Apr-86 05:57:38 EST
Reply-To: rose@godot.UUCP (John Rose)
Followup-To: net.lang.c
Distribution: net
Organization: Thinking Machines, Cambridge, MA
Lines: 429
Keywords: array, reference, parameter, struct

[ Is this a dead horse or a sleeping dog? ]

Subject:  A last word on arrays? Hah! (LONG)
Summary:
  The inferior status of arrays, and fixes thereto, is ruminated upon.
  Four proposals are presented to the ANSI C committee.
  2000+ lines of array discussion are briefly summarized.

Proposals to ANSI:
	(1) Allow `&' applied to arrays.  Semantics must agree
	    with the implicit addressing which occurs when an array
	    is the first element of a larger array, and the larger
	    array name is used as a pointer.

	    (This is also Kenneth Almquist's proposal, I think.)

	(2) Allow `=' applied to arrays.  Details & rationale below.

	(3) Interpret current C practice of converting array arguments
	    to pointers as passing them by reference; make sizeof()
	    work consistently.

	(4) Allow arrays to be passed by value.  Details & rationale
	    below.

But first...

Let's get two things very straight:  An array is *not* the same as a
pointer to its first element, and the latter is *not* a pointer to the
array.  These three ideas are all distinct, but in recent mailings they
have been confused in every possible combination:

  array name == 1st elt address:

    From: throopw@dg_rtp.UUCP (Wayne Throop)
    The notational convenience is that an array name "means" the address
    of the first element of the array.

    From: gwyn@BRL.ARPA (VLD/VMB)
    ... the NAME of an array (which means the same as the
    address of the first element of the array)

  array name == array address

    From: mike@peregrine.UUCP (Mike Wexler)
    It just happens that when you put "a" in your code the c compiler
    interprets it as &a.

  array name == array address && array name == 1st elt address:

    From: rh@cs.paisley.ac.uk (Robert Hamilton)
    In the same way (int a[10]) sits in memory somewhere and has an address a
    so you can say int *b=a which also happens to be &a[0].
  
  array address == 1st elt address
    From: PCC-based C compilers
    "tst.c", line 7: warning: & before array or function: ignored

    From: levy@ttrdc.UUCP (Daniel R. Levy) 
    Fortrans implement array references pretty much the same way that C
    does:  by reference to the address of the first element in the array,
    with offsets computed according to the subscripts and then automatically
    dereferenced for use.

Proofs of the existence and distinctness of those entities are easy
to get.  In a second, I'll give a machine-readable proof.  Lucid and
convincing arguments have been given by:

  jsdy@hadron.UUCP (Joseph S. D. Yao) in <322@hadron.UUCP>
  ka@hropus.UUCP (Kenneth Almquist) in <378@hropus.UUCP>

The main stumbling blocks are that
  (1) Arrays per se *almost* always turn into pointers to their first
      elements.  So people are used to thinking of them interchangeably,
      the more so because pointers can be used with what's called ``array''
      syntax.  BUT:  Kenneth Almquist notes that this implicit
      conversion doesn't happen with sizeof, and it could also be
      suppressed with `&'.
  (2) There is a tendency among C programmers to take the type structure
      for granted and think in implementational terms, machine words
      and all that.  So, many reason, since an array name compiles
      into a constant of some sort, that's ``all it is''.  Given
      "short a[2][5]", the addresses of a[0] and a[0][0] evaluate
      to the same machine pointer.  BUT:  Their types (and sizes) are
      different; if they weren't different, subscripting wouldn't work.

Try this thought experiment:  Suppose all three entities mentioned
above were implicitly interconverted.  Then an array per se gets
turned into the address of its 1st elt.  If that element is in
turn an array, this address can turn to that element per se,
and the cycle repeats, until the array is seen to ``really mean''
the address of the first non-array in the multidimensional aggregate.
And you've flushed all the type information which would allow indexing
into that aggregate.

Here's code containing all three kinds of quantity.  Run it through your
C compiler.  It works for me on my BSD 4.2 Vax.  I claim that it proves
that PCC-based C compilers distinguish between arrays, their addresses,
and addresses of their elements.

	short two_A[2][5];
		/* two_A[0] is an array-of-short */
	#define A two_A[0] /* short A[5] */
	short (*Ap)[5] = two_A+0; /* == &A */
		/* Ap is a pointer-to-array-of-short */
	short *Ip = A;
		/* Ip is a pointer-to-1st-elt-of-array-of-short */ 
		/* ... a.k.a. pointer-to-short */
	
	main ()
	{
	#define POBJ(x) printf("x as object:  sizeof=%d, sizeof 1st elt=%d\n", \
		sizeof(x), sizeof(*x))
	#define PPTR(x) printf("x as pointer:  addr=%lx, sizeof=%d, incr=%d\n", \
		(long)x, sizeof(x+1), (char*)(x+1) - (char*)(x))
	#define P(x) POBJ(x); PPTR(x)
	P(two_A);
	P(A);
	P(Ap);
	P(Ip);
	}

Joseph Yao also ran similar code on several machines, with the same
conclusion.

Objections of the form ``But why would you ever want to...'' continually
surface in discussions of language-design, and this is no exception.
(((Flame on:  I personally ignore such objections.  They are invariably
based on thoughtless acceptance of existing restrictions.  The language
designer's prime function is not to second-guess details of future
applications, but to provide a symmetrical, easily grasped, powerful
medium for expressing currently-feasible algorithms.  His inquiry
should not be into what current programmers are likely to do, as much as
what current hardware *can* do.  Experience shows that attention to
clean overall design makes a language usable in unforeseen ways.
:Flame off.)))  It has been noted that the address of an array
can be taken implicitly, when that array is an element of a parent
array.  Gregory Smith gave this example:
  Article <2377@utcsri.UUCP> greg@utcsri.UUCP:
	char (*char_ar_ptr)[80];
	char line[80];	/* this is what I want to point it at */
	char_ar_ptr = &line;		/* this should work ... */

To which Chris Torek responded:
  Article <530@umcp-cs.UUCP> chris@umcp-cs.UUCP:
	char (*char_ar_ptr)[80];
	#define N	2	/* e.g. */
	char lines[N][80];
	char_ar_ptr = lines;
  If you only have one line, why do you need to point at a set of lines?

Chris's reasoning seemed to be that the only time you needed to take the
address of an array was when it was inside a bigger array:  Since you
can already take the address implicitly ("lines+1 == /*illegal:*/ &lines[1]")
what's the fuss about?  "Why would you ever want to take the address of
an object which wasn't an array member?"  <=== There it is!  Stomp on it!!
Greg, you needn't be at all embarrassed to admit wanting to take the
address of an object that's not an array element.  Many C programmers
do it.  ("int x, y[5]; foo() { bar(&x), bar(&y[2]); }" :-)

  Article <2439@utcsri.UUCP> greg@utcsri.UUCP
  Good point ... I actually hadn't thought of it exactly that way.
  Two answers come to mind, though:
	  (1) Because it is there. :-)
	  (2) Suppose the array 'lines' is actually more than two:
		  char lines[10][80];
To which I add
	  (3) Since when has taking the address of a lone object
	      become questionable in C??
But given my previous flame, I think that Greg's number (1) alone is
plenty cogent enough to answer the ``Why would you ever want to'' kind
of question.

Greg's final comment is eloquent; please allow a requote:

  So &a[b] is not equivalent to a+b here. What a shame. I like using
  &a[b] in general, and ALWAYS use it when 'a' is an array as opposed to a
  pointer. Too bad I can't... especially when there's no good reason for not
  being able to.

And so, dear reader, the address of an array has a perfectly reasonable
type and value.  But tradition and fuzzy thinking prevent us from being
able to *express* that value, if the array is not an element of an
enclosing array.  (And even then, we must use "(parent_array+4)" and
not the otherwise equivalent "&parent_array[4]".)



I rest my case.  And now may I be so bold as to propose arrays as
lvalues?  By now, no one should be saying "Wait!  You can't assign
to a compile-time constant!"  An array's address is a compile-time
constant, but an array per se (unless it is const) is modifiable.

Does this break existing code?  No.  Consider this fragment:

	typedef int MYBUF[3];
	MYBUF a, b;
	a = b;

An array name can currently never appear on the LHS of an assignment.
Now, let's suppose that if one were to appear there, as "a" does,
it is not immediately converted to a pointer, but remains an object
per se, with mutable contents.  The usual conversion of "b" to
pointer is suppressed, and the obvious assignment is performed.

	int *ip;
	ip = b;

This works as before:  "b" suffers the usual conversion.  The
conversion depends on how the expression is going to be used.

"Oh no", you gasp.  "This is foreign to C; how will programmers learn
it?"  Well, there's already another set of usual conversions, and
they too are context dependent in exactly the same way.  For example:
	float a, b; double c;
	a = b;
	a = b+0;
	a = b+1;
	c = b;

The first assignment does not convert "b" to double, the second and
third do (on my compiler), and the fourth always does.

Model it this way:  Most operators don't accept floats per se or arrays
per se, and when they are handed one, they apply the "usual
conversions".  But some operators *can* accept the objects unconverted.
Since assignment and `&' currently don't accept *either* arrays
per se or their converted pointers, it is safe to define them
both to operate on the array per se.  (Objectors who assume
the conversion and derive an absurdity are falling prey to
the number (2) stumbling block I mentioned above.)

Am I adding features by allowing `&' and `=' to take array arguments?
No, I'm actually removing them.  The set of types which may be assigned
is currently "all objects with data storage allocated to them, *except*
arrays".  To remove that restriction is to remove a bug/feature of the
language, and make it easier, not harder, to understand C.  Consider
these identities:
	p+n == &p[n]
	(char *)(&x + 1) - (char *)&x == sizeof(x)
They hold true whenever p points to a data object.  "Oops, unless
*p or x is an array."  <== That's a feature, which is worth removing.

Case in point:  Joseph Yao said, in a thoughtless moment:
  Pointers, in C, only point to atomic or aggregate (structure/union)
  objects.
His phrase "atomic or aggregate (structure/union)" attempts to delimit
the set of first-class objects.  Note the clumsiness due to the
exclusion of array-type aggregates.  That mirrors the normal C
programmer's attempt to grapple with the array misfeature.

"But why would you ever want to assign arrays?  Isn't that inefficient?"
That's my business.  I want a language which allows me to express things
the machine can do.  Enclosing the array in a trivial struct will allow
me to assign it around, so why not dispense with the kludge?  If I want
a ``safe'' language, which only allows me to express ``good'' code, I'll
program in Pascal :-{u).  Also, an application has recently been
mentioned for arrays as lvalues:  dan@BBN-PROPHET.ARPA (Dan Franklin)
in <2050@brl-smoke.ARPA> gave plausible code in which jmpbuf objects
need to get assigned.  Many (not all) systems implement jmpbuf as an
array.



Now to questions of arrays as formal parameters.

If arrays are to be assigned (whether by a language primitive or by
bcopy), you need to know their size.  Consider this code:

	foo(x)  int x[5];
	{	return sizeof x;  }

This should return 5*sizeof(int), not sizeof(int*).  Such array
arguments should be treated consistently as *references*, not
*pointers*.  Consult the C++ spec for the distinction.  Even if
references are not included in the ANSI language, the idea is quite
useful.  (In the current net discussion, there has been a lot of mention
of "references", usually synonymous with pointers.  But they are not the
same, if your language has first-class pointers.  Assigning to a
reference bashes the pointed-to object, not the pointer.  Taking the
address of the reference returns the value of the reference, retyped as
a pointer.  Taking sizeof a reference gives the size of the pointed-to
object.)

This is my third proposal to the ANSI committee:  Use Bjarne
Stroustrup's semantics of references for array formal params.  This
would not break code which attempted to use the reference as a constant
pointer, since we have (ref array int) -> (array int) -> (ptr int) all
implicitly.  But it would break code (such as some versions of strcpy)
which attempts to increment the pointer value.  There is a way to get
around this:

	foo(x)  int x[5];
	{	x = x + 1;  }

The expression "x + 1" is a pointer.  The expression on the LHS is a
reference to an array; in all cases this should immediately dereference
to mean the array pointed to by x.  I suggest disabling the dereference
operation, just for formal params, and emitting a warning ("reference
parameter assigned to").  This gets backward compatibility, at the
price of dirtying Bjarne's semantics a bit.  One last objection,
which is fair:  If instead of "x + 1" the RHS were an array, then
the assignment should be of arrays, not pointers.  This could break
existing code.  Perhaps a warning could be emitted:  "using array
assignment on old-style parameter".



By the way, some people have suggested removing structures-as-values
from the language, and having a structure-valued expression immediately
take its own address (except, presumably, in the case of sizeof):

  From: bet@ecsvax.UUCP (Bennett E. Todd III) in <1294@ecsvax.UUCP>
  [Aggregates should not be] "automagically" copied around by the
  compiler; instances of their names as rvalues without explicit
  dereferencing should be converted to pointers in BOTH cases. Note that
  (re)establishing the notion that a structure name evaluates to a
  constant pointer to that structure removes the distinction between the
  "." and "->" operators for structure member dereference, which removes a
  popular source of subtle portability bugs.

This suggestion has been answered by dan@BBN-PROPHET.ARPA (Dan Franklin)
in <2051@brl-smoke.ARPA>, by showing examples (from the Blit software)
of small structs which make great lvalues.  As/if C++ comes into greater
use, the usefulness of this type of programming will become more and
more apparent, and the ``machine register'' school of thought will
perhaps lose the ascendency:

  rbj@icst-cmr (Root Boy Jim) in <2375@brl-smoke.ARPA>
  Fisrt, I disagree that struxure (or array) assignments are A Good Thing.
  I much prefer the model of limiting primitive data types to values that
  can be held in a register on an idealized machine. So much for the
  `religious' part.

Just `religion' on my part.  But there is a serious logical error in
Bennett's posting:  He suggests taking the address of a structure based
on the `precedent' of arrays doing the same thing.  But wait.  We've
seen that before:  An array does not convert to the address of itself,
but of its first element.  So there is really no precedent for
structures.  (Unless we converted them to pointers to *their* first
elements? :-)

However, the `.'/`->' distinction can be annoying.  Suggestions to
merge the two have a valid motivation.  Let me pass on a technique
I stumbled upon, which has saved me lots of debugging time, in
some applications.  (I wouldn't want this to be a language feature,
though.)  Always declare your structures as 1-element arrays.  Wherever
they occur--even as sub-structures.  Then, dispense with dot `.',
and always use arrows.  Voila!  Arrow works everywhere, whether
your LHS names a ``real'' structure or a pointer.  I actually
had a program with hundreds of structure qualifications, about
50/50 split between arrow and dot, which was immediately cleaned
up with this trick.  Downside:  Since arrays are passed by reference,
your structures are too.  But this might be what you want (it was for
me in that particular application).



Now to my fourth and last proposal for ANSI:  Allow arrays per se to be
passed as arguments, and returned from functions, by value.  This is
good for the same reasons as array assignment is:  Implementors of
data abstractions have more choices of the implementation type.
(Please, no ``why would you evers''.  If you place an arbitrary
restriction on programmers--``no assignment of arrays''--they'll
just find a messy workaround, such as a trivial enclosing structure,
and write unclear source code that produces the very same object
code you tried to make impossible.)

This proposal gets close to breaking existing code, since when an array
is passed to a function, it suffers the "usual conversion" to a pointer.
But let's take our cue from the float/double controversy:  Suppress the
"usual conversion" when a prototype is in scope.  For example:

	typedef int MYBUF[3];
	MYBUF a, b;

	MYBUF process_buf(MYBUF c)
	{	/* grovel over c, which is private copy */
		return c;
	}

This is backward compatible, since no old code uses function prototypes.
If you want arrays to be passed by reference, use the old-style function
declaration syntax (or introduce explicit declarations for reference
types!):

	mung_buf_by_reference(c)  MYBUF c;
	{	if (c[2])  --c[2];  }

Calls to functions with no prototypes use the `...' conversions,
which convert an array to a pointer to its first element.  When
the pointer gets to the function, which must be declared in the
old style, it is interpreted (transparently) as a reference, as
per my third proposal above.

The call by reference can also be partially expressed in the new
prototype syntax: 

	mung_buf_by_reference(int * c)
	{	if (c[2])  --c[2];  }

In the latter case, you lose the ``convenience'' of call-by-reference;
sizeof(c) is just sizeof(int *).  Oh no!  No more call-by-reference in
C?  What a loss!--the loss of an extraneous feature:  Only arrays ever
had call-by-reference in the first place.  In other words, the
call-by-reference, so carefully cleaned up in my third proposal, is a
``feature'' of the old-style declarations, which will someday wither
away.  (By then, C will have reference types, and not just for
array-params.)

To summarize:  I've asked for the addition of 4 new features to ANSI C,
which make arrays into first class objects.  I've argued that
programmers can be trusted to use good judgements, and that language
designers should refrain from making them in advance.  I've shown that
the new features actually fill gaps in the old language, and the
resulting language has *fewer* features, since there are fewer
exceptions.  (Next, we go after bitfields.)

			-- John Rose
-- 
----------------------------------------------------------
John R. Rose		     Thinking Machines Corporation
245 First St., Cambridge, MA  02142    (617) 876-1111 X270
rose@think.arpa				  ihnp4!think!rose