Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/17/84; site think.ARPA
Path: utzoo!watmath!clyde!burl!ulysses!allegra!mit-eddie!think!rose
From: rose@think.ARPA (John Rose)
Newsgroups: net.lang.c
Subject: A last word on arrays? Hah! (LONG)
Message-ID: <4992@think.ARPA>
Date: Wed, 16-Apr-86 13:10:42 EST
Article-I.D.: think.4992
Posted: Wed Apr 16 13:10:42 1986
Date-Received: Fri, 18-Apr-86 05:57:38 EST
Reply-To: rose@godot.UUCP (John Rose)
Followup-To: net.lang.c
Distribution: net
Organization: Thinking Machines, Cambridge, MA
Lines: 429
Keywords: array, reference, parameter, struct

[ Is this a dead horse or a sleeping dog? ]

Subject: A last word on arrays? Hah! (LONG)
Summary:
    The inferior status of arrays, and fixes thereto, is ruminated upon.
    Four proposals are presented to the ANSI C committee.
    2000+ lines of array discussion are briefly summarized.

Proposals to ANSI:
    (1) Allow `&' applied to arrays.  Semantics must agree with the
        implicit addressing which occurs when an array is the first
        element of a larger array, and the larger array name is used
        as a pointer.  (This is also Kenneth Almquist's proposal,
        I think.)
    (2) Allow `=' applied to arrays.  Details & rationale below.
    (3) Interpret current C practice of converting array arguments to
        pointers as passing them by reference; make sizeof() work
        consistently.
    (4) Allow arrays to be passed by value.  Details & rationale below.

But first...

Let's get two things very straight: An array is *not* the same as a
pointer to its first element, and the latter is *not* a pointer to the
array.  These three ideas are all distinct, but in recent mailings
they have been confused in every possible combination:

array name == 1st elt address:
    From: throopw@dg_rtp.UUCP (Wayne Throop)
        The notational convenience is that an array name "means" the
        address of the first element of the array.
    From: gwyn@BRL.ARPA (VLD/VMB)
        ...
        the NAME of an array (which means the same as the address of
        the first element of the array)

array name == array address:
    From: mike@peregrine.UUCP (Mike Wexler)
        It just happens that when you put "a" in your code the c
        compiler interprets it as &a.

array name == array address && array name == 1st elt address:
    From: rh@cs.paisley.ac.uk (Robert Hamilton)
        In the same way (int a[10]) sits in memory somewhere and has
        an address a so you can say int *b=a which also happens to
        be &a[0].

array address == 1st elt address:
    From: PCC-based C compilers
        "tst.c", line 7: warning: & before array or function: ignored
    From: levy@ttrdc.UUCP (Daniel R. Levy)
        Fortrans implement array references pretty much the same way
        that C does: by reference to the address of the first element
        in the array, with offsets computed according to the subscripts
        and then automatically dereferenced for use.

Proofs of the existence and distinctness of those entities are easy to
get.  In a second, I'll give a machine-readable proof.  Lucid and
convincing arguments have been given by:
    jsdy@hadron.UUCP (Joseph S. D. Yao) in <322@hadron.UUCP>
    ka@hropus.UUCP (Kenneth Almquist) in <378@hropus.UUCP>

The main stumbling blocks are that

(1) Arrays per se *almost* always turn into pointers to their first
    elements.  So people are used to thinking of them interchangeably,
    the more so because pointers can be used with what's called
    ``array'' syntax.
    BUT: Kenneth Almquist notes that this implicit conversion doesn't
    happen with sizeof, and it could also be suppressed with `&'.

(2) There is a tendency among C programmers to take the type structure
    for granted and think in implementational terms, machine words and
    all that.  So, many reason, since an array name compiles into a
    constant of some sort, that's ``all it is''.  Given "short a[2][5]",
    the addresses of a[0] and a[0][0] evaluate to the same machine
    pointer.
    BUT: Their types (and sizes) are different; if they weren't
    different subscripting wouldn't work.
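Stumbling block (2) can be checked mechanically. The following is only a small sketch, not part of the original article: the function name `demo_same_address_different_type` is made up for illustration, and the asserts stand in for inspecting the values by hand. It uses the `short a[2][5]` from the text and shows two pointers that compare equal as machine addresses yet have different pointed-to sizes and different strides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical demo function; the name is not from the article. */
int demo_same_address_different_type(void)
{
    short a[2][5];

    /* `a` converts to a pointer to its first element, which is the
       array a[0]: the type is pointer-to-array-of-5-short. */
    short (*row)[5] = a;

    /* `a[0]` converts to a pointer to ITS first element, a[0][0]:
       the type is pointer-to-short. */
    short *elt = a[0];

    /* Same machine address ... */
    assert((void *)row == (void *)elt);

    /* ... but different pointed-to sizes ... */
    assert(sizeof *row == 5 * sizeof(short));
    assert(sizeof *elt == sizeof(short));

    /* ... and therefore different strides under pointer arithmetic,
       which is what makes two-dimensional subscripting work. */
    assert((char *)(row + 1) - (char *)row == (ptrdiff_t)(5 * sizeof(short)));
    assert((char *)(elt + 1) - (char *)elt == (ptrdiff_t)sizeof(short));

    return 1;
}
```

Nothing here needs `&' applied to an array, so the sketch stays within the language as the article describes it.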
Try this thought experiment: Suppose all three entities mentioned
above were implicitly interconverted.  Then an array per se gets
turned into the address of its 1st elt.  If that element is in turn an
array, this address can turn to that element per se, and the cycle
repeats, until the array is seen to ``really mean'' the address of the
first non-array in the multidimensional aggregate.  And you've flushed
all the type information which would allow indexing into that
aggregate.

Here's code containing all three kinds of quantity.  Run it through
your C compiler.  It works for me on my BSD 4.2 Vax.  I claim that it
proves that PCC-based C compilers distinguish between arrays, their
addresses, and addresses of their elements.

    short two_A[2][5];
    /* two_A[0] is an array-of-short */
    #define A two_A[0]      /* short A[5] */

    short (*Ap)[5] = two_A+0;       /* == &A */
    /* Ap is a pointer-to-array-of-short */

    short *Ip = A;
    /* Ip is a pointer-to-1st-elt-of-array-of-short */
    /* ... a.k.a. pointer-to-short */

    main ()
    {
    #define POBJ(x) printf("x as object: sizeof=%d, sizeof 1st elt=%d\n", \
            sizeof(x), sizeof(*x))
    #define PPTR(x) printf("x as pointer: addr=%lx, sizeof=%d, incr=%d\n", \
            (long)x, sizeof(x+1), (char*)(x+1) - (char*)(x))
    #define P(x) POBJ(x); PPTR(x)
        P(two_A);
        P(A);
        P(Ap);
        P(Ip);
    }

Joseph Yao also ran similar code on several machines, with the same
conclusion.

Objections of the form ``But why would you ever want to...''
continually surface in discussions of language-design, and this is no
exception.

(((Flame on: I personally ignore such objections.  They are invariably
based on thoughtless acceptance of existing restrictions.  The language
designer's prime function is not to second-guess details of future
applications, but to provide a symmetrical, easily grasped, powerful
medium for expressing currently-feasible algorithms.  His inquiry
should not be into what current programmers are likely to do, as much
as what current hardware *can* do.
Experience shows that attention to clean overall design makes a
language usable in unforeseen ways.  :Flame off.)))

It has been noted that the address of an array can be taken
implicitly, when that array is an element of a parent array.
Gregory Smith gave this example:

    Article <2377@utcsri.UUCP> greg@utcsri.UUCP:
        char (*char_ar_ptr)[80];
        char line[80];  /* this is what I want to point it at */
        char_ar_ptr = &line;    /* this should work ... */

To which Chris Torek responded:

    Article <530@umcp-cs.UUCP> chris@umcp-cs.UUCP:
        char (*char_ar_ptr)[80];
        #define N 2 /* e.g. */
        char lines[N][80];
        char_ar_ptr = lines;
    If you only have one line, why do you need to point at a set
    of lines?

Chris's reasoning seemed to be that the only time you needed to take
the address of an array was when it was inside a bigger array: Since
you can already take the address implicitly
("lines+1 == /*illegal:*/ &lines[1]") what's the fuss about?

"Why would you ever want to take the address of an object which wasn't
an array member?"  <=== There it is!  Stomp on it!!

Greg, you needn't be at all embarrassed to admit wanting to take the
address of an object that's not an array element.  Many C programmers
do it.  ("int x, y[5]; foo() { bar(&x), bar(&y[2]); }" :-)

    Article <2439@utcsri.UUCP> greg@utcsri.UUCP
    Good point ... I actually hadn't thought of it exactly that way.
    Two answers come to mind, though:
    (1) Because it is there. :-)
    (2) Suppose the array 'lines' is actually more than two:
        char lines[10][80];

To which I add
    (3) Since when has taking the address of a lone object become
        questionable in C??

But given my previous flame, I think that Greg's number (1) alone is
plenty cogent enough to answer the ``Why would you ever want to'' kind
of question.

Greg's final comment is eloquent; please allow a requote:

    So &a[b] is not equivalent to a+b here.  What a shame.  I like
    using &a[b] in general, and ALWAYS use it when 'a' is an array as
    opposed to a pointer.  Too bad I can't...
    especially when there's no good reason for not being able to.

And so, dear reader, the address of an array has a perfectly
reasonable type and value.  But tradition and fuzzy thinking prevent us
from being able to *express* that value, if the array is not an
element of an enclosing array.  (And even then, we must use
"(parent_array+4)" and not the otherwise equivalent
"&parent_array[4]".)

I rest my case.

And now may I be so bold as to propose arrays as lvalues?

By now, no one should be saying "Wait!  You can't assign to a
compile-time constant!"  An array's address is a compile-time
constant, but an array per se (unless it is const) is modifiable.

Does this break existing code?  No.  Consider this fragment:

    typedef int MYBUF[3];
    MYBUF a, b;
    a = b;

An array name can currently never appear on the LHS of an assignment.
Now, let's suppose that if one were to appear there, as "a" does, it
is not immediately converted to a pointer, but remains an object per
se, with mutable contents.  The usual conversion of "b" to pointer is
suppressed, and the obvious assignment is performed.

    int *ip;
    ip = b;

This works as before: "b" suffers the usual conversion.  The
conversion depends on how the expression is going to be used.

"Oh no", you gasp.  "This is foreign to C; how will programmers learn
it?"  Well, there's already another set of usual conversions, and they
too are context dependent in exactly the same way.  For example:

    float a, b; double c;
    a = b;
    a = b+0;
    a = b+1;
    c = b;

The first assignment does not convert "b" to double, the second and
third do (on my compiler), and the fourth always does.

Model it this way: Most operators don't accept floats per se or arrays
per se, and when they are handed one, they apply the "usual
conversions".  But some operators *can* accept the objects unconverted.
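The proposed `a = b` has no direct spelling in C as it stands. A minimal sketch of "the obvious assignment", assuming the library routine memcpy as a stand-in (the article itself mentions bcopy for this job), looks like the following. The function name `demo_array_copy` and the sample values are invented for illustration. Note that the sketch leans on sizeof, one operator that already accepts the array unconverted.

```c
#include <assert.h>
#include <string.h>

typedef int MYBUF[3];

/* Hypothetical demo function; name and values are illustrative only. */
int demo_array_copy(void)
{
    MYBUF a = {0, 0, 0};
    MYBUF b = {7, 8, 9};
    int *ip;

    /* sizeof sees the array per se, not a pointer to its first element. */
    assert(sizeof a == 3 * sizeof(int));

    /* Stand-in for the proposed `a = b;`: copy the whole contents.
       Both arguments suffer the usual conversion to pointers here. */
    memcpy(a, b, sizeof a);

    /* The existing pointer assignment is unaffected by the proposal:
       "b" is converted to the address of its first element. */
    ip = b;
    assert(ip == &b[0]);

    return a[0] + a[1] + a[2];
}
```

After the copy, `a` holds the same three values as `b`, while `ip` merely points at `b[0]`; the two statements treat "b" differently depending on how the expression is used, which is the context dependence the text describes.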
Since assignment and `&' currently don't accept *either* arrays per se
or their converted pointers, it is safe to define them both to operate
on the array per se.  (Objectors who assume the conversion and derive
an absurdity are falling prey to the number (2) stumbling block I
mentioned above.)

Am I adding features by allowing `&' and `=' to take array arguments?
No, I'm actually removing them.  The set of types which may be
assigned is currently "all objects with data storage allocated to
them, *except* arrays".  To remove that restriction is to remove a
bug/feature of the language, and make it easier, not harder, to
understand C.

Consider these identities:

    p+n == &p[n]
    (char *)(&x + 1) - (char *)&x == sizeof(x)

They hold true whenever p points to a data object.  "Oops, unless *p
or x is an array."  <== That's a feature, which is worth removing.

Case in point: Joseph Yao said, in a thoughtless moment:

    Pointers, in C, only point to atomic or aggregate
    (structure/union) objects.

His phrase "atomic or aggregate (structure/union)" attempts to delimit
the set of first-class objects.  Note the clumsiness due to the
exclusion of array-type aggregates.  That mirrors the normal C
programmer's attempt to grapple with the array misfeature.

"But why would you ever want to assign arrays?  Isn't that
inefficient?"  That's my business.  I want a language which allows me
to express things the machine can do.  Enclosing the array in a trivial
struct will allow me to assign it, so why not dispense with the
kludge?  If I want a ``safe'' language, which only allows me to express
``good'' code, I'll program in Pascal :-{u).

Also, an application has recently been mentioned for arrays as
lvalues: dan@BBN-PROPHET.ARPA (Dan Franklin) in <2050@brl-smoke.ARPA>
gave plausible code in which jmpbuf objects need to get assigned.
Many (not all) systems implement jmpbuf as an array.

Now to questions of arrays as formal parameters.
If arrays are to be assigned (whether by a language primitive or by
bcopy), you need to know their size.  Consider this code:

    foo(x) int x[5]; { return sizeof x; }

This should return 5*sizeof(int), not sizeof(int*).  Such array
arguments should be treated consistently as *references*, not
*pointers*.  Consult the C++ spec for the distinction.  Even if
references are not included in the ANSI language, the idea is quite
useful.

(In the current net discussion, there has been a lot of mention of
"references", usually synonymous with pointers.  But they are not the
same, if your language has first-class pointers.  Assigning to a
reference bashes the pointed-to object, not the pointer.  Taking the
address of the reference returns the value of the reference, retyped
as a pointer.  Taking sizeof a reference gives the size of the
pointed-to object.)

This is my third proposal to the ANSI committee: Use Bjarne
Stroustrup's semantics of references for array formal params.

This would not break code which attempted to use the reference as a
constant pointer, since we have

    (ref array int) -> (array int) -> (ptr int)

all implicitly.  But it would break code (such as some versions of
strcpy) which attempts to increment the pointer value.  There is a way
to get around this:

    foo(x) int x[5]; { x = x + 1; }

The expression "x + 1" is a pointer.  The expression on the LHS is a
reference to an array; in all cases this should immediately
dereference to mean the array pointed to by x.  I suggest disabling the
dereference operation, just for formal params, and emitting a warning
("reference parameter assigned to").  This gets backward
compatibility, at the price of dirtying Bjarne's semantics a bit.

One last objection, which is fair: If instead of "x + 1" the RHS were
an array, then the assignment should be of arrays, not pointers.  This
could break existing code.  Perhaps a warning could be emitted: "using
array assignment on old-style parameter".
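To make the pointer-versus-reference distinction concrete, here is a hedged sketch of what an array parameter amounts to in C as it stands, next to the closest existing approximation of the proposed behavior: an explicit pointer to the whole array. The function names are hypothetical, and the call `&a` in the demo assumes a compiler that accepts `&' applied to an array, which is exactly what proposal (1) asks for.

```c
#include <assert.h>
#include <stddef.h>

/* Existing behavior: the parameter is silently a pointer-to-int, so the
   declared bound 5 is not part of its type.  Taking &x therefore yields
   a pointer-to-pointer-to-int. */
size_t size_seen_by_array_param(int x[5])
{
    int **px = &x;        /* legal only because x is really an int * */
    return sizeof *px;    /* size of a pointer, not of the array */
}

/* Approximation of the proposal: pass a pointer to the array per se.
   *x is then an array of 5 int, and sizeof reports the full size. */
size_t size_seen_by_ptr_to_array(int (*x)[5])
{
    return sizeof *x;
}

/* Hypothetical demo function tying the two together. */
int demo_param_sizes(void)
{
    int a[5] = {0, 0, 0, 0, 0};

    assert(size_seen_by_array_param(a) == sizeof(int *));
    assert(size_seen_by_ptr_to_array(&a) == 5 * sizeof(int));
    return 1;
}
```

Under the third proposal the first function would report the array's size without the caller writing `&a` or the callee writing `*x`; the second function shows the bookkeeping that reference semantics would hide.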
By the way, some people have suggested removing structures-as-values
from the language, and having a structure-valued expression
immediately take its own address (except, presumably, in the case of
sizeof):

    From: bet@ecsvax.UUCP (Bennett E. Todd III) in <1294@ecsvax.UUCP>
    [Aggregates should not be] "automagically" copied around by the
    compiler; instances of their names as rvalues without explicit
    dereferencing should be converted to pointers in BOTH cases.  Note
    that (re)establishing the notion that a structure name evaluates
    to a constant pointer to that structure removes the distinction
    between the "." and "->" operators for structure member
    dereference, which removes a popular source of subtle portability
    bugs.

This suggestion has been answered by dan@BBN-PROPHET.ARPA (Dan
Franklin) in <2051@brl-smoke.ARPA>, by showing examples (from the Blit
software) of small structs which make great lvalues.  As/if C++ comes
into greater use, the usefulness of this type of programming will
become more and more apparent, and the ``machine register'' school of
thought will perhaps lose the ascendency:

    rbj@icst-cmr (Root Boy Jim) in <2375@brl-smoke.ARPA>
    First, I disagree that struxure (or array) assignments are A Good
    Thing.  I much prefer the model of limiting primitive data types
    to values that can be held in a register on an idealized machine.

So much for the `religious' part.  Just `religion' on my part.  But
there is a serious logical error in Bennett's posting: He suggests
taking the address of a structure based on the `precedent' of arrays
doing the same thing.  But wait.  We've seen that before: An array does
not convert to the address of itself, but of its first element.  So
there is really no precedent for structures.  (Unless we converted them
to pointers to *their* first elements? :-)

However, the `.'/`->' distinction can be annoying.  Suggestions to
merge the two have a valid motivation.
Let me pass on a technique I stumbled upon, which has saved me lots of
debugging time, in some applications.  (I wouldn't want this to be a
language feature, though.)  Always declare your structures as
1-element arrays.  Wherever they occur--even as sub-structures.  Then,
dispense with dot `.', and always use arrows.  Voila!  Arrow works
everywhere, whether your LHS names a ``real'' structure or a pointer.
I actually had a program with hundreds of structure qualifications,
about 50/50 split between arrow and dot, which was immediately cleaned
up with this trick.  Downside: Since arrays are passed by reference,
your structures are too.  But this might be what you want (it was for
me in that particular application).

Now to my fourth and last proposal for ANSI: Allow arrays per se to be
passed as arguments, and returned from functions, by value.

This is good for the same reasons as array assignment is: Implementors
of data abstractions have more choices of the implementation type.
(Please, no ``why would you evers''.  If you place an arbitrary
restriction on programmers--``no assignment of arrays''--they'll just
find a messy workaround, such as a trivial enclosing structure, and
write unclear source code that produces the very same object code you
tried to make impossible.)

This proposal gets close to breaking existing code, since when an
array is passed to a function, it suffers the "usual conversion" to a
pointer.  But let's take our cue from the float/double controversy:
Suppress the "usual conversion" when a prototype is in scope.  For
example:

    typedef int MYBUF[3];
    MYBUF a, b;
    MYBUF process_buf(MYBUF c)
    {
        /* grovel over c, which is private copy */
        return c;
    }

This is backward compatible, since no old code uses function
prototypes.
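For comparison, here is a sketch of the "trivial enclosing structure" workaround that the fourth proposal would make unnecessary. The wrapper type `MYBUF_BOX`, its member `v`, and both function names are hypothetical, chosen only for this illustration; the point is that structures can already be passed and returned by value, so the array inside rides along.

```c
#include <assert.h>

/* Hypothetical wrapper: an array of 3 int hidden inside a struct. */
typedef struct { int v[3]; } MYBUF_BOX;

/* c is a private copy; changes to it do not reach the caller's object. */
MYBUF_BOX process_boxed_buf(MYBUF_BOX c)
{
    if (c.v[2])
        --c.v[2];
    return c;            /* the whole array is copied back out */
}

/* Hypothetical demo function. */
int demo_by_value(void)
{
    MYBUF_BOX a = {{1, 2, 3}};
    MYBUF_BOX b = process_boxed_buf(a);

    assert(a.v[2] == 3); /* caller's copy untouched */
    assert(b.v[2] == 2); /* returned copy carries the change */
    return b.v[2];
}
```

Every access has to go through the extra member name, which is the kind of unclear source the article complains about; the generated copying is the same as what a by-value array parameter would need.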
If you want arrays to be passed by reference, use the old-style
function declaration syntax (or introduce explicit declarations for
reference types!):

    mung_buf_by_reference(c)
    MYBUF c;
    {
        if (c[2]) --c[2];
    }

Calls to functions with no prototypes use the `...' conversions, which
convert an array to a pointer to its first element.  When the pointer
gets to the function, which must be declared in the old style, it is
interpreted (transparently) as a reference, as per my third proposal
above.

The call by reference can also be partially expressed in the new
prototype syntax:

    mung_buf_by_reference(int * c)
    {
        if (c[2]) --c[2];
    }

In the latter case, you lose the ``convenience'' of call-by-reference;
sizeof(c) is just sizeof(int *).

Oh no!  No more call-by-reference in C?  What a loss!--the loss of an
extraneous feature: Only arrays ever had call-by-reference in the
first place.  In other words, the call-by-reference, so carefully
cleaned up in my third proposal, is a ``feature'' of the old-style
declarations, which will someday wither away.  (By then, C will have
reference types, and not just for array-params.)

To summarize: I've asked for the addition of 4 new features to ANSI C,
which make arrays into first class objects.  I've argued that
programmers can be trusted to use good judgment, and that language
designers should refrain from exercising it for them in advance.  I've
shown that the new features actually fill gaps in the old language,
and the resulting language has *fewer* features, since there are fewer
exceptions.

(Next, we go after bitfields.)
                                        -- John Rose
--
----------------------------------------------------------
John R. Rose            Thinking Machines Corporation
245 First St., Cambridge, MA 02142  (617) 876-1111 X270
rose@think.arpa         ihnp4!think!rose