Path: utzoo!attcan!uunet!crdgw1!uakari.primate.wisc.edu!zaphod.mps.ohio-state.edu!ub!oswego!news From: dl@g.g.oswego.edu (Doug Lea) Newsgroups: comp.std.c++ Subject: Array References: request for comments Message-ID: Date: 20 Oct 90 12:03:12 GMT Sender: news@oswego.Oswego.EDU (Network News) Reply-To: dl@g.oswego.edu Distribution: comp Organization: SUNY Oswego Lines: 673 I'm considering submitting the following to X3J16 (ANSI C++). But before I do, I'd like to get more feedback. All comments would be welcome, in an attempt to make this as strong and useful a proposal as possible. Please email replies directly (dl@g.oswego.edu). Thanks! ----------------- % Format with latex \documentstyle{article} \begin{document} \title{{X3J16 Proposal:} \\ {Array References}} \author{Doug Lea} %\date{} \maketitle \section{The Problem} In C and C++, a {\tt T*} may ambiguously refer to either a pointer to a single object, or a pointer to the first of an array of objects. (Throughout this proposal {\tt T} is a stand-in for any first-class C++ type.) This ambiguity is alternately considered to be among C's major strengths, in that it allows the total integration of pointers and arrays, and as its greatest weakness in providing the infrastructure for C++, in that it compromises type safety and impedes optimizability of the kinds of high-level constructs that are much more characteristic of C++ than C programs. \subsection{Example 1} \samepage{\begin{verbatim} class B { public: int bvar; virtual int f() { return 0; } }; \end{verbatim}} \samepage{\begin{verbatim} class D : public B { public: int dvar; int f() { return 1; } }; \end{verbatim}} \samepage{\begin{verbatim} void useB(B* p) { int i = p[1].f() + p[1].bvar; } \end{verbatim}} \samepage{\begin{verbatim} main() { B single_b; useB(&single_b); // (1) B array_of_b[10]; useB(array_of_b); // (2) D single_d; useB(&single_d); // (3) D array_of_d[10]; useB(array_of_d); // (4) B* bp = 0; useB(bp); // (5) } \end{verbatim}} The safety problems are that a client (caller) of {\tt useB} cannot tell by {\tt useB}'s signature whether any or all of \begin{enumerate} \item A pointer to a single B \item An array of B's \item A pointer to a single D \item An array of D's \item A null pointer \end{enumerate} \noindent are legal arguments to {\tt useB}. As it turns out, only an array of B's is really supported: Calls (1) and (3) are unsupported because {\tt useB} attempts to treat the scalars {\tt single\_b} and {\tt single\_d} as arrays, and call (4) is unsupported because C++ arrays are necessarily {\em homogeneous}, (i.e., filled with elements all of the same exact class) requiring that the declared type ({\tt B}) be treated as an {\em exact} type when the compiler determines the byte offset (via {\tt sizeof(B)}) of the element {\tt p[1]}. Since {\tt sizeof(B) != sizeof(D)}, the computed offset is incorrect. Call (5) is unsupported since {\tt useB} does not check whether the pointer is null. Thus calls (1), (3), (4) and (5) are type-legal but wrong. There ought to be a way for programmers to express such intentions. This is a more serious type safety compromise in C++ than in C, since ambiguities due to subclassing are not present in C. \subsection{Example 2} \samepage{\begin{verbatim} class intVec { int sz; int* data; public: intVec(int s) :sz(s), data(new int[sz]) {} ~intVec() { delete [] data; } int& operator[] (int i) { return data[i]; } int length() { return sz; } }; \end{verbatim}} \samepage{\begin{verbatim} void clear(intVec& v) { for (int i = 0; i < v.length(); ++i) v[i] = 0; } \end{verbatim}} Any C++ compiler attempting to perform standard optimizations on {\tt clear} will be bothered by a fundamental obstacle. Even though {\tt length()} is inlined, because {\tt v.data} is allowed to point to {\em any} int, a compiler must conservatively assume that {\tt v.data} {\em could} point to {\tt v.sz}, that could in turn be set to zero within the assignment statement. This means that {\tt v.sz} must be retrieved from memory on each iteration rather than being kept in a register, or more importantly, being used as a loop constant that would allow this code to be transformed into the essentially optimal C-like \samepage{\begin{verbatim} { int* p = v.data; int* fence = &v.data[v.sz]; while (p != fence) *p++ = 0; } \end{verbatim}} \noindent or even a vectorization of this loop on a machine supporting such things. Among the basic efficiency differences between C and C++ programming is that C programmers who are interested in guaranteeing efficient performance can always manually rewrite functions like {\tt clear} in such a manner themselves. However, because of data encapsulation, C++ programmers cannot obtain low-level data access. Since only compilers are allowed to (safely) break encapsulation, programmers must rely on compilers to optimize functions much more than C programmers do. Thus, optimizability issues will continue to play a far greater role in C++ than they have in C. Programmers and libraries creating utility classes like {\tt intVec} will discover surprising and intractable efficiency losses for doing so unless these issues are addressed. This is an increasingly important issue for some users. Programmers will not always heed the standard advice (e.g., ARM, p 137) to create and use array-like classes instead of C constructs if they obtain noticeably slower code when doing so. It is easy to concoct larger examples where this kind of problem causes much more devastating optimization failures. This is not the sort of issue that can {\em always} be solved by merely creating cleverer compilers that attempt to discern for themselves whether a variable is serving as a pointer to a scalar versus an array. Moreover, even if compilers were to be able to handle the majority of such cases via advanced dataflow techniques, the programmer type safety issues remain: It is useful for the {\tt intVec} author to make crystal clear to other programmers, not just the compiler, that {\tt data} must point to an array, not a scalar. Programmers who desire greater type safety and optimizability need to have a mechanism to express the necessary constraints in a manner that does {\em not} require any other changes to C++, so that existing code continues to be accepted. \section{A Solution} Perhaps surprisingly, C++ already possesses the basic construct needed to solve many of these problems, the array reference. It is legal to declare references to arrays in either of two forms: \begin{itemize} \item ``Open array references'' that do not specify array capacity, denoted via {\tt T (\&open\_ref)[]} \item ``Closed array references'', that do specify capacity, denoted via {\tt T (\&closed\_ref)[N]}, where {\tt N} is a {\em compile-time} constant). \end{itemize} The fact that the syntax for declaring and using array references is very awkward may have somehow contributed to the fact that their basic semantics and properties have apparently never been fully spelled out.\footnote{Readers of the present proposal may find many syntactic constructs difficult to immediately grasp. My only defense is that I did not design this syntax.} The present proposal examines these issues and proposes answers in ways that maximally improve both type safety and optimizability, while also maintaining perfect back-compatibility. If these issues are settled, perhaps programmers will become accustomed to creating simple {\tt typedefs} that eliminate some of their awkwardness. This proposal does not go so far as to suggest particular wording changes of particular sections of the draft ARM. Instead, the basic issues are surveyed, and precise rules and their consequences are discussed. It should be a relatively easy matter to integrate these into the corresponding sections of the standard. In fact, with only a few (indicated) exceptions, there is nothing in this proposal that directly contradicts the ARM. Thus, it may be viewed as merely an interpretation of the utility of array references and their implementation. In other words, regardless of agreement on the details of this proposal, the semantics of array references probably need to be addressed by X3J16, since (1) they are not otherwise completely defined in the ARM, (2) current C++ compilers cannot be depended on to provide any consistent semantics for this construct, and (3) if, say, only one C++ compiler were to implement the interpretation discussed here, and programmers were to regularly exploit this support, serious incompatibilities could result. \section{Representational Properties} In C++, a reference is merely an ``alias''. As discussed, for example, in the ARM, p 153, declaration of a reference does not automatically entail the creation of a behind-the-scenes pointer requiring dereferencing. In fact array references {\em never} require extra indirection in order to access the base of the referenced array. Any {\tt T (\&)[]} or {\tt T (\&)[N]} may be internally represented in precisely the same fashion as a as {\tt T param[]} or {\tt T*}, i.e., via a pointer to the first cell of the array.\footnote{While it is not clear to me whether X3J16 can mandate this kind of economical representation of array references, an important aspect of the present proposal is that the advantages of array references may be obtained without additional overhead over other forms.} For example, all three of the following might generate precisely the same machine code: \samepage{\begin{verbatim} void strcpy1(char* s, const char* t) { for (int i = 0; t[i] != 0; ++i) s[i] = t[i]; } \end{verbatim}} \samepage{\begin{verbatim} void strcpy2(char s[], const char t[]) { for (int i = 0; t[i] != 0; ++i) s[i] = t[i]; } \end{verbatim}} \samepage{\begin{verbatim} void strcpy3(char (&s)[], const char (&t)[]) { for (int i = 0; t[i] != 0; ++i) s[i] = t[i]; } \end{verbatim}} The only programmer-visible difference is type safety. For example, it should be illegal to write {\tt char c; strcpy3(\&c, \&c);}. While (as discussed below) {\tt T param[]} and {\tt T* p} are often used interchangeably, the otherwise parallel constructs {\tt T (\&aref)[]} and {\tt T* (\&ptr)} are representationally distinct, and thus require stronger type enforcement. \section{First Class Status} Array references share many similarities with the C {\tt T param[]} construct (i.e., an array declared as a function parameter, as in {\tt strcpy2}). They differ in two important respects: While {\tt T param[]} is only legal in parameter declarations (and a few other miscellaneous spots), and can almost always be used interchangeably with {\tt T*}, array references are legal wherever any kind of reference is legal, and cannot be interpreted as equivalent to {\tt T*}. Use of array references clears up some problems encountered in the above examples: \subsection{Example 1, revisited} \samepage{\begin{verbatim} void useB2(B (&a)[]) { int i = a[1].f() + a[1].bvar; } \end{verbatim}} \samepage{\begin{verbatim} main() { B single_b; useB2(&single_b); // (1) ERROR B array_of_b[10]; useB2(array_of_b); // (2) D single_d; useB2(&single_d); // (3) ERROR D array_of_d[10]; useB2(array_of_d); // (4) ERROR B* bp = 0; useB(bp); // (5) ERROR } \end{verbatim}} Here, the calls that were previously legal but unsafe would be caught at compile time, assuming implementation of the declaration and matching rules to be described below. Additionally, an interesting minor optimization can be performed. Since C++ arrays are necessarily homogenous, the virtual call {\tt a[1].f()} can be resolved at compile time. It must be B's version of {\tt f()} being called, not any other. \subsection{Example 2, revisited} \samepage{\begin{verbatim} class intVec { int sz; int (&data)[]; public: intVec(int s) :sz(s), data((int(&)[])(new int[sz])){} ~intVec() { delete [] &(data[0]); } int& operator[] (int i) { return data[i]; } int length() { return sz; } }; \end{verbatim}} \samepage{\begin{verbatim} void clear(intVec& v) { for (int i = 0; i < v.length(); ++i) v[i] = 0; } \end{verbatim}} Since {\tt v.data} must {\em not} refer to a scalar, the optimization otherwise missed inside of {\tt clear} can be handled. (The odd mechanics of integrating array references with the {\tt new} and {\tt delete} operators are addressed below.) \section{Declaration and type-matching issues} Examples have so far only used open array references. However, closed array references are also occasionally useful. In fact, the two constructs differ enough that separate sets of rules must be created to handle them. By definition, an open array reference can serve as an alias for an array with any number of elements, with the actual number of elements often unknown to the programmer. Declaration of a closed array reference represents a stronger commitment on the part of the programmer. Since, to remain backwards compatibity, closed array references must only use compile-time constants for array bounds, they are less often usable.\footnote{Their utility could be marginally improved by extending the typesafe C++ {\em conformance} type matching rules to array references, so that a reference to, say 100 ints could be attached to an array of 200 ints. Unfortunately, this strategy conflicts with the assumptions of C matrix operations.} Since the little-used {\tt T param[]} construct represents a weaker version of array references, one could create rules to enable transformation from one form to the other in those cases where it is unambigously safe. However, because the {\tt T param[]} construct cannot currently be depended on be considered distinct from {\tt T*} (C and C++ compilers traditionally treat them as synonyms) a simpler strategy is to formalize this type equivalence (as is implicit in the ARM, p 138, as well as in ANSI C). This then allows simpler rules for dealing with array references, as opposed to either construct. \subsection{Rules} Given these considerations, the necessary rules for declaring array references should be clear: An open array reference, {\tt T (\&r)[]} may alias \begin{itemize} \item any array object {\tt T a[N]}, or \item an array object referenced by a closed array reference, {\tt T (\&cr)[N]}, or \item an array object referenced by another {\tt T (\&or)[]}; \end{itemize} \noindent where {\tt T} must be an {\em exact} type match, not a subclass, because of C++ homogeneity restrictions on arrays. A closed array reference, {\tt T (\&r)[N]} may alias \begin{itemize} \item an array object {\tt T a[N]}, or \item an array object referenced by another {\tt T (\&cr)[N]},; \end{itemize} \noindent again with the exactness restriction. No provision is made for implicitly coercing an open to a closed array reference. As is always the case, explicit coercions are available, and might even be useful by those who are aware of their inherent dangers. (Casts and coercions are, of course, dangerous, because compilers are generally allowed to {\em believe} in the veracity of declared types, which might lead to transformations and optimizations that fail to make sense for a casted object.) No additional rules are required to support multidimensional arrays or matrices. None of these rules are inconsistent with the ARM. Indeed, some are already implicit in other definitions and rules in the ARM, and appear consistent with at least some of the observed behavior of some C++ compilers. \subsection{Relation to {\tt T*}} C requires that a {\tt T*} pointer may be implicitly coerced into a {\tt T []}, and vice versa. But there is every reason to resist implicit coercions between {\tt T*} and array references. In passing, it may be noted that coercions may be used in order to create references to subarrays: \samepage{\begin{verbatim} T a[200]; T (&suba)[50] = (T (&r)[50])(&a[150]); \end{verbatim}} \subsection{Overloading} Overloading sensitivity and resolution rules, and type matching generally, should be governed by the above requirements. For example, function {\tt void f(int (\&a)[])} would be distinguishable from void {\tt f(int* p)} (which is, according to the ARM, p138, the same function as {\tt f(int param[])}). While sound, this has a surprising effect: \samepage{\vspace{-0.5pt}\begin{verbatim} extern void f(int*); extern void f(int (&)[]); int x[100]; f(x); // ambiguous int (&rx)[100] = x; f(rx); // call array reference version \end{verbatim}} Unless changes in the overload resolution policies are made, programmers will have to refrain from overloading pointer versions of functions in programs using array references. Such issues have interesting consequences for C++ library standardization. For example, should the C library version of {\tt strcpy} have a C++ prototype along the lines of {\tt strcpy1} or of {\tt strcpy3} above? \section{Assignment} In C++, reference assignment (as opposed to initialization) always entails {\em deep copy} semantics, which in the case of array references, imply the need for array copies. However, this conflicts with the set of C rules that prevent array copies from ever automatically occuring, even in those cases where the bounds of the copy are determinable at compile time. Hence, the most conservative strategy is to simply make array reference assignments illegal, in the same manner that array assignments are illegal. For example, \samepage{\vspace{-0.5pt}\begin{verbatim} T x[100]; T (&rx)[100] = x; T y[100]; T (&ry)[100] = y; x = y; // ERROR rx = y; // ERROR x = ry; // ERROR \end{verbatim}} \section{Array Pointers} Array references fail to clear up safety and efficiency issues in those cases where pointers are required instead of references. Just as it is legal to define a reference to an array, it is legal (and again, awkward) to define a pointer to an array via \begin{itemize} \item ``Open array pointers'' that do not specify array capacity, denoted via {\tt T (*open\_ptr)[]} \item ``Closed array pointers'' that do specify capacity, denoted via {\tt T (*closed\_ptr)[N]}, where {\tt N} is a {\em compile-time} constant). \end{itemize} These have similar representational properties as array references. They do not require additional indirection. The most important situation in which array pointers are required is in dealing with allocation and destruction of arrays of objects. The C++ definitions of the ``vector'' versions of {\tt new} and {\tt delete} operators accentuate the scalar versus array pointer ambiguity found in C. While {\tt vector\_new} actually returns a pointer to an array of objects ({\tt T (*)[]}), it is typed as returning a simple pointer ({\tt T*}). Similarly, the argument for {\tt vector\_delete} should be ({\tt T (*)[]} but is listed as {\tt T*}. This type laxity allows many common programming errors to remain uncaught by compilers. For example, in \samepage{\vspace{-5.0pt}\begin{verbatim} B* bp = new D[100]; /* ... */ delete [] bp; \end{verbatim}} \noindent the base ({\tt B}) element size step would be {\em incorrectly} used to traverse through the array of {\tt D}s while calling destructors, with indeterminate (but surely unpleasant) run time consequences. Additionally, current definitions require the kinds of machinations seen in the {\tt intVec} constructor and destructor in order to ``coerce'' arguments into their logically correct form. The solution to this is first solidify rules for array pointers, and then to correctly define the type of {\tt vector\_new} and {\tt vector\_delete}. \subsection{Rules} Rules for declaring and initializing array pointers should correspond exactly to those for references: \begin{itemize} \item An open array pointer {\tt T (*p)[]} may point to anything that an open array reference {\tt T (\&r)[]} may refer to. \item A closed array pointer {\tt T (*p)[N]} may point to anything that a closed array reference {\tt T (\&r)[N]} may refer to. \end{itemize} For example, \samepage{\vspace{-5.0pt}\begin{verbatim} T x[100]; T y[200]; T (*p)[] = &x; // p points to x *p = 17; // ERROR (*p)[0] = 17; // set x[0] = 17 T (*q)[100] = &x; // q points to x (*q)[1] = 2; // set x[1] = 2 T (&r)[] = y; // r refers to y p = &r; // repoint p to y (*p)[2] = 49; // set y[2] = 49 q = p; // ERROR p = q; // OK \end{verbatim}} Related issues, including overloading sensitivity and assignment, are also identical to those listed described for array references. \subsection{Relation to {\tt T*}} While, because of lack of existing precedent, there is no need to specially consider the intercoercibility of array references and simple pointers, there is a great deal of C++ code that relies on the representational equivalence and lack of type discernment between {\tt T*} and {\tt T (*)[]}. Requiring that these forms not freely intercoerce is clearly most defensible. The greatest impact would be on {\tt vector\_new} and {\tt vector\_delete}. A new set of rules, conflicting only in type details with the existing ARM (e.g., p60), are needed stating that \begin{itemize} \item The form {\tt new T[ expression ]} invokes {\tt vector\_new} and returns a {\tt T (*)[]}. \item The form {\tt delete[] arg} requires that {\tt arg} have exact type {\tt T (*)[]} and calls {\tt vector\_delete}, returning {\tt void}. \item {\tt vector\_new} and {\tt vector\_delete} are not overloadable. \end{itemize} For example, given this, the {\tt intVec} constructor and destructor could be rewritten intelligibly, in a manner perfectly analogous to those allocating and freeing scalars: \samepage{\begin{verbatim} class intVec { //... intVec(int s) :sz(s), data(*(new int[sz])) {} ~intVec() { delete [] &data); } //... }; \end{verbatim}} These changes would invalidate existing code of the form \samepage{\vspace{-5.0pt}\begin{verbatim} T* p = new T[100]; /* ... */ delete [] p; \end{verbatim}} \noindent There are two possible ways to deal with this: \begin{enumerate} \item Phase in the changes. \item Allow {\tt T*} and {\tt T(*)[]} to silently, ``trivially'' intercoerce. \end{enumerate} Choosing the second option negates {\em all} type safety advantages. \section{Homogeneity and type safety} As discussed in my (mainly unrelated) ``Customization in C++'' paper (1990 USENIX C++ conference), it is difficult in C++ to express the fact that a function is {\em intentionally} type-specific, properly working for a particular class, but not necessarily for subclasses. As noted above, this is among the issues successfully addressed by the the use of array references in the special case of functions requiring homogeneous array arguments. The most common and important examples are ``homogeneous object container classes'' like simple stacks, queues, sets, lists, etc., , in which {\em copies} of objects (created via {\tt X(const X\&)} or {\tt X::operator = (const X\&)}) passed by reference into an insertion member function are placed inside a container. Such methods are type-specific, since exact versions of the copy constructor and/or assignment operator must be hard-wired into implementations. It is almost always the class author's intention to disallow {\tt sub-T} arguments, in order to prevent suprising and undesirable ``chopped copies'' and unintentional insertion of objects without exact type {\tt T}. Such containers bear more than a passing similarity to C arrays. For both efficient implementability and type safety, these containers are strictly homogeneous, like C arrays. Array references can provide a modicum of help for expressing this correspondance. Since arrays are necessarily homogeneous, arguments for container methods can be declared as arrays of single elements. For example, a simple Stack could be declared via, {\samepage\vspace{-5.0pt}\begin{verbatim} class TStack100 { private: int sp; T data[100]; public: void push(const T (&a)[1]) { data[sp++] = a[0]; } //... }; \end{verbatim}} \noindent so as to disallow the use of {\tt sub-T}'s as arguments to {\tt push}. Note that, generally, this technique extends the kinds of optimization opportunities discussed above. Use of exact types in these contexts, as enabled by homogeneity restrictions, means that virtual function calls can be statically resolved. Of course, {\em using} such a class would not be especially simple or convenient, since arguments for methods like {\tt push} would either need to be framed as singleton arrays, or explicitly coerced into arrays, which, if done routinely via macros or the like, ends up defeating the safety goals. However, there is a way to accomodate simpler client usage via an ad hoc special case rule. Recall that allowing {\em implicit} coercions from scalar to array references and vice versa negates the optimizability goals discussed with respect to the {\tt intVec} example. However, if intercoercibility were allowed for the special case of {\em singleton} closed array references only, pragmatic attainment of both goals seems reachable. That is, declaration and matching rules could be extended to include: \begin{itemize} \item A {\tt T (\& singleton)[1]} may alias any object of exact type {\tt T}. \end{itemize} Overloading sensitivity and dispatch rules would have to be modified accordingly. This is an uncomfortable proposal because it yet further complicates matching, coercion, and overloading rules, which are difficult enough as it is. However, the resulting type safety enhancements seem well worth the trouble. An alternative proposal, that obtains the same effect without taking a backdoor approach, is briefly discussed in my ``customization'' paper. \section{Epilogue} The primary limitation of array references is that they, like any other references are not ``reseatable'': C++ references may only alias one object throughout their lifetimes. This makes the use of array references impossible in common cases involving resizing, copy-on-write, delayed allocation, sharing, etc., of referred-to arrays. Surely others have proposed to X3J16 that this restriction be removed. I would like to go on record as supporting any such change. Among the simplest possible mechanisms for doing so is to define a non-overloadable built-in polymorphic pseudofunction (i.e., with the same status of {\tt sizeof}), {\tt void reseat(ANY\& r, ANY\& newobj)} that makes {\tt r} refer to {\tt newobj} Any other constructs with the same effect would suffice. Of course {\tt const} references should not be reseatable. Allowing reseatable references would contribute to the battle to create safer, more optimizable subsets of C++. The utility of references in terms of type precision, inherent promises against performing pointer arithmetic, and simplicity of use should not be wasted because of quibbles about the most appropriate means for expressing things like reseating. \end{document} -- Doug Lea dl@g.oswego.edu || dl@cat.syr.edu || (315)341-2688 || (315)443-1060 || Computer Science Department, SUNY Oswego, Oswego, NY 13126 || Software Engineering Lab, NY CASE Center, Syracuse Univ., Syracuse NY 13244