Path: utzoo!attcan!uunet!snorkelwacker!usc!cs.utexas.edu!yale!cmcl2!lanl!jlg
From: jlg@lanl.gov (Jim Giles)
Newsgroups: comp.society.futures
Subject: C's sins of commission (was: (pssst...fortran?))
Message-ID: <62927@lanl.gov>
Date: 14 Sep 90 02:08:25 GMT
References: <1990Sep13.185833.17455@cunixf.cc.columbia.edu>
Organization: Los Alamos Natl Lab, Los Alamos, N.M.
Lines: 260

From article <1990Sep13.185833.17455@cunixf.cc.columbia.edu>, by wp6@cunixa.cc.columbia.edu (Walter Pohl):
> 
> 	What do you mean about C's sins of commission?
>       Do you mean the lack of type checking?  [...]

Actually, what you're asking is a tough question.  There are so many
problems with C that just listing the more obvious ones would take many
pages.  It is difficult to turn to _any_ page of the C draft standard
without stumbling upon something with which I completely disagree. (By
the way, it is difficult to turn to any page of the final C standard
because I haven't seen any copies of it.  Has it even been published
yet?  It was finalized in January/February.)

Yes, type checking is a problem with C.  To my mind, it is one of C's
least egregious faults.  For one thing, most violations _are_ illegal
in C - just that most implementations don't bother checking.  I make
a careful distinction between a language and any particular implementation.
The faults of C that I most object to are those which cannot be corrected
because the language itself requires them.

As I said, _most_ type violations are already illegal.  Not all though.
Unions are not discriminated.  Pointer 'casts' are allowed (essentially
between _any_ two pointer types - only conversions to and from 'void'
pointers happen implicitly, but an explicit cast, directly or by way of
void, to anything else is legal).
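
To make the two loopholes above concrete, here is a minimal C sketch.
The helper names are hypothetical, and the bit pattern mentioned in the
note assumes a 32-bit int and an IEEE 754 single-precision float.

```c
/* An undiscriminated union: nothing records which member was stored last. */
union number {
    int   i;
    float f;
};

/* Hypothetical helper: stores a float, then reads the same bytes back
   as an int.  The compiler accepts this without complaint. */
int read_float_as_int(float value)
{
    union number n;
    n.f = value;
    return n.i;
}

/* Hypothetical helper: reaches an unrelated pointer type by way of void *. */
unsigned char first_byte_of(int value)
{
    void *v = &value;        /* int * converts to void * silently      */
    unsigned char *c = v;    /* void * converts to any object pointer  */
    return c[0];
}
```

Under those assumptions read_float_as_int(1.0f) yields 0x3F800000, the
raw encoding of 1.0, with no diagnostic required anywhere.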

This leads us to pointers.  Just about everything about C pointers is
bad.  From the fact that pointers are hopelessly confused with arrays
(which are completely separate conceptually) to the syntax of pointer
use, C's pointers are a mess.  In addition, many language design people
now feel that pointers of _any_ kind are a bad idea.  C.A.R. Hoare
condemned them as long ago as the early 70's (about the time C was
'designed').  He pointed out that pointers are the data structuring
element that corresponds to GOTOs in flow control - if the one is
bad, so is the other.

-----------------------------------------------------------------------------

Since this is comp.society.futures, I will discuss pointer replacements.
Essentially, pointers only do three things for you: 1) recursive data
structures (graphs, trees, etc....); 2) dynamic memory; and 3) run-time
'equivalence'.  C pointer arithmetic only does what one dimensional array
indexing already does (scaled address calculations): arrays are better for
this - so it's _not_ counted as one of the features of pointers.

Recursive data structures are best implemented directly (to use a
C/Fortran like declaration syntax with the type names on the left):

	Type Tree is record
	   integer :: value
	   tree :: left, right
	end type Tree

Note that the elements inside a tree-valued data type are not _pointers_
but are actually trees themselves.  No more confusing pointers with
what they point to - the pointers aren't explicitly visible.  No more
forgetting the dereference operator (or, conversely, putting it in
incorrectly) - there isn't a dereferencing operator.  To be sure, the
compiler _may_ internally use pointers to do the implementation of
these recursive structures (but then, it probably uses GOTOs to internally
implement loops), but since they aren't explicitly visible to the user,
his life is much easier.
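
For contrast, this is roughly what the same tree looks like in C, where
the pointers and the dereferences are all explicit.  It is only a
sketch; the function names are invented for illustration.

```c
#include <stdlib.h>

struct tree {
    int value;
    struct tree *left, *right;   /* pointers to trees, not trees */
};

/* Inserts value into the ordered tree rooted at t and returns the root.
   If malloc fails for a new node, the tree is returned unchanged. */
struct tree *tree_insert(struct tree *t, int value)
{
    if (t == NULL) {
        t = malloc(sizeof *t);
        if (t == NULL)
            return NULL;
        t->value = value;
        t->left = t->right = NULL;
    } else if (value < t->value) {
        t->left = tree_insert(t->left, value);
    } else {
        t->right = tree_insert(t->right, value);
    }
    return t;
}

/* Adds up every value; each step must test for NULL and dereference. */
int tree_sum(const struct tree *t)
{
    return t == NULL ? 0 : t->value + tree_sum(t->left) + tree_sum(t->right);
}

/* Releases every node; the user, not the language, owns this chore. */
void tree_free(struct tree *t)
{
    if (t != NULL) {
        tree_free(t->left);
        tree_free(t->right);
        free(t);
    }
}
```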

Dynamic memory should also be implemented directly.  Again, here is an
example:

	Dynamic Integer :: a(:,:)       !-- declares two dimensional a
	...   use of a here is illegal - not allocated yet ...
	ALLOCATE a(50,100)              !-- allocates 5000 words memory for a
	...   use of a here is legal ...

Of course, there would have to be an inquiry function to detect whether
the object was allocated or not.  Further, the decision would have to be
made in the language design whether deallocation would be automatic
(garbage collection, reference count, etc.) or whether the user would have
to explicitly deallocate things.  Either way, this is simpler, safer,
and easier to code, use, and debug than pointer usage.  Further, the
compiler can optimize uses of the dynamic object with the knowledge
that it's not aliased to anything - a fact the compiler cannot deduce
from malloc() calls (which as far as the compiler knows is just a function
which might be returning just any old address it feels like).
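
For comparison, a hedged sketch of how the same 50 by 100 array is
usually obtained in C.  The names grid_alloc and grid_at are
hypothetical; the point is that the shape lives only in the programmer's
head and the indexing is done by hand.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocates rows*cols zeroed ints; returns NULL on failure or overflow.
   To the compiler the result is just an address of unknown origin. */
int *grid_alloc(size_t rows, size_t cols)
{
    if (cols != 0 && rows > SIZE_MAX / cols)
        return NULL;
    return calloc(rows * cols, sizeof(int));
}

/* Manual two dimensional indexing: the caller must pass cols every time. */
int *grid_at(int *grid, size_t cols, size_t row, size_t col)
{
    return &grid[row * cols + col];
}
```

A caller would write `int *a = grid_alloc(50, 100);`, test the result
against NULL, use `*grid_at(a, 100, i, j)`, and remember to call
free(a) exactly once.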

Run-time equivalencing is a feature which some people (with a good
deal of justification) claim shouldn't be allowed at all.  I disagree.
But there are still some distinctions to be made.

First, equivalencing might be used just to reuse statically allocated space
(although, using dynamic memory is probably better).

Equivalence might also be used to provide a form of array reshaping or
slicing - here pointers are inadequate: try the ALIAS/IDENTIFY feature
in the first draft Fortran 8X proposal.

Equivalence might also be used for defeating type checking - but here I
prefer to recommend the below:

	Type Float_internal is record
	   bit.1 :: sign
	   bit.8 :: exponent
	   bit.23:: significand
	End type Float_internal

	Float :: x                      !-- x is a simple float variable
	Map x as Float_internal         !-- overlays record onto x
	x = 5.0                         !-- x used as usual
	x.sign = 1                      !-- negate x - use the mapping
	x.exponent=x.exponent+1         !-- multiply x by 2 - use the map
	... etc ...

This makes the defeating of the type checking explicit and also makes
the intended use clearer.
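
For comparison, here is one way the same two operations can be written
in present-day C.  This is a sketch that assumes float is a 32-bit IEEE
754 single (1 sign bit, 8 exponent bits, 23 significand bits); the
function name is made up, and overflow of the exponent field is not
handled.

```c
#include <stdint.h>
#include <string.h>

/* Sets the sign bit and adds one to the exponent field of x,
   mirroring "x.sign = 1" and "x.exponent = x.exponent + 1". */
float negate_and_double(float x)
{
    uint32_t bits;
    uint32_t exponent;

    memcpy(&bits, &x, sizeof bits);              /* copy out the raw bits  */
    exponent = (bits >> 23) & 0xFFu;             /* extract 8-bit exponent */
    exponent = (exponent + 1u) & 0xFFu;          /* multiply by two        */
    bits = (bits & ~(0xFFu << 23)) | (exponent << 23);
    bits |= UINT32_C(1) << 31;                   /* force the sign bit on  */
    memcpy(&x, &bits, sizeof x);                 /* copy the bits back     */
    return x;
}
```

Under the stated assumption, negate_and_double(5.0f) gives -10.0f.  The
shifts and masks carry the layout that the record declaration above
states in one place.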

One of the problems with C pointers is that you cannot locally tell if a
pointer is supposed to be an array, a recursive structure, an allocated
object, or some exotic run-time equivalence.  Providing all these possible
features with high-level syntax and separate functionality improves the
clarity of the code.  It usually even makes the code more succinct
(shorter).  So, to make a long story short (too late), I haven't yet
found any application which _needs_ explicit pointers either for speed
or functionality.  The above replacements either conceal or eliminate
pointers and are as (or more) efficient and easier to use.

-----------------------------------------------------------------------------

Now, back to C.

Related to type checking is mixed mode.  I don't object to mixed mode;
in fact, I support it.  But C's rules for applying it are not reasonable.
The _claim_ is that the rules are designed to allow speed.  Actually,
there is no rational reason for minus five divided by a thousand to
_ever_ be positive or to _ever_ be larger than one in magnitude.  The C
rules sometimes require that (-5/1000U == some large machine dependent
constant).  The C type hierarchy needs considerable adjustment.
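
A small sketch of the division in question.  The function names are
hypothetical; the large constant comes from converting -5 to unsigned
(giving UINT_MAX - 4) before the division takes place.

```c
#include <limits.h>

/* int divided by unsigned: the -5 is converted to unsigned first. */
unsigned int mixed_quotient(void)
{
    return -5 / 1000U;
}

/* Where the machine dependent constant comes from. */
unsigned int expected_wrapped_quotient(void)
{
    return (UINT_MAX - 4u) / 1000u;
}

/* The all-signed version behaves as arithmetic would suggest. */
int signed_quotient(void)
{
    return -5 / 1000;
}
```

With a 32-bit unsigned int, mixed_quotient() is 4294967, while
signed_quotient() is 0.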

This brings us to mixed type operations (not just mixed mode).  Since
C has no 'logical' type, you are allowed to mix arithmetic with the
results of conditionals with wild abandon.  I have never seen any
advantage to this - I HAVE seen a lot of people make a lot of costly
and time consuming mistakes as a result.  Further, the lack of a
'logical' data type means that they must provide more than one set
of boolean operators (and, or, not, xor) in order to have bitwise
and logical distinguished.
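
A few hypothetical one-liners show both halves of the complaint: the
two families of boolean operators disagree on the same operands, and
the result of a comparison is an ordinary int that can be compared
again.

```c
/* Logical AND: any nonzero operand counts as true; the result is 0 or 1. */
int logical_and(int a, int b) { return a && b; }

/* Bitwise AND: combines the operands bit by bit. */
int bitwise_and(int a, int b) { return a & b; }

/* Parenthesized the way C parses `lo <= x <= hi`: the 0-or-1 result of
   the first comparison is then compared with hi. */
int chained_compare(int lo, int x, int hi) { return (lo <= x) <= hi; }

/* What was almost certainly meant. */
int in_range(int lo, int x, int hi) { return lo <= x && x <= hi; }
```

So logical_and(1, 2) is 1 but bitwise_and(1, 2) is 0, and
chained_compare(0, 50, 10) is 1 even though 50 is not between 0 and 10.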

So, the next point is this bit about C's operators.  There are too
many operators and too many precedence levels.  Some (like the logical
vs. bitwise problem) would not be necessary if C had better intrinsic
data types.  Others perform functions which would probably be better
done as function calls (intrinsics which could be inlined of course).
Still others (like pointer dereferencing) should probably not exist at
all.

In spite of all these operators, character string concatenation,
string comparison, and substring operations are _not_ operators.
Even Fortran is better.

Data type declaration "operators" (or whatever you want to call the
syntax elements) are particularly ugly, obscure, peculiar, difficult,
and arcane.  I'm told that this is because they wanted a declaration
of a data type to look like a use of that type.  This leads us to:

The use of complicated data types is particularly ugly, obscure,
peculiar, difficult, and arcane.  At least they met their goal, the
syntax of using the variables is every bit as bad as that for declaring
them.

Assignment operators are necessary in a procedural language.  But, these
combinations of assignment with other operators are just useless
syntactic sugar.  Personally, I don't care if the language has them or
not, but they do clutter up the syntax quite a bit.  The main problem
with assignment is not the operators, per se, but the fact that they are
allowed _within_ an expression.  There have been several well conducted
experiments on the effect of such operators on user productivity - the
conclusion has been that assignment should be a statement level operator
and _not_ an expression level one - at least, if you want to maximize
user productivity.
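
The classic mistake that expression-level assignment makes possible can
be sketched as follows.  The function names are hypothetical, and the
doubled parentheses are there only to keep compilers from warning; the
plain form `if (x = 0)` is just as legal.

```c
/* Intended to test x == 0.  The single '=' stores 0 in x and yields 0,
   so the condition is false for every input. */
int is_zero_buggy(int x)
{
    if ((x = 0))
        return 1;
    return 0;
}

/* The comparison that was meant. */
int is_zero(int x)
{
    return x == 0;
}
```

If assignment were a statement and not an expression, the first
function would be a syntax error and not a silent bug.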

While we're on the subject of productivity experiments, here are a few
other C features that have failed such tests:

	Control structures which use 'compound statements' (i.e., sequences
	    bounded by BEGIN/END, or {/} as C spells them).  Better is the
	    IF/ELSEIF/ELSE/ENDIF, WHILE/ENDWHILE, etc. style.  Even better
	    is allowing control constructs to be given unique labels and
	    matching them up (Ada and Fortran 90 have this feature).
	End-of-line ignored within comments.  Comments should be
	    terminated by the end-of-line mark.  C++ has the option
	    of doing this.  Unfortunately, it still retains the old
	    wraparound version as well (the danger of developing a
	    backward compatible language is the load of junk that you
	    can't get rid of).
	End-of-line ignored within statements.  The experimenters decided
	    that people just seem to regard the end-of-line as the same
	    as the end-of-statement, they really do.  Even C programmers
	    intuitively know this.  I examined 10,000+ lines of commercial
	    C code and found only 12 lines which used the C ability to
	    wrap statements across lines automatically.  Even so, forgotten
	    semicolons almost _all_ occur at the end-of-line, and it is
	    still a very common syntax error.  I think the end-of-line
	    mark should be a synonym for semicolon and should be escaped
	    in the rare (12 out of 10,000) case that a continuation is
	    needed.
	Pointers - well, we've talked about them.

        GOTOs.  This is an interesting subject because there are
            actually conflicting results here.  Spaghetti code clearly
            (and in the experiments, this was shown) causes massive
            productivity problems.  However, in the test involving
            BEGIN/END control flow brackets, GOTOs were found to be one
            of the things which were better (by about a factor of 2)
            than 'compound statements'.  Other experiments involving
            "disciplined" GOTO usage (with "disciplined" pretty much
            meaning what you'd expect) were compared with "Structured"
            GOTO-less programs and _no_ statistically significant
            difference in productivity was observed at all.  Actually,
            in this one case, I think C has got it exactly right - leave
            unrestricted GOTO in the language _and_ provide all the
            "Structured" control flow constructs.  One of the very few
            things that I think C did right.
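
As one illustration of what "disciplined" GOTO usage might look like
(the experiments themselves are not quoted here, so this is only an
assumed example), a single forward jump can leave a nested search
without extra flag variables.  The function name and the fixed width of
3 columns are hypothetical.

```c
/* Searches a rows x 3 grid for target.  Returns 1 and fills in the
   position if found, 0 otherwise. */
int find_in_grid(const int grid[][3], int rows, int target,
                 int *row_out, int *col_out)
{
    int r, c;

    for (r = 0; r < rows; r++)
        for (c = 0; c < 3; c++)
            if (grid[r][c] == target)
                goto found;      /* one forward jump out of both loops */
    return 0;

found:
    *row_out = r;
    *col_out = c;
    return 1;
}
```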

There are several other experimental results - this is just a sampling.
The only experiment that I've ever seen in which the losing feature
wasn't in C was the one that showed that semicolon should be a terminator
not a separator.  C got this one right.  C was on the wrong side of
every other experiment I've ever seen.

Some non-experimental features which are widely regarded as bad ideas:

	Case sensitive syntax.  In a case insensitive language, code
	    can be easily shared, teamwork is easier, and upper-case
	    can be used for emphasis or other documentation purposes.
	    In a case sensitive syntax, communication between sites
	    (or even down the hall) is impeded by differing case
	    conventions.  People waste time ironing this out and not
	    doing more useful work.
	Nonintuitive syntax.  This is very common in C.  If a concept
	    has a widely developed and simple notation which is compatible
	    with the keyboard and/or print devices available, the language
	    _should_ make every effort to accommodate this common notation.
	    I will give one specific example: what in the world possessed
	    them to use a leading zero to distinguish octal from decimal???
        Inconsistent syntax.  Also common in C.  An operator, keyword,
            or construct should have the same meaning (as nearly as
            possible) in every context in which it is allowed.  A
            specific example is the keyword 'static', which means that
            the memory for the corresponding variable being declared is
            permanently associated with the variable for the entirety of
            run-time - except in the beginning of a file (outside any
            procedure), where 'static' suddenly means the same thing
            that other languages call 'private'. (All variables declared
            outside of procedures have permanently allocated memory
            anyway - so, 'static' should be regarded as redundant
            there.)
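
The two specific examples above, the leading-zero octal notation and
the double life of 'static', can be sketched in a few lines of C.  All
of the names are hypothetical.

```c
/* A leading zero silently switches the literal to octal: 010 is eight. */
int looks_like_ten(void)
{
    return 010;
}

/* At file scope, 'static' means "not visible outside this file". */
static int ids_issued = 0;

int next_id(void)
{
    /* Inside a function, 'static' means "storage persists between
       calls"; last_id is initialized once, not on every call. */
    static int last_id = 100;

    ids_issued++;
    return ++last_id;
}

int issued_count(void)
{
    return ids_issued;
}
```

Here looks_like_ten() returns 8, and successive calls to next_id()
return 101, 102, and so on.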

Well, as I predicted, even touching on a small number of the obvious
problems takes several pages.  I trust that you can see there are
still others lurking in the language specification (like 'switch', which
doesn't automatically put a 'break' between the cases - whoops  - I can't
stop once I'm on a roll).
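
For the record, the 'switch' behavior looks like this.  It is a minimal
sketch with invented function names; the only difference between the
two functions is one 'break'.

```c
/* Case 1 has no 'break', so control falls into case 2 and the value
   assigned for case 1 is overwritten. */
int classify_missing_break(int n)
{
    int kind = 0;

    switch (n) {
    case 1:
        kind = 10;
    case 2:
        kind = 20;
        break;
    default:
        kind = -1;
        break;
    }
    return kind;
}

/* The same switch with the 'break' that was presumably intended. */
int classify(int n)
{
    int kind = 0;

    switch (n) {
    case 1:
        kind = 10;
        break;
    case 2:
        kind = 20;
        break;
    default:
        kind = -1;
        break;
    }
    return kind;
}
```

classify_missing_break(1) returns 20 where classify(1) returns 10, and
no diagnostic is required for the first version.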

J. Giles