Path: utzoo!mnetor!uunet!steinmetz!ge-dab!codas!pdn!alan
From: alan@pdn.UUCP (Alan Lovejoy)
Newsgroups: comp.lang.modula2
Subject: Re: union types
Message-ID: <2679@pdn.UUCP>
Date: 30 Mar 88 17:52:59 GMT
References: <8803221947.AA00248@nrl-iws6.ARPA> <4118@cup.portal.com> <850@vixie.UUCP>
Reply-To: alan@pdn.UUCP (0000-Alan Lovejoy)
Organization: Paradyne Corporation, Largo, Florida
Lines: 223

In article <850@vixie.UUCP> paul@vixie.UUCP (Paul Vixie Esq) writes:
>Sometimes you *want* that intervening obj_id.  In C, it's harder (though
>possible) to make a variant record where this intervening member needn't
>be named in references to the variant fields; in M2, you can do it thus:
>
>TYPE	ObjName = RECORD					(* note 1 *)
>		ObjType: INTEGER;
>		ObjId: RECORD
>			CASE BOOLEAN OF				(* note 2 *)
>				TRUE:   id: INTEGER|
>				FALSE:  path: POINTER TO CHAR;	(* note 3 *)
>			END
>		END
>	END;
>
>Note 1: we are creating a type in the C example, not a variable.

Who said otherwise?  The Modula-2 examples I have seen in this
discussion were all type definitions, weren't they?

>Note 2: No ':' before the type as far as I know; [brackets] may be needed
>	(I don't recall), and the type could be enumerated if more than
>	two variants are needed -- BOOLEAN is convenient but not mandatory.

You are both wrong and right:  the original syntax for Modula-2 did not
have a colon before the type of a tagless variant.  Most compilers
still support this syntax (usually as the only option).  However, Wirth
changed the syntax in the third edition of his book (PIM2e3) making the
colon required.  
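
For concreteness, here is a rough sketch of the two forms side by side.
The type names ObjIdOld and ObjIdNew are made up for illustration, and a
given compiler will normally accept only one of the two CASE headers, so
don't expect both declarations to compile in the same unit.

```modula2
(* Sketch only: ObjIdOld and ObjIdNew are hypothetical names. *)

(* Original syntax: no colon when the tag field is omitted. *)
TYPE ObjIdOld = RECORD
                  CASE BOOLEAN OF
                    TRUE:  id: INTEGER |
                    FALSE: path: POINTER TO CHAR
                  END
                END;

(* Third-edition syntax: the colon stays even though no tag is named. *)
TYPE ObjIdNew = RECORD
                  CASE : BOOLEAN OF
                    TRUE:  id: INTEGER |
                    FALSE: path: POINTER TO CHAR
                  END
                END;
```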

>Note 3: POINTER TO CHAR is one way to represent strings, but sometimes arrays
>	are used.  Sure would be great if open arrays were allowed in places
>	other than a formal argument on a procedure...

POINTER TO CHAR is a TERRIBLE way to represent strings (unless you hide
this representation behind an opaque type).  Why?

1) There is no guarantee that SIZE(aCharVariable) = SIZE(string[0])
(assuming the declarations: 
   VAR aCharVariable: CHAR; string: ARRAY [0..n] OF CHAR).

This is not just theoretical.  My 68k M2 compiler uses two bytes
for a character variable but one byte for each character in a 
string.  This breaks the following code:

  VAR

    cp, end: POINTER TO CHAR;
    string: ARRAY [0..n] OF CHAR;

  ...

  cp := ADR(string);
  end := ADDRESS(cp) + String.Length(string);
  WHILE ADDRESS(cp) < ADDRESS(end) DO
    Process(cp^);
    cp := ADDRESS(cp) + TSIZE(CHAR);
  END;

Even if we replace TSIZE(CHAR) with Char.lengthInAString, we still run
up against the problem that the compiler thinks cp^ is a reference to
two bytes, not one.  So it emits object code such as MOVE.W, ADD.W,
CMP.W, etc, when it should be emitting MOVE.B, ADD.B, CMP.B, etc. 
Whether this results in erroneous behaviour depends on the byte sex
of the CPU (and the byte sex assumed in the algorithm).

On the 68k, this is even more serious BECAUSE WORD MEMORY ACCESSES MUST
OCCUR ONLY AT EVEN ADDRESSES (at least on the 68000 and 68010; the 68020
relaxes this for data).  An odd effective address used with WORD
or LONGWORD data results in a processor-generated ADDRESS ERROR.

POINTER TO CHAR is not a portable way to represent strings.
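
A portable alternative, as a minimal sketch: index the array directly and
let the compiler pick the element size.  Process is an empty stand-in,
n = 79 and the sample text are arbitrary, and the loop assumes the usual
convention that a string shorter than its array ends at 0C.

```modula2
MODULE IndexScan;
  (* Sketch under assumptions: Process, n and the sample text are
     placeholders, not anything from a real library. *)

  CONST n = 79;

  VAR
    string: ARRAY [0..n] OF CHAR;
    i: CARDINAL;

  PROCEDURE Process(ch: CHAR);
  BEGIN
    (* placeholder: do something with ch *)
  END Process;

BEGIN
  string := "Hello";
  i := 0;
  (* The compiler knows the element size of 'string', so no
     TSIZE(CHAR) arithmetic and no POINTER TO CHAR is needed. *)
  WHILE (i <= n) & (string[i] # 0C) DO
    Process(string[i]);
    INC(i)
  END
END IndexScan.
```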

2) When the programmer sees 'string: POINTER TO CHAR', there is vital
information about this object which is completely missing:

  a) How big is the string?
  b) Has 'string' been properly initialized to point either to NIL
     or to some string?
  c) Does 'string' point to an object on the heap (memory for the
     string was allocated using NEW or ALLOCATE), or does it point
     to an object on the stack (string := ADR(aStackVariable))?
     You wouldn't want to call DISPOSE or DEALLOCATE on 'string'
     if it points to a stack variable.
  d) How many other pointer variables reference the same object?
     You don't want to DEALLOCATE 'string' if there are still
     active references to it.

POINTER TO CHAR is not a safe way to represent strings.

3) Programmers normally expect to be able to reference the i'th
character in a string using array-index syntax:  string[i].
If string is POINTER TO CHAR, that's not possible. Better is
'VAR string: POINTER TO ARRAY [0..Char.maxArray] OF CHAR;'.
'Char' is a definition module containing useful system dependent
parameters describing the properties of characters and arrays of
characters.  Char.maxArray is the highest zero-based index that
the compiler will allow for an ARRAY OF CHAR.  This permits
access to the i'th element using traditional syntax: string^[i],
yet still provides for pointer arithmetic and dynamic sizing.
It also finesses the SIZE(CHAR) problem.
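
A rough sketch of the idea, with assumptions flagged: maxArray is a local
constant standing in for Char.maxArray (its value here is invented), the
type names are hypothetical, and ALLOCATE/DEALLOCATE are taken from the
customary Storage module.  Only ten characters are actually allocated,
even though the pointer's type describes the largest possible array.

```modula2
MODULE PtrToArray;
  FROM SYSTEM  IMPORT TSIZE;
  FROM Storage IMPORT ALLOCATE, DEALLOCATE;

  CONST maxArray = 32766;  (* assumption: stands in for Char.maxArray *)

  TYPE
    CharArrayPtr = POINTER TO ARRAY [0..maxArray] OF CHAR;
    Ten          = ARRAY [0..9] OF CHAR;

  VAR
    string: CharArrayPtr;
    i: CARDINAL;

BEGIN
  (* Sizing by an array type lets the compiler account for however
     it packs characters inside arrays. *)
  ALLOCATE(string, TSIZE(Ten));
  FOR i := 0 TO 9 DO
    string^[i] := "x"     (* ordinary index syntax *)
  END;
  (* Caution: the declared bound is maxArray, so range checking
     will not catch an index beyond 9. *)
  DEALLOCATE(string, TSIZE(Ten))
END PtrToArray.
```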

Even better is:

  TYPE

    DynamicStringIndex = [0..Char.maxArray];
    
    DynamicString = 
      RECORD
        size: DynamicStringIndex;
        base: POINTER TO ARRAY DynamicStringIndex OF CHAR;
      END;
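
What the size field buys you, as a sketch: an access routine can refuse
an out-of-range index instead of reading whatever lies past the string.
CharAt is a hypothetical helper, maxArray again stands in for
Char.maxArray, and I am assuming 'size' counts the characters in use.

```modula2
MODULE SizedString;
  FROM SYSTEM IMPORT ADR;

  CONST maxArray = 32766;  (* assumption: stands in for Char.maxArray *)

  TYPE
    DynamicStringIndex = [0..maxArray];
    DynamicString =
      RECORD
        size: DynamicStringIndex;
        base: POINTER TO ARRAY DynamicStringIndex OF CHAR;
      END;

  VAR
    text: ARRAY [0..4] OF CHAR;
    s: DynamicString;
    ch: CHAR;
    ok: BOOLEAN;

  (* Hypothetical helper: returns FALSE rather than reading past
     the end of the string or through a NIL base. *)
  PROCEDURE CharAt(VAR s: DynamicString; i: CARDINAL;
                   VAR ch: CHAR): BOOLEAN;
  BEGIN
    IF (s.base # NIL) & (i < s.size) THEN
      ch := s.base^[i];
      RETURN TRUE
    ELSE
      RETURN FALSE
    END
  END CharAt;

BEGIN
  text := "hello";
  s.size := 5;
  s.base := ADR(text);
  ok := CharAt(s, 1, ch);   (* in range: ok is TRUE, ch is "e" *)
  ok := CharAt(s, 5, ch)    (* out of range: ok is FALSE *)
END SizedString.
```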

Best is:

  DEFINITION MODULE DynamicString;

    EXPORT QUALIFIED
      STRING, Index, ...;  (* PRIVATE is NOT exported *)  

    TYPE

      Index = [0..Char.maxArray];
      PRIVATE;
      STRING =
        RECORD
          size: Index;  (* read only variable *)
          base: PRIVATE;
        END;

4) "Open arrays" that are not procedure parameters are possible but 
do not come cheaply.  Assume the following declarations:  

  VAR
    string10: ARRAY [0..9] OF CHAR;
    string80: ARRAY [0..79] OF CHAR;
    foo: Bar;
    dynamicString: ARRAY OF CHAR;
    i: CARDINAL;

When the block in which these declarations reside is entered, the
statically sized objects (everything but 'dynamicString') can easily
be allocated on the stack.  But the size of 'dynamicString' is
undefined, so it cannot be allocated.  What can be allocated is
a hidden variable which will point to 'dynamicString', and a hidden
variable which will specify the size of 'dynamicString'.  Somewhere
in the block, a value may be assigned to dynamicString:

  dynamicString := string10;

It would be nice if we could allocate the memory for dynamicString
on the stack at this point.  If the usage of dynamicString is as
simple as this case is so far, we can.   The problem is how to 
allocate memory on the stack for multiple open arrays whose size
changes more than once during execution of the block (open array
procedure parameters don't have this problem because their size
is known at block entry and cannot change until block exit). 
When the size of an open array changes, the value returned by
ADR(anOpenArray) probably will have to change as well.  Algorithms
that are valid for static arrays will likely break if the static arrays
are redefined to be dynamic open arrays.

There is no general solution to this problem except to allocate
memory on the heap and not the stack.  So the only thing generic open
arrays give us is the ability to write 'anOpenArray[index]' instead of 
writing 'aDynamicArrayAllocatedByTheProgrammer^[index]'.  We could get 
the same effect by slightly changing the syntax of the language so that
'a[i]' is recognized as shorthand for 'a^[i]'.  Oh yes, and the compiler
would automatically allocate and deallocate for us, which completely
hides from the programmer the fact that these arrays are heap objects.
That has both its good and bad points.
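
To make the hidden machinery concrete, here is roughly what a compiler
might have to generate for 'dynamicString := string10', written out by
hand.  This is only a sketch: dynPtr and dynSize play the part of the two
hidden variables, maxArray stands in for Char.maxArray, and the Storage
module is assumed to supply ALLOCATE and DEALLOCATE.

```modula2
MODULE HiddenOpenArray;
  FROM SYSTEM  IMPORT TSIZE;
  FROM Storage IMPORT ALLOCATE, DEALLOCATE;

  CONST maxArray = 32766;  (* assumption: stands in for Char.maxArray *)

  TYPE
    String10 = ARRAY [0..9] OF CHAR;
    Hidden   = POINTER TO ARRAY [0..maxArray] OF CHAR;

  VAR
    string10: String10;
    dynPtr:   Hidden;      (* hidden pointer to 'dynamicString' *)
    dynSize:  CARDINAL;    (* hidden size, in storage units *)
    i: CARDINAL;

BEGIN
  dynPtr := NIL;  dynSize := 0;      (* done at block entry *)
  string10 := "abcdefghi";

  (* dynamicString := string10 *)
  IF dynPtr # NIL THEN
    DEALLOCATE(dynPtr, dynSize)      (* the old value goes away ... *)
  END;
  dynSize := TSIZE(String10);
  ALLOCATE(dynPtr, dynSize);         (* ... and its address may change *)
  FOR i := 0 TO 9 DO
    dynPtr^[i] := string10[i]
  END;

  DEALLOCATE(dynPtr, dynSize)        (* done at block exit *)
END HiddenOpenArray.
```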

It's simpler (for the compiler writer) not to open this can of worms.
If you feel you really need this functionality, I suggest you try
Smalltalk, LISP or APL.

Personally, I'd like to see new syntax permitting variables to 
have their initialization and termination processing defined 
as part of their declaration.  Example:

VAR
  i: CARDINAL := 0;  (* initialize i to zero *)
  a: POINTER TO ARRAY [0..n] OF CHAR 
   := NEW('Hello, world.') (* initialize a to NEW('Hello, world.');
                              NEW should be a function which accepts
                              the initial value of the allocated
                              object as its optional argument *)
   := DISPOSE(a); (* on termination of the block, assign DISPOSE(a) to a;
                     DISPOSE should also be a function *)
  x: REAL
   := 3.14159  (* initialize x to pi *)
   := circumference / (2.0 * radius);  (* on block exit, set x to be
                                          the value of this expression *)
  circumference: REAL := 0.0;
  radius: REAL := 1.0;

The block termination code would execute just before the expression
following a RETURN statement is evaluated, or else just before executing
a RETURN (if the block is not a function).  Notice that this can help
to guarantee that functions don't return dangling pointers.
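
For comparison, this is approximately what has to be written by hand
today to get the same effect for the pointer case.  It is a sketch, not
a spec for the proposal: FirstChar, Buffer and n are invented names, and
Storage is assumed to supply ALLOCATE and DEALLOCATE.

```modula2
MODULE ManualCleanup;
  FROM SYSTEM  IMPORT TSIZE;
  FROM Storage IMPORT ALLOCATE, DEALLOCATE;

  CONST n = 79;
  TYPE  Buffer = ARRAY [0..n] OF CHAR;

  VAR ch: CHAR;

  PROCEDURE FirstChar(): CHAR;
    VAR
      a: POINTER TO Buffer;
      result: CHAR;
  BEGIN
    (* Initialization the proposed syntax would fold into the
       declaration of 'a'. *)
    ALLOCATE(a, TSIZE(Buffer));
    a^ := "Hello, world.";

    result := a^[0];

    (* Termination code: must be repeated by hand before every
       RETURN, which is exactly what is easy to forget. *)
    DEALLOCATE(a, TSIZE(Buffer));
    RETURN result
  END FirstChar;

BEGIN
  ch := FirstChar()
END ManualCleanup.
```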

Another suggestion would be to change pointer semantics so that a
reference to a pointer variable denotes its dynamic object
instead of the address of its dynamic object:

  VAR

    p: POINTER TO FooBar;
    a: ADDRESS;
....

  p := aFooBar;   (* old syntax: p^ := aFooBar *)
  a^ := ADR(p);   (* old syntax: a := p *)
  a^ := p^;       (* old syntax: a := p *)

This makes it possible to abstract over an algorithm so that it is
valid either for pointers or non-pointers.  It's analogous to VAR
and VALUE parameters for procedures which make it possible to abstract
procedure calls with respect to arguments being passed as addresses
or as values.
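
The parameter analogy, spelled out as a small sketch (the module and
procedure names are invented): the two calls look identical at the call
site, and only the declaration says whether an address or a value is
passed.

```modula2
MODULE ParamModes;

  VAR x: CARDINAL;

  PROCEDURE ByValue(c: CARDINAL);
  BEGIN
    c := c + 1        (* changes only the local copy *)
  END ByValue;

  PROCEDURE ByReference(VAR c: CARDINAL);
  BEGIN
    c := c + 1        (* changes the caller's variable *)
  END ByReference;

BEGIN
  x := 1;
  ByValue(x);         (* x is still 1 *)
  ByReference(x)      (* x is now 2 *)
END ParamModes.
```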


--Alan@pdn