Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!sdd.hp.com!spool.mu.edu!news.nd.edu!mentor.cc.purdue.edu!pop.stat.purdue.edu!hrubin
From: hrubin@pop.stat.purdue.edu (Herman Rubin)
Newsgroups: comp.arch
Subject: Re: Compilers, architecture, and efficiency
Summary: Some specific answers
Message-ID: <11720@mentor.cc.purdue.edu>
Date: 1 May 91 19:34:15 GMT
Article-I.D.: mentor.11720
References: <1991Apr29.155945.29907@rice.edu> <1991Apr30.025904.13028@rice.edu> <3429@charon.cwi.nl>
Sender: news@mentor.cc.purdue.edu
Lines: 110

In article <3429@charon.cwi.nl>, dik@cwi.nl (Dik T. Winter) writes:
> In article <11609@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:

			.......................

>  > One of the reasons why, in the late 50s and early 60s, binary was settled on
>  > (there are many others) was that multiplication by powers of 2 was rather
>  > common.
> I have heard many reasons, but never this one.  Can you give a reference
> please?

An explicit place where I saw this was in a paper discussing the design of the
MANIAC 2.

>  >          Only a few machines have this very important operation in floating
>  > hardware, and instead have to bring in the multiplier and use the relatively
>  > slow multiplication.  It is true that with pipelined hardware this is not so
>  > much of a problem, but I know of only a few machines which had it before then.
> Mm, some 360's had it I think, but that was completely micro code, and if my
> memory is right, later versions did it through a straight multiplication.
> There must be some reason.  Also, multiplication by 2 is *not* easy on IEEE
> conformant machines (think gradual underflow).  Kahan had some (good) reasons
> to include gradual underflow.  But IEEE machines might use the 'adjust exponent'
> primitive.  On the other hand, IEEE says nothing about what has to be done in
> hardware and what can be done in software.  And on your favorite machine (Cyber
> 205) DP multiplication by 2.0 need not even be exact!

On the 360 this would have been difficult, because the machine was base 16
rather than base 2.  The machine has a floating instruction to halve.

One machine which had this instruction was the CDC 3600.  The instruction was
ADX, add to exponent, and it managed overflows and underflows exactly the way
that they would be for other floating operations.  Clearly this can be a fast
operation.

I did not state that the 205 was a really well-designed machine, just that it
is the most versatile vector machine I know.  I am even inclined to agree that
the virtual memory is not all that great.  Not only may DP multiplication by
2.0 fail to be exact, even by 1.0.  But compare this to the total nuisance to
do DP operations on a CRAY, for example!  In fact, it is not common for DP
operations to be exact in general on most machines.

			......................... 

> As has been asked before: give examples.  How would you like to write in your
> favorite language your favorite examples:
> 1.  Check bit in a bit stream and mark as seen.
> 2.  Calculate length of a run of 0's or 1's in a bit stream and mark them as
>     seen.
> HINTS:
> They could be done along the lines of:
> 	bitstream b;
> 	bit b1;
> 	int count;
> 	/* another gripe of you */
> 	external end_of_bits();
> 	trap_at_bits_end(b, end_of_bits);
> 	b1 = nextbit(b); /* I hear you about infix operators */
> 	count = countbits(b);  /* should it count the bit just consumed? */
 
This is the best I have been able to do for 1.  In the envisioned applications,
I would not be counting the bits, but transferring on them.  Pardon the
somewhat clumsier notation, as I am not used to trap pragmas.  I am assuming
registers are 32 bits.
	initialize pointers, etc.
	double int *end_of_bits, *loc; 
	int left;
	registerpair r;
	if (left -= 1 < 0){
		if(loc >= end_of_bits){
			refill();
			loc = start_of_bits;}
		left = 64;
		r = *loc++;}
	else r <<= 1;
	if (r < 0) goto ...
	
For 2, I have not done as well even on scalar machines.  On vector machines,
allowing collapse of vectors, it can be done fairly easily by having the input
bitstream act as a selection for consecutive integers, with appropriate handling
of ends.  It could be done by counting here, but generally there are faster
methods, although they are tricky.  I doubt that they would be recognized even 
if these are.

> Next question:  how do you envisage an *efficient* hardware implementation?

Have hardware hold bitstreams, and have faults on end_of_bits, just as there
are now page faults.  The operations are not at all difficult from a hardware
viewpoint.

> If there is one, compilers are quite able to translate the above code to
> that implementation.  If you know of that *efficient* hardware implementation
> and if it is present on a current architecture, just convince your compiler
> vendor to put that optimization in your compiler.  If not, convince your
> hardware vendor to put the operations required in the hardware (and show
> how it can be done efficiently), and next convince your compiler vendor.

Do you really see a compiler writer recognizing that sequence of code, or one
of its alternatives, as this operation?  Especially since I, knowing that tests
are bad, might use a slightly different code for such situations where I have
some control on how many times the operation will be used between tests.
This would enable the use of
	left -= 1;
	if (r<<=1 < 0) goto ...

and the reloading of registers becomes much more complicated.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)