Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site mips.UUCP
Path: utzoo!watmath!clyde!bonnie!akgua!whuxlm!harpo!decvax!decwrl!Glacier!mips!mash
From: mash@mips.UUCP (John Mashey)
Newsgroups: net.arch
Subject: Re: Stack architectures - why not?
Message-ID: <188@mips.UUCP>
Date: Fri, 13-Sep-85 05:03:34 EDT
Article-I.D.: mips.188
Posted: Fri Sep 13 05:03:34 1985
Date-Received: Sat, 14-Sep-85 17:07:25 EDT
References: <796@kuling.UUCP> <172@myriasa.UUCP> <1094@ulysses.UUCP>
Organization: MIPS Computer Systems, Mountain View, CA
Lines: 86

Steve Bellovin writes, in reply to Chris Gray:
> > I've been told by a couple of people who are normally well informed that
> > a pure stack architecture just isn't practical. They have NOT been able
> > to convince me of this. Anybody out there want to try?
>  
>  (bunch of comments, which seem pretty good ones)
> My conclusion:  the right answer, at least for now, is a machine with a good
> subroutine stack.  Other issues, notably the complexity of the instruction
> set, are open.
> 

In support of Steve's position, here are some additional comments.
As always, it is VERY hard to analyze architectural features in isolation,
i.e., all generalizations are false; nevertheless:

1) COSTS IN FUNDAMENTAL DATA ACCESS TIMES.  For a given level of technology,
it always seems faster to do 
	add	reg1,reg2,reg3
where this means a) select the values of reg1 and (if the register file is
dual-ported, at the same time) reg2, b) add them, c) gate the result back
into reg3, rather than [assuming A is TOS, B is TOS+1]
	add
where this means (as in B5500, for example):
a) make sure both A & B are valid; if not, make 1-2 memory fetches,
and put them into A&B
b) add them, putting result back in B
c) mark A invalid.
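
To make the comparison concrete, here is a toy sketch of the two cases,
counting memory fetches.  It is an illustration under assumptions, not a
model of real B5500 hardware: the class name, the fetch counter, and the
rule for moving B into A when only B is valid are all made up for the example.

```python
def register_add(regs, src1, src2, dst):
    """Three-address add: operands are already in registers, no memory fetch."""
    regs[dst] = regs[src1] + regs[src2]


class TwoRegisterStack:
    """Toy stack machine: A and B hold the top two entries, the rest is in memory."""

    def __init__(self, values):
        self.memory = list(values)   # in-memory part of the stack, top at the end
        self.a = self.b = None
        self.a_valid = self.b_valid = False
        self.fetches = 0             # memory fetches performed so far

    def _fetch(self):
        self.fetches += 1
        return self.memory.pop()

    def add(self):
        # a) make sure both A and B are valid, fetching from memory if needed
        if not self.a_valid:
            if self.b_valid:
                # assumption: a valid B is the current TOS, so it moves up to A
                self.a, self.a_valid = self.b, True
                self.b_valid = False
            else:
                self.a, self.a_valid = self._fetch(), True
        if not self.b_valid:
            self.b, self.b_valid = self._fetch(), True
        # b) add them, putting the result back in B
        self.b = self.a + self.b
        # c) mark A invalid
        self.a_valid = False
```

With the stack [1, 2, 3, 4] (4 on top), the first add() costs two fetches
and each later add() costs one more, while register_add() never touches
memory once its operands are loaded.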

Fewer registers = more memory traffic = faster access to registers at
lowest hardware level.
More registers = less memory traffic = slower access to the registers,
because either a) the registers not only have to act like registers,
but must also act like a giant shift register to keep the TOS at a
definite place, which (at chip level, anyway) gobbles real estate or
b) one needs an index register (related to the stack pointer) which
points to the TOS location within the register array.  This turns out
to be painful for the basic machine cycle, because it requires some
extra decoding time to find the TOS and the TOS+1 - unless there's
great trickery somewhere, I suspect there's an extra adder step required
somewhere, which is real ungood in the basic machine cycle.
You may note that most machines that have multiple register sets allocate
them in sets of powers of 2, so that register selection can occur by
concatenating the register number requested with high-order bits that
indicate which register set is used. Allowing variable-size
register windows is possible, but much harder.
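
A rough sketch of the difference between the two ways of finding a register.
The sizes (8 registers per set, 32 physical registers) and the function names
are assumptions picked for illustration only.

```python
SET_BITS = 3          # assumed: 8 registers per set, a power of 2
NUM_PHYSICAL = 32     # assumed size of the physical register array


def select_by_concatenation(set_number, reg_number):
    """High-order bits name the register set, low-order bits the register.

    The two fields never overlap, so this is just wiring: no carry chain.
    """
    return (set_number << SET_BITS) | reg_number


def select_by_tos_offset(tos_index, offset):
    """Find TOS+offset in a register array indexed from a movable TOS pointer.

    This needs a real addition (with wraparound) before the register
    array can even be addressed.
    """
    return (tos_index + offset) % NUM_PHYSICAL
```

For example, select_by_concatenation(2, 5) gives 21 with no arithmetic beyond
placing bits side by side, while select_by_tos_offset(31, 1) has to add and
wrap around to 0.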

2) PIPELINING PROBLEMS [this piece I'm less sure of]
At a given level of technology, one way to make things go faster is 
pipelining, or overlapping instruction execution.  Faster machines
tend to use more pipeline stages (not just IFETCH & EXECUTE, for example).
Among other things, this requires complex "bypassing", whereby the
results of one operation may dynamically feed into the next, because
the next has already started well before the first finishes. In
general, this is easiest to do for very simple architectures, i.e.,
like CDC or CRAY machines, which are load/store architectures with little
or no complex side-effects and exciting address modes.  The more complex
the architecture, the more complex becomes the detection and handling of
pipeline hazards; the more complex, the slower.  Recall the number of
oddities that have popped up over the years on machines with heavy use
of side-effects (like auto-increment addressing), especially in the
presence of memory protection errors; stack machines are like those,
but with auto-increment/decrement on almost every instruction!
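
A crude way to see the last point, as a sketch only: treat each instruction
as a pair of (things read, things written) and count adjacent pairs where the
second reads what the first writes.  The instruction encodings and the names
"sp" and "tos" are assumptions for the example, and real hardware may well
handle stack pointer updates more cleverly than this suggests.

```python
def adjacent_raw_hazards(program):
    """Count adjacent pairs where the second instruction reads a first's result."""
    count = 0
    for (_, writes), (reads, _) in zip(program, program[1:]):
        if writes & reads:
            count += 1
    return count


# (a+b) + (c+d) on a load/store register machine, operands already loaded.
# Each instruction is (set of registers read, set of registers written).
register_code = [
    ({"r1", "r2"}, {"r5"}),   # add r1,r2,r5
    ({"r3", "r4"}, {"r6"}),   # add r3,r4,r6
    ({"r5", "r6"}, {"r7"}),   # add r5,r6,r7
]

# The same expression on a stack machine: every instruction both reads and
# writes the (implicit) stack pointer.
push = ({"sp"}, {"sp", "tos"})
add = ({"sp", "tos"}, {"sp", "tos"})
stack_code = [push, push, add, push, push, add, add]
```

Here adjacent_raw_hazards(register_code) is 1 (only the last add waits on its
neighbor), while adjacent_raw_hazards(stack_code) is 6: every adjacent pair
is chained through the stack pointer.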

Although some of the original technology arguments have disappeared,
it is worth noting that:
a) Although many people have been able to cost-reduce architectures over
the years, it seems that the Burroughs architectures have been difficult
to move unchanged to lower price levels, and at upper performance
levels, they've tended to go to multi-processors.  There may, of course,
be other reasons for the latter, and Burroughs has been in MP for a long time,
but it is often the case that you do that when the technology is hard to
make go much faster in uni-processors. Current dyadic IBM CPUs are a similar
example.  [Not a criticism, just an observation;
I always admired the B5500 and its friends for the vision shown therein.]
b) One may conjecture why HP is replacing the (stack machine) HP-3000
with Spectrum (by all accounts, a RISC architecture of load/store variety)
for more performance.

BOTTOM LINE: stack machines are elegant in some ways, but very hard
to make either really cheap or really fast.  MAYBE current VLSI technology
can overcome some of this, but it's not at all clear.
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043