Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!ames!oliveb!pyramid!nsc!roger
From: roger@nsc.UUCP
Newsgroups: comp.arch,comp.sys.nsc.32k
Subject: Re: NS32532 Patents
Message-ID: <4230@nsc.nsc.com>
Date: Wed, 22-Apr-87 01:52:17 EST
Article-I.D.: nsc.4230
Posted: Wed Apr 22 01:52:17 1987
Date-Received: Thu, 23-Apr-87 23:40:13 EST
References: <4206@nsc.nsc.com> <4042@sci.UUCP>
Organization: National Semiconductor, Sunnyvale
Lines: 81
Xref: utgpu comp.arch:990 comp.sys.nsc.32k:102

In article <4042@sci.UUCP>, kenm@sci.UUCP (Ken McElvain) writes:
> > 
> > 1.)  The method of detecting and handling memory-mapped I/O
> >      by a pipelined microprocessor.  ----- 
> Not clear just what the problem is.  Presumably the I/O addresses
> can identify themselves, so the cache just has to pay attention.

There are two hardware mechanisms.  One is a hand-shake protocol
using two signals, one called IOINH/ and one called IODEC/.  These
will both force references to the data cache to be non-cacheable
as well as force the proper sequencing of reads and writes.  The second
mechanism dedicates the upper 16 Mbytes of the memory map to I/O.
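
To make the second mechanism concrete, here is a minimal sketch of the
address test.  It assumes a 32-bit address space, so that the upper
16 Mbytes begin at 0xFF000000; the names IO_REGION_BASE and is_io_address
are made up for this example and are not part of any chip interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 32-bit addresses, so the top 16 Mbytes of the map are
 * 0xFF000000..0xFFFFFFFF.  Hypothetical helper for illustration only. */
#define IO_REGION_BASE UINT32_C(0xFF000000)

/* Returns true when addr falls in the region reserved for I/O, i.e. when
 * the reference should bypass the data cache and be strictly ordered. */
static bool is_io_address(uint32_t addr)
{
    return addr >= IO_REGION_BASE;
}
```

A reference to 0xFF000000 or above is treated as I/O; anything below it is
ordinary cacheable memory in this model.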

> > 2.)  Maintaining coherence between a microprocessors integrated cache
> >      and the external memory.  ----
> An extra tag set for the instruction cache so it can monitor all writes
> to the data cache.  A simpler solution it to make it illegal architecturally
> to write into your own instruction stream and to provide a mechanism
> for flushing cache blocks.

The issue here is more related to providing hooks to allow hardware external
to the CPU to invalidate the internal caches.  There are 7 cache invalidate
address inputs and 4 control lines that will allow external hardware
to invalidate either an entire cache, a set of a cache, or an
individual line (16 bytes) of a cache or set.
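
The three invalidation granularities can be pictured with a small software
model of the valid bits.  This is only a sketch: the geometry (CACHE_SETS,
CACHE_LINES) and all function names are assumptions picked for
illustration, and nothing here describes how the pins are actually encoded.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical geometry, chosen only so the example compiles. */
#define CACHE_SETS  2
#define CACHE_LINES 32

typedef struct {
    bool valid[CACHE_SETS][CACHE_LINES];  /* one valid bit per 16-byte line */
} cache_model;

/* Invalidate the entire cache. */
static void invalidate_cache(cache_model *c)
{
    memset(c->valid, 0, sizeof c->valid);
}

/* Invalidate every line of one set. */
static void invalidate_set(cache_model *c, unsigned set)
{
    memset(c->valid[set], 0, sizeof c->valid[set]);
}

/* Invalidate a single line within a set. */
static void invalidate_line(cache_model *c, unsigned set, unsigned line)
{
    c->valid[set][line] = false;
}
```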

 
> > 3.)  Monitoring control flow in a microprocessor -----  
 
> We used a small special purpose cache for this.  The way it worked
> was that the address of the conditional branch was hashed down to 9 bits
> which were used to index a 512x2 bit ram.  The two bits were used to
> implement a "slow learner" state machine that predicted which way the
> branch would go.  We saw a 95% prediction rate if programs were allowed
> dropped into the 80-85% rate for our test cases.  Being a slow learner
> means that it only makes one mistake on the execution of a loop,
> on the very last pass.  We also tried various 1,2, and 3 bit state machines
> but none of them worked as well.  Credit for this goes to Mike Manlove at
> HP.  There is also quite a bit of literature on the subject.
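
For readers who have not seen this kind of predictor, the quoted scheme can
be sketched roughly as follows.  The 512-entry table of 2-bit states is
from the description above; the hash function and the choice of a
saturating up/down counter as the "slow learner" are assumptions, since the
posting gives neither.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_BITS 9
#define TABLE_SIZE (1u << TABLE_BITS)     /* 512 entries, 2 bits used in each */

static uint8_t counters[TABLE_SIZE];      /* states 0..3, all start at 0 */

/* Hypothetical hash: fold the branch address down to 9 bits. */
static unsigned hash9(uint32_t addr)
{
    return (addr ^ (addr >> 9) ^ (addr >> 18) ^ (addr >> 27)) & (TABLE_SIZE - 1);
}

/* States 2 and 3 predict taken, states 0 and 1 predict not taken. */
static bool predict(uint32_t branch_addr)
{
    return counters[hash9(branch_addr)] >= 2;
}

/* Move one step toward the observed outcome, saturating at 0 and 3.
 * A single surprise (such as a loop exit) does not flip a strong state. */
static void update(uint32_t branch_addr, bool taken)
{
    uint8_t *c = &counters[hash9(branch_addr)];
    if (taken) {
        if (*c < 3) ++*c;
    } else {
        if (*c > 0) --*c;
    }
}
```

With this state machine, a loop branch that has been taken several times
stays predicted taken after the single not-taken outcome at loop exit, so
the next execution of the loop mispredicts only on its last pass.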

Your approach is far more elaborate than the one we use. Part of the reason
is that the 32532 was/is targeted towards applications which are context
switch intensive.  Our approach takes into account that programs
typically have loops and that branches backward are taken more often
than not.  Our brochure is confusing in this area. The predictor section
of the chip has a separate address calculation unit so that this
can be done in parallel with other operations.  I will give a more
detailed response in this area in reference to a posting by Craig Hansen.
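
The loop heuristic by itself amounts to something like the sketch below.
It is only an illustration of "backward branches are usually taken": the
function name is invented, and predicting forward branches as not taken is
an assumption of the example rather than a statement about the chip.

```c
#include <stdbool.h>
#include <stdint.h>

/* Static prediction from branch direction alone: a branch whose target is
 * at a lower address than the branch itself probably closes a loop, so it
 * is predicted taken.  Forward branches are predicted not taken here
 * (assumption made for the example). */
static bool predict_taken(uint32_t branch_addr, uint32_t target_addr)
{
    return target_addr < branch_addr;
}
```

No table or history is needed, so there is no predictor state to lose
across a context switch.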

> > 6.)  Method for completing instructions without waiting for writes. ----
 
> I remember reading about CDC machines back in the dark ages doing this.
> Essentially the output fifo contained both addresses and data and
> each read did a partial comparison (about 8 bits) of the read address
> against all the write addresses in the fifo and if a match was found
> then the data was grabbed out of the fifo and the writes had priority.
> Virtual addressing might complicate this if aliasing is allowed.
 
Our approach is not this elaborate.  Since the data cache is write-through,
the cache is always up to date and external writes can be delayed.
In addition to this, there are mechanisms that check whether a subsequent
instruction is reading an operand before it has been written, even in the
cache.  In that case the read is delayed.  This is somewhat similar to how
the pipe handles register references.
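
The read-after-write check can be modelled along these lines.  This is a
sketch under stated assumptions, not a description of the hardware: the
buffer depth MAX_PENDING, the comparison at aligned 32-bit word
granularity, and every name in the code are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 4                /* hypothetical number of pending writes */

typedef struct {
    uint32_t addr[MAX_PENDING];      /* addresses of writes not yet performed */
    size_t   count;
} pending_writes;

/* Record a write that an instruction has issued but that has not yet
 * reached the cache.  Returns false when the buffer is full, in which
 * case a real pipeline would have to stall. */
static bool note_write(pending_writes *p, uint32_t addr)
{
    if (p->count == MAX_PENDING)
        return false;
    p->addr[p->count++] = addr;
    return true;
}

/* A read must be delayed if it touches the same aligned 32-bit word as
 * any pending write (word granularity is an assumption of this model). */
static bool read_must_wait(const pending_writes *p, uint32_t addr)
{
    for (size_t i = 0; i < p->count; i++)
        if ((p->addr[i] & ~UINT32_C(3)) == (addr & ~UINT32_C(3)))
            return true;
    return false;
}
```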
 
> > 7.)  Method of optimizing instruction fetches.
> Instruction buffers.
> Instruction caches.
> Fetching multiple paths simultaniously.
> Using branch prediction to fetch the probable path.
> Putting the instruction decoder on the other side of the instruction
> 	cache.  (this takes the next address and branch target calculation
>         out of the critical path)

The reference here was more related to fetching the instruction opcode
itself.  Yes, we have buffers, caches, etc. as you list above, but
since the CPU supports dynamic bus sizing, instruction fetching
can be from 8, 16 or 32 bit wide memory.  There are scenarios
where both non-sequential and sequential fetching are supported.
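
As a rough feel for what bus width costs, the sketch below counts the bus
cycles needed to fetch a run of bytes.  It assumes each cycle transfers one
aligned unit of the bus width (1, 2 or 4 bytes) and that the range does not
wrap past the top of the address space; bus_cycles is a made-up name, and
real fetch timing depends on details not covered here.

```c
#include <stdint.h>

/* Number of bus cycles needed to transfer nbytes starting at addr over a
 * bus that is bus_bytes wide (1, 2 or 4).  Assumes one aligned bus-width
 * unit per cycle and no address wrap-around. */
static unsigned bus_cycles(uint32_t addr, unsigned nbytes, unsigned bus_bytes)
{
    if (nbytes == 0)
        return 0;
    uint32_t first = addr / bus_bytes;                 /* first unit touched */
    uint32_t last  = (addr + nbytes - 1) / bus_bytes;  /* last unit touched */
    return (unsigned)(last - first + 1);
}
```

Under these assumptions an aligned 4-byte fetch takes 4 cycles from 8-bit
memory, 2 from 16-bit memory and 1 from 32-bit memory.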

Roger