Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!ames!oliveb!pyramid!nsc!roger
From: roger@nsc.UUCP
Newsgroups: comp.arch,comp.sys.nsc.32k
Subject: Re: NS32532 Patents
Message-ID: <4230@nsc.nsc.com>
Date: Wed, 22-Apr-87 01:52:17 EST
Article-I.D.: nsc.4230
Posted: Wed Apr 22 01:52:17 1987
Date-Received: Thu, 23-Apr-87 23:40:13 EST
References: <4206@nsc.nsc.com> <4042@sci.UUCP>
Organization: National Semiconductor, Sunnyvale
Lines: 81
Xref: utgpu comp.arch:990 comp.sys.nsc.32k:102

In article <4042@sci.UUCP>, kenm@sci.UUCP (Ken McElvain) writes:
> >
> > 1.) The method of detecting and handling memory-mapped I/O
> > by a pipelined microprocessor. -----
> Not clear just what the problem is. Presumably the I/O addresses
> can identify themselves, so the cache just has to pay attention.

There are two hardware mechanisms. One is a hand-shake protocol using
two signals, one called IOINH/ and one called IODEC/. These force
references to the data cache to be non-cacheable and also force the
proper sequencing of reads and writes. The second mechanism dedicates
the upper 16 Mbytes of the memory map to I/O.

> > 2.) Maintaining coherence between a microprocessor's integrated cache
> > and the external memory. ----
> An extra tag set for the instruction cache so it can monitor all writes
> to the data cache. A simpler solution is to make it illegal architecturally
> to write into your own instruction stream and to provide a mechanism
> for flushing cache blocks.

The issue here is more related to providing hooks that allow hardware
external to the CPU to invalidate the internal caches. There are 7
cache invalidate address inputs and 4 control lines that allow external
hardware to invalidate an entire cache, one set of a cache, or an
individual line (16 bytes) of a cache or set.

> > 3.) Monitoring control flow in a microprocessor -----
> We used a small special purpose cache for this.
> The way it worked
> was that the address of the conditional branch was hashed down to 9 bits
> which were used to index a 512x2 bit ram. The two bits were used to
> implement a "slow learner" state machine that predicted which way the
> branch would go. We saw a 95% prediction rate if programs were allowed
> dropped into the 80-85% rate for our test cases. Being a slow learner
> means that it only makes one mistake on the execution of a loop,
> on the very last pass. We also tried various 1,2, and 3 bit state machines
> but none of them worked as well. Credit for this goes to Mike Manlove at
> HP. There is also quite a bit of literature on the subject.

Your approach is far more elaborate than the one we use. Part of the
reason is that the 32532 was/is targeted towards applications which are
context switch intensive. Our approach takes into account that programs
typically have loops and that backward branches are taken more often
than not. Our brochure is confusing in this area. The predictor section
of the chip has a separate address calculation unit so that prediction
can be done in parallel with other operations. I will give a more
detailed response in this area in reference to a posting by Craig
Hansen.

> > 6.) Method for completing instructions without waiting for writes. ----
> I remember reading about CDC machines back in the dark ages doing this.
> Essentially the output fifo contained both addresses and data and
> each read did a partial comparison (about 8 bits) of the read address
> against all the write addresses in the fifo and if a match was found
> then the data was grabbed out of the fifo and the writes had priority.
> Virtual addressing might complicate this if aliasing is allowed.

Our approach is not this elaborate. Since the data cache is
write-through, the cache is always up to date and external writes can
be delayed.
In addition to this, there are mechanisms that check whether a
subsequent instruction is reading an operand before it has been
written, even to the cache. In that case the read is delayed. This is
somewhat similar to how the pipe handles register references.

> > 7.) Method of optimizing instruction fetches.
> Instruction buffers.
> Instruction caches.
> Fetching multiple paths simultaneously.
> Using branch prediction to fetch the probable path.
> Putting the instruction decoder on the other side of the instruction
> cache. (this takes the next address and branch target calculation
> out of the critical path)

The reference here was more related to fetching the instruction opcode
itself. Yes, we have buffers, caches, etc., as you list above, but
since the CPU supports dynamic bus sizing, instruction fetching can be
from 8, 16 or 32 bit wide memory. There are scenarios where both
non-sequential and sequential fetching are supported.

Roger
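
For anyone following the branch prediction exchange under item 3, here
is a rough sketch of the two ideas side by side: a 512-entry table of
2-bit counters indexed by a 9-bit hash of the branch address, and the
simpler "backward branches are usually taken" rule. Everything named
here is hypothetical. The XOR-fold hash, the use of a saturating
counter as the "slow learner" state machine, and the initial counter
state are all assumptions, and the backward-taken function only
illustrates the heuristic, not the actual 32532 logic.

```python
TABLE_BITS = 9
TABLE_SIZE = 1 << TABLE_BITS  # 512 entries, 2 bits each


def hash_address(addr):
    """Assumed hash: XOR-fold the branch address down to 9 bits."""
    h = 0
    while addr:
        h ^= addr & (TABLE_SIZE - 1)
        addr >>= TABLE_BITS
    return h


class TwoBitPredictor:
    """Table of 2-bit saturating counters (one assumed 'slow learner')."""

    def __init__(self):
        # States 0..3; 0-1 predict not taken, 2-3 predict taken.
        # Starting at 1 (weakly not taken) is an arbitrary choice.
        self.table = [1] * TABLE_SIZE

    def predict(self, addr):
        return self.table[hash_address(addr)] >= 2

    def update(self, addr, taken):
        i = hash_address(addr)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def predict_backward_taken(branch_addr, target_addr):
    """Static rule: predict taken when the target lies behind the branch."""
    return target_addr < branch_addr


def run_loop(predictor, branch_addr, iterations, passes):
    """Count mispredictions for a loop-closing branch.

    The branch is taken on every iteration except the last one of
    each pass through the loop.
    """
    misses = 0
    for _ in range(passes):
        for i in range(iterations):
            taken = i < iterations - 1
            if predictor.predict(branch_addr) != taken:
                misses += 1
            predictor.update(branch_addr, taken)
    return misses


if __name__ == "__main__":
    p = TwoBitPredictor()
    print(run_loop(p, 0x1040, iterations=10, passes=3))
    print(predict_backward_taken(0x1040, 0x1000))
```

Under these assumptions, a 10-iteration loop executed 3 times costs 4
mispredictions: one while the counter warms up, then one per pass at
the loop exit. That matches the "one mistake, on the very last pass"
behaviour described in the quoted text. The static rule needs no table
at all, so there is no state to lose across a context switch.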