Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!iuvax!rutgers!apple!voder!berlioz!nelson
From: nelson@berlioz (Ted Nelson)
Newsgroups: comp.arch
Subject: Re: SISC
Summary: Pipelining it would be easy?
Message-ID: <184@berlioz.nsc.com>
Date: 6 May 89 16:51:50 GMT
Reply-To: nelson@berlioz.UUCP (Ted Nelson)
Distribution: usa
Organization: National Semiconductor, Santa Clara
Lines: 60

I am fascinated by the entire concept of a single instruction computer,
and I feel it is possible that this idea will make it to market as an
extremely low-cost general-purpose processor. Of course, an entire
generation of software tools will have to be rethought; for one,
self-modifying code will become a much more powerful (necessary?) method.

But the memory dependence is extremely high. The van der Poel
instruction requires 3 operand fetches, 2 data reads, and one data write.
Assuming that these cannot take place concurrently, that we have a system
based on 100 ns memory, and ignoring all other factors, each instruction
takes 6 x 100 ns = 600 ns. This instruction rate is about equivalent to
a 12 MHz 68000, but each instruction is considerably less powerful.

First idea: Since the operand fetches are in adjacent words, we can
fetch them at the same time using triple-interleaved memory (this will
require a bit more logic than typical interleaving) and three separate
buses on the processor -- which is no problem since they are independent.
We could also take care of the data reads in the same way by putting a
(severe?) restriction on the software (a la RISC "let the compiler deal
with it"): the two operand addresses must not be congruent modulo 3, so
that they fall in different banks. Using this idea, each instruction
needs one memory cycle for the fetches, one for the reads, and one for
the write, so its memory access time drops to 300 ns -- twice the
throughput.

Second obvious idea: Pipeline the sucker.
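Before getting to the pipeline, the arithmetic behind the 600 ns and
300 ns figures can be checked with a small sketch. This is Python, and
the function and variable names are made up for illustration; the only
inputs are the access counts and the 100 ns memory cycle assumed above.

```python
# Assumption taken from the post: every memory access costs 100 ns
# and nothing else contributes to instruction time.
MEMORY_CYCLE_NS = 100


def instruction_time_ns(fetch_slots, read_slots, write_slots,
                        cycle_ns=MEMORY_CYCLE_NS):
    """Memory time for one instruction when the access slots run back to back."""
    return (fetch_slots + read_slots + write_slots) * cycle_ns


# Plain memory: 3 operand fetches, 2 data reads, 1 data write, one at a time.
serial_ns = instruction_time_ns(3, 2, 1)

# Triple-interleaved memory: the three fetches share one slot, and the two
# reads share another (given the modulo-3 restriction). The write is unchanged.
interleaved_ns = instruction_time_ns(1, 1, 1)

print(serial_ns, interleaved_ns)      # 600 300
print(serial_ns / interleaved_ns)     # 2.0
```
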
I only have a basic understanding of pipelines, but it seems to me that
a straight three or four stage pipe cannot work because of the memory
conflict -- the fetch (F), read (R), and write (W) stages cannot operate
concurrently. So let me propose two more stages: Computation (C)
{essentially the subtract} and Branch (B) computation based on the
condition code (the only condition code, Negative). The stages operate
FRCWB, and in operation will be as follows:

	F R C W B
	    F R C W B
	        F R C W B

As you can see, we still have a memory conflict between the Write of the
"current" instruction and the Read of the next instruction. My first
reaction was to add another software restriction: the Write and the two
Reads had to have addresses that all differ modulo 3. But I think that
this is too severe and renders it unusable -- this is too much for the
compiler to handle. Or is it?

Can anyone come up with a better pipelining scheme? Or any way of
improving the performance? Keep in mind that the market for this is as a
very low cost processor, so the problem cannot be solved by using
dual-port RAM. Unless, of course, dual-port RAM drops considerably in
price.

Or we could use National Semiconductor's new memory product: 1 Megabit
Write-Only Memory (WOM). This is extremely inexpensive, has an access
time of only 10 ns, and will be available in a dual-port version in only
a few months. If you wish to order any of this great part, please
contact me directly -- it is such a secret project that we haven't let
Marketing in on it yet.

-- Ted.
"When comes The Revolution, things will be different!
 Not better. Just different."
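
A minimal sketch of the FRCWB conflict described above, in Python. It
assumes a new instruction issues every two cycles, which is the spacing
at which the only collision is the Write of one instruction against the
Read of the next. The names (memory_conflicts, issue_interval) are
invented for this illustration, and the model treats F, R, and W as each
needing the single memory port for one cycle.

```python
STAGES = "FRCWB"             # Fetch, Read, Compute, Write, Branch
MEMORY_STAGES = set("FRW")   # stages assumed to need the one memory port


def memory_conflicts(num_instructions, issue_interval):
    """List the cycles in which more than one memory stage is active.

    Returns [(cycle, [(instruction_index, stage), ...]), ...] sorted by cycle.
    """
    busy = {}
    for instr in range(num_instructions):
        start = instr * issue_interval
        for offset, stage in enumerate(STAGES):
            if stage in MEMORY_STAGES:
                busy.setdefault(start + offset, []).append((instr, stage))
    return [(cycle, users) for cycle, users in sorted(busy.items())
            if len(users) > 1]


# Three instructions, one issued every two cycles.
for cycle, users in memory_conflicts(3, 2):
    print(cycle, users)
# 3 [(0, 'W'), (1, 'R')]
# 5 [(1, 'W'), (2, 'R')]
```

Changing issue_interval shows how the collisions move around under
other spacings; the sketch says nothing about branch hazards or about
the modulo-3 bank restriction.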