Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!jarthur!uunet!mcsun!ukc!edcastle!dcl-cs!aber-cs!athene!pcg
From: pcg@cs.aber.ac.uk (Piercarlo Grandi)
Newsgroups: comp.arch
Subject: Re: loadable control store, an idea whose time has gone
Message-ID:
Date: 2 Nov 90 15:35:45 GMT
References: <1536@ftc.framentec.fr> <1990Oct19.120218.9450@canterbury.ac.nz> <15497@hydra.gatech.EDU> <2176@lupine.NCD.COM> <42310@mips.mips.COM> <42488@mips.mips.COM> <2817@crdos1.c
Sender: pcg@aber-cs.UUCP
Organization: Coleg Prifysgol Cymru
Lines: 66
Nntp-Posting-Host: odin
In-reply-to: henry@zoo.toronto.edu's message of 1 Nov 90 04:45:15 GMT

On 1 Nov 90 04:45:15 GMT, henry@zoo.toronto.edu (Henry Spencer) said:

henry> In article <2817@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com
henry> (bill davidsen) writes:

davidsen> Loadable control store is a great idea, and can really improve
davidsen> the performance of a program...

henry> Well, if you're using a microprogrammed CPU with a control
henry> store in the first place. Nobody in his right mind designs a
henry> high-performance system that way any more, given a choice.

Uhmmmm. We remain to be convinced. What has been proven so far is that
general purpose microprogrammed instruction sets are not a win, because
the high level instructions that you can then implement are mostly
useless in a general purpose environment.

But this thread was about the usefulness of having multiple high level
instruction sets, each tailored to one particular purpose. There is no
reason for this not to work, and it results in impressive code size
reductions.

henry> You improve the performance of the program even more by going
henry> to a RISC CPU which has a cache instead of a control store and
henry> runs user code at one instruction per cycle.

Only if you have unlimited real memory...
But what you say is still mostly true -- in the sense that this implies
inlining the ad-hoc high level instructions, and this seems better than
offlining them in the control store. This works well because most time
in a program is spent in loops, and loops can fit, even inlined, in the
cache, and we can build caches that are fast enough and do not steal
bandwidth from data accesses (Harvard architectures).

But here we have a tradeoff -- the same effect may be achieved either
with a single high level instruction (tailored for the purpose -- one
could even have a tool to generate an ad hoc microcode for the specific
program) that expands to a call to an offline sequence of micro
instructions in control store, or with an already expanded sequence of
simple instructions in an I cache, but the performance implications are
very different. In the offline case we have extra dispatch time, but the
micro instructions have even more direct access to the innards of the
CPU/ALU. In the inline case we have direct execution, but the simple
instructions are more abstract.

Little can be done to avoid the problem with extra dispatch time to the
control store, except implementing the simplest high level instructions
as special, direct execution, cases; on the other hand we could have
very low level, microprogram like, instructions at the architecture
level (e.g. VLIW), but then so much of the CPU/ALU innards is exposed
that recompilation becomes necessary across the architecture
implementations, which has been a no-no since the System/360 days.

There are also the system wide implications -- better code density makes
for smaller working sets, and even small improvements in code locality
mean much lower page fault rates, and given the relative cost of a page
fault, this may be important.

Currently code density is not reckoned important, and the extra dispatch
time to the control store is. Maybe offlining will become more important
with superscalars (it is already important with vector machines).
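
To make the offline/inline tradeoff above a bit more concrete, here is a
toy cost model in Python. It is only a sketch: the function names and
every number in it are made-up illustrations, not measurements of any
machine. It assumes the I cache always hits, that simple instructions
and micro instructions each take one cycle, and that the micro sequence
is shorter than the inline one because it sees more of the CPU/ALU
innards.

```python
# Toy cost model of the offline (control store) vs inline (I cache) tradeoff.
# All names and numbers are hypothetical; nothing here is measured data.

def offline_cycles(calls, dispatch, micro_ops):
    """Each call pays the dispatch overhead, then runs micro_ops micro instructions."""
    return calls * (dispatch + micro_ops)

def inline_cycles(calls, simple_ops):
    """Each call runs simple_ops simple instructions, one per cycle, no dispatch."""
    return calls * simple_ops

def break_even_dispatch(micro_ops, simple_ops):
    """Largest dispatch overhead (cycles) at which offline is no slower than inline."""
    return simple_ops - micro_ops

def offline_code_bytes(sites, instr_bytes):
    """Main memory footprint: one high level instruction per call site."""
    return sites * instr_bytes

def inline_code_bytes(sites, simple_ops, instr_bytes):
    """Main memory footprint: the expanded sequence repeated at every call site."""
    return sites * simple_ops * instr_bytes

if __name__ == "__main__":
    # Assumed: 1000 executions, 3 cycles dispatch, 4 micro ops vs 6 simple ops.
    print("offline cycles:", offline_cycles(1000, 3, 4))
    print("inline cycles: ", inline_cycles(1000, 6))
    print("break-even dispatch:", break_even_dispatch(4, 6))
    # Assumed: 50 call sites, 4 byte instructions.
    print("offline bytes:", offline_code_bytes(50, 4))
    print("inline bytes: ", inline_code_bytes(50, 6, 4))
```

With these assumed numbers the inline version wins on cycles while the
offline version is several times smaller in main memory. The model
leaves out page faults entirely, so it says nothing about how much the
smaller working set is worth.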
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk