Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!mit-eddie!uw-beaver!rice!brazos.rice.edu!bbc
From: bbc@legia.rice.edu (Benjamin Chase)
Newsgroups: comp.arch
Subject: Re: BitBlt, new instructions for RISC.
Message-ID:
Date: 26 Feb 90 01:55:10 GMT
References: <7466@pdn.paradyne.com>
Sender: root@rice.edu
Reply-To: Benjamin Chase
Distribution: usa
Organization: Center for Research on Parallel Computations
Lines: 124
In-reply-to: alan@oz.paradyne.com's message of 23 Feb 90 20:32:33 GMT

alan@oz.paradyne.com writes:

>In an email message to me, bbd@rice.edu wrote:
>To which I attempted to reply via email. However:
>[it failed]
>So here is my reply:
>I apologize: I did not realize how easy it would be to misinterpret what I
>wrote.

How painfully true. I'm bbc@rice.edu (as in British Broadcasting
Company), not bbd@rice.edu.

> I did not mean to imply that pixels which are vertically sequential
>are also stored sequentially in memory. The point of my comment was how
>"narrow" the screen is in terms of words of memory, to make it graphically :-)
>obvious that many blits fall within a single word or two.

But the _screen_ isn't narrow. A modern monochrome display is 32
(32-bit) words wide.

>The point about
>vertical lines is that they always (unless they, or the pixels themselves,
>are very thick) fall within one or two words horizontally.

Yes, and ~1000 words vertically. And these words are spaced at ~32-word
intervals. Thus, when you draw a vertical line, you get a word of screen
memory, perform some operation on it to turn on or off ~1 bit of that
word, and then write it back to screen memory.
Then, you get the next word, which is 32 words further along, and
whoops, it's not in the data cache, because your silly fetcher got the
next 3 consecutive words of screen memory (which you won't be needing
right now because your vertical line is so skinny and all), rather than
the next 3 words spaced at 32-word offsets, which is what you really
wanted it to do if you were drawing a vertical line.

>However, be advised that character
>glyphs may be stored one character glyph per character (so that the entire
>glyph bitmap fits into a single cache line), or perhaps one whole font face
>per glyph.

Fine, I'll certainly agree with "one character glyph per character",
and that for any normal (monochrome) screen resolution and font face, a
row of a character glyph fits into a single cache line, and usually
into a single machine word.

> In any case, characters are normally blitted in a loop which
>writes a whole string to the screen, often soon followed by another. So it
>is likely that the most frequently used characters would have their pixels
>in the cache already.

But that's only half the problem. What about the screen memory that
isn't in the cache because it won't fit? (Especially when you're
filling the cache with screen words that you won't be writing upon, as
in my example above.)

>Well, suppose you want to market your CPU as a graphics engine?

That is another story entirely. I thought this whole thing started by
discussing what sorts of new instructions might fit into the next RISC
processor. Granted, just because it's a "RISC processor" (TM) :-)
doesn't mean it won't be marketed as a graphics engine.

Now, getting back to "new instructions for RISCs", sort of...
[I think most of this is from a brain-storming session one recent night
that afflicted me and Preston Briggs. :-) ]

Suppose you're designing a superscalar architecture.
A recent example of such an architecture allows a branch, an "integer
instruction", and a floating point instruction to be issued all at
once. This seems like a good idea, especially if your chip set will be
used for heavy-duty number crunching, or for generating really high
SPEC marks, especially on those floating point benchmarks. :-)

But if you weren't so interested in floating point, what else would you
put into that part of the instruction? Well, probably not graphics
operations, per se, since you wouldn't expect graphics operations to be
sprinkled throughout your code, where the compiler would find them and
fold them together with branches and additions into neat little
triplets. Even counting on a MERGE instruction to occur throughout your
"instruction stream" is a little extreme, because if all the memory
accesses turn out to be aligned, you've just thrown away a large chunk
of your superscalar chip's potential performance (cf. relative
performance of the IBM 6000s on integer vs. floating point benchmarks).

Perhaps in the floating point slot of the superscalar instruction,
you'd put (among other things?) little hints to the cache instead?
Whisper gently into the processor's ear, saying "get the value in
register N, and use that as a stride for yanking words into the cache
until I tell you something different". Perhaps N contains the width (in
machine words) of a bitmap, and this will cause the cache to get the 4
words out of your bitmap, one (conceptually) stacked on top of another,
so that you can blit 4 rows of your character glyph into them, before
missing again.

Or maybe you're addressing an array in the "wrong order" (for a
particular language), not getting consecutive elements, but getting
only one element out of each row or column, before moving onto the next
row or column. Or maybe you used one of those wacky array-reshaping
commands from FORTRAN 2001... :-) Then register N might contain the
number of words in a row or column.
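
As a rough illustration of the stride hint described above, here is a
toy Python model, not a description of any real machine. It assumes a
fully associative word cache with FIFO replacement and a 256-word
capacity, and a fetcher that on each miss loads the missed word plus
the next 3 words at a given stride. The names count_misses,
vertical_line, and fill_stride are made up for this sketch; the 32-word
screen width and ~1000 rows are the figures used earlier in the post.

```python
from collections import OrderedDict

WORDS_PER_FILL = 4  # the missed word plus the next 3 (assumed fill size)


def count_misses(addresses, fill_stride, capacity_words=256):
    """Count misses in a toy fully associative, FIFO-replaced word cache.

    On a miss, the fetcher loads the missed word plus the next three
    words spaced fill_stride words apart. fill_stride=1 models an
    ordinary fill of consecutive words; a larger value models the
    hypothetical stride hint.
    """
    cache = OrderedDict()  # word address -> None, oldest entry first
    misses = 0
    for addr in addresses:
        if addr in cache:
            continue  # hit: nothing to fetch
        misses += 1
        for k in range(WORDS_PER_FILL):
            word = addr + k * fill_stride
            if word not in cache:
                if len(cache) >= capacity_words:
                    cache.popitem(last=False)  # evict the oldest word
                cache[word] = None
    return misses


def vertical_line(base, rows, width_words):
    """Word addresses touched by a one-pixel-wide vertical line."""
    return [base + row * width_words for row in range(rows)]


SCREEN_WIDTH_WORDS = 32  # monochrome screen width in 32-bit words
SCREEN_ROWS = 1000       # roughly 1000 rows

line = vertical_line(base=5, rows=SCREEN_ROWS,
                     width_words=SCREEN_WIDTH_WORDS)
consecutive_misses = count_misses(line, fill_stride=1)
strided_misses = count_misses(line, fill_stride=SCREEN_WIDTH_WORDS)
print("consecutive fill:", consecutive_misses, "misses")
print("stride-32 fill:  ", strided_misses, "misses")
```

In this model the consecutive fill misses on every row of the vertical
line, while the stride-32 fill misses only once per four rows. A real
cache with multi-word lines and set associativity would behave
differently in detail, so treat the numbers as illustrative only.
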
Or maybe the correct thing to do in a particular case is tell the
processor not to load anything extra into the cache, just get what's
needed, because the next several instructions will be loading words
from all over the place, and just because location A was loaded is
absolutely no indication that location A+1 will be needed anytime soon.

Or maybe the current instruction is loading a value that won't be
needed again for a loooong time. In this case, we'd like to tell the
cache to just hand the word to the processor, and not waste any space
keeping a copy for itself.

A good compiler might notice all of these cases, and construct the
appropriate whisper for that third slot of the superscalar instruction.

Does anyone else have thoughts on this? What kinds of things would
_you_ put in the slots of your superscalar chip(s)?

-- Ben Chase, Rice University, Houston, Texas