Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!mit-eddie!uw-beaver!rice!brazos.rice.edu!bbc
From: bbc@legia.rice.edu (Benjamin Chase)
Newsgroups: comp.arch
Subject: Re: BitBlt, new instructions for RISC.
Message-ID: <BBC.90Feb25195510@legia.rice.edu>
Date: 26 Feb 90 01:55:10 GMT
References: <7466@pdn.paradyne.com>
Sender: root@rice.edu
Reply-To: Benjamin Chase <bbc@rice.edu>
Distribution: usa
Organization: Center for Research on Parallel Computations
Lines: 124
In-reply-to: alan@oz.paradyne.com's message of 23 Feb 90 20:32:33 GMT

alan@oz.paradyne.com writes:

>In an email message to me, bbd@rice.edu wrote:
>To which I attempted to reply via email. However:
>[it failed]
>So here is my reply:

>I apologize:  I did not realize how easy it would be to misinterpret what I
>wrote.

How painfully true.  I'm bbc@rice.edu (as in British Broadcasting
Corporation), not bbd@rice.edu.

>  I did not mean to imply that pixels which are vertically sequential
>are also stored sequentially in memory.  The point of my comment was how
>"narrow" the screen is in terms of words of memory, to make it graphically :-)
>obvious that many blits fall within a single word or two.

But the _screen_ isn't narrow.  A modern monochrome display, at roughly
1024 pixels across, is 32 (32-bit) words wide.

>The point about
>vertical lines is that they always (unless they, or the pixels themselves,
>are very thick) fall within one or two words horizontally.

Yes, and ~1000 words vertically.  And these words are spaced at ~32-word
intervals.  Thus, when you draw a vertical line, you get a word of
screen memory, perform some operation on it to turn on or off ~1 bit
of that word, and then write it back to screen memory.  Then, you get
the next word, which is 32 words further along, and whoops, it's not
in the data cache, because your silly fetcher got the next 3
consecutive words of screen memory (which you won't be needing right
now because your vertical line is so skinny and all), rather than the
next 3 words spaced at 32 word offsets, which is what you really
wanted it to do if you were drawing a vertical line.
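
To make that access pattern concrete, here is a minimal C sketch of the
vertical-line loop.  The names (draw_vline, WORDS_PER_ROW), the 1024x1024
one-bit-per-pixel layout, and the "leftmost pixel is the most significant
bit" convention are assumptions for illustration, not taken from any
particular display.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 1024x1024 monochrome, 32 pixels per 32-bit word. */
enum { SCREEN_W = 1024, SCREEN_H = 1024, WORDS_PER_ROW = SCREEN_W / 32 };

/* Turn on the pixels of column x for rows y0..y1 inclusive. */
static void draw_vline(uint32_t *fb, int x, int y0, int y1)
{
    assert(x >= 0 && x < SCREEN_W);
    assert(y0 >= 0 && y0 <= y1 && y1 < SCREEN_H);

    /* Assumption: pixel 0 of each word is the most significant bit. */
    uint32_t mask = UINT32_C(1) << (31 - (x % 32));
    uint32_t *p = fb + (size_t)y0 * WORDS_PER_ROW + (size_t)(x / 32);

    for (int y = y0; y <= y1; y++) {
        *p |= mask;          /* fetch a word, change 1 bit, write it back */
        p += WORDS_PER_ROW;  /* the next word needed is 32 words along */
    }
}
```

Each trip through the loop touches exactly one word, and no two of those
words are neighbours in memory, so a fetcher that pulls in the next few
consecutive words gets nothing useful for this loop.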

>However, be advised that character
>glyphs may be stored one character glyph per character (so that the entire
>glyph bitmap fits into a single cache line), or perhaps one whole font face
>per glyph.

Fine, I'll certainly agree with "one character glyph per character",
and that, for any normal (monochrome) screen resolution and font face,
a row of a character glyph fits into a single cache line, and
usually into a single machine word.

>  In any case, characters are normally blitted in a loop which
>writes a whole string to the screen, often soon followed by another.  So it 
>is likely that the most frequently used characters would have their pixels
>in the cache already.

But that's only half the problem.  What about the screen memory that
isn't in the cache because it won't fit?  (Especially when you're
filling the cache with screen words that you won't be writing upon, as
in my example above.)

>Well, suppose you want to market your CPU as a graphics engine?

That is another story entirely.  I thought this whole thing started by
discussing what sorts of new instructions might fit into the next RISC
processor.  Granted, just because it's a "RISC processor" (TM) :-)
doesn't mean it won't be marketed as a graphics engine.

Now, getting back to "new instructions for RISCs", sort of...  [I
think most of this is from a brain-storming session one recent night
that afflicted me and Preston Briggs. :-) ]

Suppose you're designing a superscalar architecture.  A recent example
of such an architecture allows a branch, an "integer instruction", and
a floating point instruction to be issued all at once.  This seems
like a good idea, especially if your chip set will be used for
heavy-duty number crunching, or for generating really high SPEC marks,
especially on those floating point benchmarks. :-) But if you weren't
so interested in floating point, what else would you put into that
part of the instruction?

Well, probably not graphics operations, per se, since you wouldn't
expect graphics operations to be sprinkled throughout your code, where
the compiler would find them and fold them together with branches and
additions into neat little triplets.  Even counting on a MERGE
instruction to occur throughout your "instruction stream" is a little
extreme, because if all the memory accesses turn out to be aligned,
you've just thrown away a large chunk of your superscalar chip's
potential performance (cf. relative performance of the IBM RS/6000s on
integer vs. floating point benchmarks).

Perhaps in the floating point slot of the superscalar instruction,
you'd put (among other things?) little hints to the cache instead?
Whisper gently into the processor's ear, saying "get the value in
register N, and use that as a stride for yanking words into the cache
until I tell you something different".

Perhaps N contains the width (in machine words) of a bitmap, and this
will cause the cache to get 4 words out of your bitmap, each
(conceptually) stacked on top of the next, so that you can blit 4 rows
of your character glyph into them, before missing again.
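
A toy model shows what such a hint might buy.  The sketch below is not any
real cache: it assumes a tiny FIFO cache that, on a miss, fetches 4 words
spaced fill_stride words apart, starting at the missing address.  A
fill_stride of 1 stands in for an ordinary consecutive fill (ignoring line
alignment), and a fill_stride equal to the bitmap width stands in for the
"register N" hint.  All names here are hypothetical.

```c
#include <stddef.h>

enum { CACHE_WORDS = 64, FILL_WORDS = 4 };

struct toy_cache {
    size_t resident[CACHE_WORDS]; /* word addresses currently cached */
    size_t count;                 /* number of valid entries */
    size_t next;                  /* FIFO replacement cursor */
    size_t fill_stride;           /* spacing of words fetched on a miss */
    unsigned long misses;
};

static int cached(const struct toy_cache *c, size_t addr)
{
    for (size_t i = 0; i < c->count; i++)
        if (c->resident[i] == addr)
            return 1;
    return 0;
}

static void install(struct toy_cache *c, size_t addr)
{
    if (cached(c, addr))
        return;
    c->resident[c->next] = addr;           /* overwrite the oldest entry */
    c->next = (c->next + 1) % CACHE_WORDS;
    if (c->count < CACHE_WORDS)
        c->count++;
}

/* Access one word; on a miss, fetch FILL_WORDS words at the fill stride. */
static void touch(struct toy_cache *c, size_t addr)
{
    if (cached(c, addr))
        return;
    c->misses++;
    for (size_t k = 0; k < FILL_WORDS; k++)
        install(c, addr + k * c->fill_stride);
}

/* Walk down one column of a bitmap that is row_words wide; count misses. */
static unsigned long column_walk_misses(size_t fill_stride,
                                        size_t row_words, size_t rows)
{
    struct toy_cache c = {0};
    c.fill_stride = fill_stride;
    for (size_t y = 0; y < rows; y++)
        touch(&c, y * row_words);
    return c.misses;
}
```

Under this model, column_walk_misses(1, 32, 1000) counts 1000 misses (every
access misses), while column_walk_misses(32, 32, 1000) counts 250, one miss
per 4 rows.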

Or maybe you're addressing an array in the "wrong order" (for a
particular language), not getting consecutive elements, but getting
only one element out of each row or column, before moving onto the
next row or column.  Or maybe you used one of those wacky
array-reshaping commands from FORTRAN 2001...  :-) Then register N
might contain the number of words in a row or column.
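
The array case looks like this in C, which stores arrays row by row.  This
is only a sketch with made-up names (column_sum and its parameters);
the point is that the distance between consecutive accesses is the constant
"cols", which is the sort of value register N would hold.

```c
#include <stddef.h>

/*
 * Sum one column of a row-major matrix stored in a flat array.
 * Consecutive accesses are `cols` elements apart, never adjacent.
 */
static long column_sum(const int *m, size_t rows, size_t cols, size_t col)
{
    long sum = 0;
    const int *p = m + col;      /* element (0, col) */

    for (size_t r = 0; r < rows; r++) {
        sum += *p;
        if (r + 1 < rows)
            p += cols;           /* stride: one full row of elements */
    }
    return sum;
}
```

For a 4x8 matrix filled with 0..31 in row order, column_sum(m, 4, 8, 3)
reads elements 3, 11, 19 and 27.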

Or maybe the correct thing to do in a particular case is tell the
processor not to load anything extra into the cache, just get what's
needed, because the next several instructions will be loading words
from all over the place, and just because location A was loaded is
absolutely no indication that location A+1 will be needed anytime
soon.

Or maybe the current instruction is loading a value that won't be
needed again for a loooong time.  In this case, we'd like to tell the
cache to just hand the word to the processor, and not waste any space
keeping a copy for itself.
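
Some compilers offer a software hint in roughly this spirit.  The sketch
below uses the GCC/Clang builtin __builtin_prefetch with its locality
argument set to 0, which marks the data as having no temporal locality.  It
is a prefetch hint, not the cache-bypassing load described above, and the
hardware is free to ignore it.  The macro name, the function name, and the
prefetch distance AHEAD are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__)
/* rw = 0 (read), locality = 0 (no need to keep the data cached) */
#define HINT_USE_ONCE(p) __builtin_prefetch((p), 0, 0)
#else
#define HINT_USE_ONCE(p) ((void)(p))   /* no hint available: do nothing */
#endif

enum { AHEAD = 16 };  /* hypothetical prefetch distance, in words */

/* Sum a vector whose elements will each be read exactly once. */
static uint64_t sum_once(const uint32_t *v, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n)               /* stay inside the array */
            HINT_USE_ONCE(&v[i + AHEAD]);
        sum += v[i];
    }
    return sum;
}
```

The result is the same with or without the hint; only the cache's
behaviour might differ.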

A good compiler might notice all of these cases, and construct the
appropriate whisper for that third slot of the superscalar
instruction.

Does anyone else have thoughts on this?  What kinds of things would
_you_ put in the slots of your superscalar chip(s)?
--
	Ben Chase <bbc@rice.edu>, Rice University, Houston, Texas