Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!gem.mps.ohio-state.edu!usc!apple!sun-barr!newstop!sun!pepper!cmcmanis
From: cmcmanis%pepper@Sun.COM (Chuck McManis)
Newsgroups: comp.sys.amiga
Subject: Re: 3000 wishes
Message-ID: <125964@sun.Eng.Sun.COM>
Date: 6 Oct 89 22:28:03 GMT
References: <4875@cps3xx.UUCP>
Sender: news@sun.Eng.Sun.COM
Reply-To: cmcmanis@sun.UUCP (Chuck McManis)
Organization: Sun Microsystems, Mountain View
Lines: 100

In article <4875@cps3xx.UUCP> porkka@frith.UUCP (Joe Porkka) writes:
> Make sprites work in hires, and allow them to be as wide as a playfield;
> make them as deep (bitplane wise) too... make it so that you can have 32 
> indipendent sprites per scan line.

I designed one hardware graphics system, and helped with another as part
of a multiperson group that was working on the Intel 82786. 
Both had something similar to this, and both attacked the problem in 
different ways. The difficulties that come up are similar though.

Sprites and windows can be thought of as a memory management problem. One
linear space (the viewscreen) may be composed of several discrete chunks
of a larger workspace. On a pixel by pixel basis you get to decide where
that pixel will come from in the workspace. Fortunately, you can make some
optimizations because you know that pixels will be accessed in sequential
order. The problem is access time for the translation tables. Since a scan
line may be as short as 14 microseconds (for a non interlaced 1K X 1K display
at 66Hz) you need to do pixel translations in as few as 14 nanoseconds. 
And if you can do 14 nanosecond translations then you can have an arbitrary
number of windows aligned on arbitrary boundaries on your screen. Now, however,
if you want to do "sprites" which can be "transparent" you may do your 
translation, only to find out that the sprite you translated to has a 
transparent pixel, and now you have to find the pixel "under" it. If you 
used up your 14 nanoseconds getting to the first sprite, you're hosed because
the beam will move on. Anyway, it isn't this bad at NTSC rates. With a 
640 X (200/400) screen and a 15kHz scan rate, you only have to map pixels
within 99 nanoseconds. So visualize the following scene at the pixel 
multiplexor :

    Beam Position
        X    Y
        |    |
        V    V
	sprite 0 ------\
	sprite 1 ------\\
	sprite 2 ------\\\             +-----+
	sprite 3 -------\\\\ +-----+   |     +---> Red
	sprite 4 ------------+ MUX +---+ DAC +---> Grn
	sprite 5 -------//// +-----+   |     +---> Blu
	sprite 6 -------///            +-----+
	sprite 7 -------//
	playfield ------/

So the MUX or some sort of arbitration circuit has to look up the pixel color
of sprite 0, and if it's transparent fall through to sprite 1, ..., to sprite
7 and then finally pick up the playfield data. All of this has to happen
within the 99ns the beam allows for that pixel. A common way to cheat is to
"freeze" the values and start queueing up stuff from memory when HBLANK hits.
You fall behind in fetching as the line goes on, but you started out ahead,
so the beam only just catches up to you when you hit the next HBLANK.
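
A rough software model of the fall-through described above, as a sketch
only. It assumes a pixel value of zero means "transparent" and that the
sprite list is ordered by priority with sprite 0 first; the function name
and the sample values are made up for illustration. Real hardware would do
this in parallel rather than in a loop.

```python
TRANSPARENT = 0  # assumption: a zero pixel value means "transparent"


def mux_pixel(sprite_pixels, playfield_pixel):
    """Return the first non-transparent sprite pixel, else the playfield pixel.

    sprite_pixels is ordered by priority: index 0 is sprite 0 (topmost).
    """
    for pixel in sprite_pixels:
        if pixel != TRANSPARENT:
            return pixel
    return playfield_pixel


# Hypothetical values at one beam position: sprites 0 and 1 are transparent,
# sprite 2 is the first opaque one.
sprites = [0, 0, 5, 0, 9, 0, 0, 0]
print(mux_pixel(sprites, playfield_pixel=3))   # sprite 2 wins
print(mux_pixel([0] * 8, playfield_pixel=3))   # falls through to the playfield
```

The loop makes the timing problem visible: the worst case touches all eight
sprites plus the playfield before it has an answer, and all of that has to
fit inside one pixel time.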

So an expensive way to do this might be to put each window in the "proper"
place in its own bank of VRAMs. [You might be able to multiplex windows
that didn't overlap like VSprites with clever programming.] Then you 
scan all banks of VRAM simultaneously for data. In the display unit you
simply keep a bunch of address comparators that hold the LeftEdge, TopEdge,
RightEdge, and BottomEdge values, all ANDed together so that they generate
a "1" bit when the beam is in that "window". Since the propagation time on
these comparators is pretty fast (like 10ns) we don't have to worry about
that. If you are clever and want to make them sprite like, you can put 
a "zero" detect in to AND with the comparator output and that would pull
the "we're in this window" bit down if the pixel at that location was zero.
Now you divide the pixel clock into 4 subclocks (each ~25ns in this case)
and time it like this :

	0		1		2		3
	 _______	 _______	 _______	 _______
clock	/	\_______/	\_______/	\_______/	\_______
		________________________________________________________
inwin	-------<________________________________________________________>-
				________________________________________
iszero	-----------------------<________________________________________>-
						________________________
is_top	---------------------------------------<________________________>-
							________________
valid_pixel -------------------------------------------<________________>-
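
The comparator, zero-detect, and priority steps in the diagram above can be
modelled in software roughly as follows. This is a sketch under stated
assumptions, not the actual circuit: the Window class, its vram dictionary
(a stand-in for one VRAM bank), the inclusive edge comparisons, and the
background value of 0 are all hypothetical. Windows are assumed to be
listed topmost first.

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    # Edge registers named after the post; inclusive bounds are an assumption.
    left_edge: int
    top_edge: int
    right_edge: int
    bottom_edge: int
    # Hypothetical stand-in for this window's VRAM bank: (x, y) -> pixel value.
    vram: dict = field(default_factory=dict)

    def inwin(self, x, y):
        """Four comparators ANDed together: is the beam inside this window?"""
        return (self.left_edge <= x <= self.right_edge
                and self.top_edge <= y <= self.bottom_edge)


def valid_pixel(windows, x, y, background=0):
    """Walk the phases of the timing diagram for one beam position."""
    # Phase 0: every bank is read and every comparator evaluated in parallel.
    inwin = [w.inwin(x, y) for w in windows]
    data = [w.vram.get((x, y), 0) for w in windows]
    # Phase 1: zero detect pulls the "in this window" bit down for zero pixels.
    claims = [hit and value != 0 for hit, value in zip(inwin, data)]
    # Phase 2: priority arbitration, lowest index that still claims the pixel.
    for value, claim in zip(data, claims):
        if claim:
            return value  # Phase 3: this pixel goes out toward the DACs.
    return background


front = Window(10, 10, 20, 20, {(12, 12): 7})            # (15, 15) left as zero
back = Window(0, 0, 100, 100, {(12, 12): 4, (15, 15): 4})
print(valid_pixel([front, back], 12, 12))    # front window is opaque here
print(valid_pixel([front, back], 15, 15))    # front is zero, back shows through
print(valid_pixel([front, back], 200, 200))  # outside both windows
```

Unlike a serial fall-through, every window produces its "claim" bit at the
same time here, and only the final pick among the claims is ordered. That
final pick is the part that has to settle inside one subclock.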

So if you can read my crude timing diagram, everything latches on the falling
edge of C0 (the 4X pixel clock), and the upshot is that by the rising edge of
phase 3 you can clock the "true" pixel onto the video shifter bus and 
then out to the DACs. Note that only on the falling edge of phase 2 will
you have an accurate picture of which pixel is "topmost"; this comes from an
arbitration of priorities between the falling edge of phase 1 and the
falling edge of phase 2. That means you have to arrive at the correct
priority in about 25ns. Given a setup time of 5ns and a settling time of
3 - 4ns, you have to keep those propagation times down. You can probably
do this with an XOR priority encoder scheme. Anyway, for 32 "window/screen/sprites"
you will need 32 banks of VRAM (again this will be the maximum number of windows
on a line; if you can live with fewer windows/line you could reduce that). 
Assuming 8 bit pixels (this is an improvement after all) and a 640 X 200+
screen you will need 512KB of VRAM for each window, which comes to 
16MB of VRAM for the display. That is definitely doable but it will get  
a bit expensive. Interestingly enough, on a monochrome screen you only need
2MB of VRAM, and that would make for a pretty awesome X terminal or some
such.
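
The timing and memory figures in this post can be checked with a few lines
of arithmetic. This is only a back-of-the-envelope sketch: it ignores
blanking time, takes the 512KB per-window figure as given, and assumes an
NTSC line rate of 15734 Hz (the post only gives roughly 15 kHz, but that
value reproduces the 99ns figure). All names below are made up for
illustration.

```python
def pixel_time_ns(pixels_per_line, lines_per_second):
    """Nanoseconds per pixel, ignoring blanking (real budgets are tighter)."""
    return 1e9 / (pixels_per_line * lines_per_second)


# 1K x 1K non-interlaced at 66Hz: 1024 lines per frame, 66 frames per second.
workstation_ns = pixel_time_ns(1024, 1024 * 66)

# 640 pixels per line at the NTSC line rate (assumed value, see above).
NTSC_LINE_RATE_HZ = 15_734
ntsc_ns = pixel_time_ns(640, NTSC_LINE_RATE_HZ)

# Four subclocks per pixel, then subtract the 5ns setup time and the
# worst-case 4ns settling time to see what is left for arbitration.
subclock_ns = ntsc_ns / 4
arbitration_ns = subclock_ns - 5 - 4

# VRAM: 32 banks of 512KB each, one bank per window.
BANKS = 32
BANK_KB = 512
color_mb = BANKS * BANK_KB / 1024   # 8 bits per pixel
mono_mb = color_mb / 8              # 1 bit per pixel

print(f"1K x 1K at 66Hz   : {workstation_ns:.1f} ns per pixel")
print(f"640 wide at NTSC  : {ntsc_ns:.1f} ns per pixel")
print(f"subclock          : {subclock_ns:.1f} ns")
print(f"left to arbitrate : {arbitration_ns:.1f} ns")
print(f"VRAM, 8 bit pixels: {color_mb:.0f} MB")
print(f"VRAM, monochrome  : {mono_mb:.0f} MB")
```

The arbitration line is the one to watch: once setup and settling are taken
out of a subclock, noticeably less than the nominal 25ns is left for the
priority logic itself.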

--Chuck McManis
uucp: {anywhere}!sun!cmcmanis   BIX: cmcmanis  ARPAnet: cmcmanis@sun.com
These opinions are my own and no one elses, but you knew that didn't you.
"If I were driving a Macintosh, I'd have to stop before I could turn the wheel."