Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!oliveb!sun!gorodish!guy
From: guy@gorodish.Sun.COM (Guy Harris)
Newsgroups: comp.arch
Subject: Re: Sw vs. Hw BitBlit.
Keywords: BitBlit.
Message-ID: <61783@sun.uucp>
Date: 28 Jul 88 07:07:35 GMT
References: <399@ma.diab.se> <1313@ucsfcca.ucsf.edu>
Sender: news@sun.uucp
Lines: 125

> It's an old debating trick to try to make points against the opposing
> view by mis-characterizing it and then arguing against the distorted
> image.

It's an equally old trick to make a counterfactual assertion and treat
it as an axiom....

> Every one posting on this subject advocating cpu data shuffling for
> displays has tried to fake us all out by pretending that this is
> done in word sized units all neatly aligned on word boundaries.
>
> Now let's see your software timings for real _bit_blit_ operations
> such as moving a block 37 bits wide aligned starting at bit 17 in the
> source position and starting at bit 29 in the destination position
> on a machine with 32-bit registers and data paths.

*Sigh* I don't think *anybody* claimed that all bit moving is "done in
word size units all neatly aligned on word boundaries."  *HOWEVER*:

    For applications in terminals, there are three cases of "bitblt"
    that dominate: drawing characters, scrolling windows and
    window-window operations such as exchanging off-screen data with
    the display.  These cases also cover the most common graphics
    operations on personal computers.

    Drawing a character requires decoding a fount structure to find
    the location of the character in the fount bitmap and calling
    "bitblt" to draw the character on the display.  For a general
    fount format and typical character sizes, over half the total
    time to draw a character on the Blit goes into overhead: at least
    one subroutine call and setup, opening the fount, building the
    argument list for "bitblt", calling "bitblt", and having "bitblt"
    in turn decode and clip its arguments and decide how to draw the
    image.
    Because the characters are so small -- drawing the letter 'a'
    touches 7 words of memory -- actually changing the pixels in the
    destination bitmap is relatively unimportant.  Our overhead is not
    unreasonable; the Blit draws about 2500 characters per second in
    the standard fount, which is 9 pixels (not 8) by 14.  An
    experimental version with eight-bit wide characters drawn only on
    byte boundaries, that avoided the overhead of calling "bitblt" and
    used a special fount format that was easy to decode (the current
    format is somewhat compressed for economy of memory), was only a
    factor of two faster.  This is insufficient speed-up for so great
    a loss of generality.

    The second common case of "bitblt" is scrolling a rectangular
    region of a bitmap, usually the display.  Since the word
    boundaries in the scan lines of a bitmap are at the same place in
    each line, the speed of scrolling depends primarily on the speed
    of the MC68000 instruction

        mov.l %a0@+, %a1@+

    or, in C,

        register long *p, *q;
        *p++ = *q++;

    For typical rectangles, the edges, which must be handled with
    more complicated code, do not dominate the performance.  There is
    nothing hardware can do to accelerate this loop except provide
    faster memory access.  If the display were accessed through a
    narrower or clumsier interface, it would take longer to move the
    data.

    The last common case is shuffling on- and off-screen rectangles.
    It can be made fast by a simple observation: the off-screen
    bitmaps are allocated by "balloc", which is given as argument the
    rectangle on the display occupied by the data.  This rectangle is
    assigned to "rect" in the resulting "Bitmap".  "balloc" can
    therefore allocate the bitmap so that the word boundaries occur in
    the same places in the image as they do in the display, reducing
    to the scrolling case the "bitblt" call that copies the data.
    This is the last feature of the "Bitmap" data structure:
    "Bitmap.rect" defines not only the co-ordinate system but also the
    word fragmentation; the "x" co-ordinate modulo 16 is 0 at the
    first bit of the word in every bitmap.  This results in a factor
    of two to four speed-up for window-shuffling "bitblt" operations
    and combines neatly with the way textures are generated without
    diminishing the generality of the graphics primitives.

    Of course, there is also the wide, non-aligned case of "bitblt"
    to be supported, but almost by construction it occurs rarely, and
    the memory and software are clean enough to make it acceptably
    fast when it is executed.

from "Hardware/Software Trade-offs for Bitmap Graphics on the Blit",
Rob Pike, Bart Locanthi, and John Reiser, Software-Practice and
Experience, Vol. 15(2), 131-151 (February 1985).

I tend to believe Rob Pike and company when they say that "real
_bit_blit_ operations such as moving a block 37 bits wide aligned
starting at bit 17 in the source position and starting at bit 29 in the
destination position on a machine with 32-bit registers and data paths"
are not typical (at least in the way they used Blits) except for
character painting, where overhead above and beyond the bit-pushing
dominates.  If you have evidence to indicate that this is not the case,
let's see it.

In the aforementioned paper, they also discuss timings.  They compare a
Sun-1 (with a somewhat unusual frame buffer), a Sun-2 (with a
conventional frame buffer with a BitBlt chip that acts only on the
frame buffer), and the Blit.
I don't know how much the Sun-2 with BitBlt chip resembles the
"hardware BitBlt" support that has been discussed here, but here are
the figures (minus those for the atypical Sun-1 frame buffer); all
timings are in milliseconds:

                                 Sun-2         Sun-2
                                 (display,     (memory,
    Operation                    w/BB chip)    no BB chip)    Blit

    Scroll screen vertically     109           82.2           129
    Scroll screen horizontally   110           311            376
    Letter 'a' at random
      positions on the screen    0.34          0.74           0.42
    Texturing a random 40x40
      square                     0.82          1.78           1.60

    "The characters were drawn in a 9x14 pixel fount, but the
    bounding box for the letter 'a' is only 8x7.  Both systems used
    "bitblt" to draw characters, rather than special purpose
    primitives, and executed clipping code."  (from the article)

So it appears that an 8 MHz 68000 (Blit) can compete reasonably well
with a 10 MHz 68010 (Sun-2), even with the assistance of the Sun-2's
BitBlt chip.  I don't know why the Sun-2 scrolled vertically
memory-to-memory *faster* than it did display-to-display.

If a BitBlt chip is reasonably cheap, and can do the whole job, it may
be worth it.  Note that in the cases shown, the Sun-2 with the BitBlt
chip got at most a 3.5x speedup over the Blit (scroll screen
horizontally).  For vertical scrolling, you got only 1.18x; for
randomly drawing the letter 'a', you got only 1.23x; and for texturing
a random 40x40 square, you got 1.95x.  How cheap does it have to be for
that to be worth it?  (The "do the whole job" comes from comments made
in the paper that a half-hearted hardware assist can get in the way,
rather than help.)
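
For anyone who wants to check the ratios quoted above, here is a
minimal sketch in C that recomputes them from the table.  The timings
are copied from the figures given in this article; the struct, the
array, and the function names (timings, speedup_vs_blit,
speedup_vs_sun2_memory, print_speedups) are made up for this
illustration and come from neither the paper nor any Sun or Blit
software.

```c
#include <stddef.h>
#include <stdio.h>

/* One row of the timing table; all times are in milliseconds. */
struct timing {
    const char *operation;
    double sun2_bb;   /* Sun-2, display, with BitBlt chip */
    double sun2_mem;  /* Sun-2, memory, no BitBlt chip    */
    double blit;      /* Blit                             */
};

/* Figures as quoted in the article (Sun-1 column omitted there too). */
const struct timing timings[] = {
    { "Scroll screen vertically",     109.0,  82.2, 129.0  },
    { "Scroll screen horizontally",   110.0, 311.0, 376.0  },
    { "Letter 'a' at random positions", 0.34,  0.74,  0.42 },
    { "Texturing a random 40x40 square", 0.82, 1.78,  1.60 },
};

#define NTIMINGS (sizeof timings / sizeof timings[0])

/* How many times faster the Sun-2 with BB chip is than the Blit. */
double speedup_vs_blit(size_t i)
{
    return timings[i].blit / timings[i].sun2_bb;
}

/* Same comparison against the Sun-2 working memory-to-memory.
 * A result below 1.0 means the BB-chip case was the slower one. */
double speedup_vs_sun2_memory(size_t i)
{
    return timings[i].sun2_mem / timings[i].sun2_bb;
}

/* Print one line per operation with both ratios. */
void print_speedups(void)
{
    size_t i;

    for (i = 0; i < NTIMINGS; i++)
        printf("%-34s vs. Blit %.2fx, vs. Sun-2 memory %.2fx\n",
               timings[i].operation,
               speedup_vs_blit(i),
               speedup_vs_sun2_memory(i));
}
```

Calling print_speedups() from a main() prints one line per operation.
The speedups quoted in the text are the "vs. Blit" column; the other
column is included only because the vertical-scroll row is the odd one
out, with the memory-to-memory case faster than the BitBlt chip.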