Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!usc!brutus.cs.uiuc.edu!uakari.primate.wisc.edu!ames!amdcad!crackle!mikep
From: mikep@crackle.amd.com (Mike Parker)
Newsgroups: comp.lang.postscript
Subject: Re: FPU for PostScript
Message-ID: <28739@amdcad.AMD.COM>
Date: 11 Jan 90 03:39:53 GMT
References: <POLLACK.89Dec4133944@toto.cis.ohio-state.edu> <1666@intercon.com> <17582@rpp386.cactus.org> <1990Jan9.044252.18617@ico.isc.com> <1990Jan9.170357.25101@ux1.cso.uiuc.edu> <1990Jan9.182332.8554@cs.rochester.edu> <1703@intercon.com>
Sender: news@amdcad.AMD.COM
Reply-To: mikep@crackle.AMD.COM (Mike Parker)
Organization: Advanced Micro Devices, Inc. Sunnyvale CA
Lines: 94

I'll just remove all attributions for fear of getting them all wrong...
| 
| > Experience with a pure software implementation
| > of PostScript (of which the LaserWriter is a good example) gives us an
| > understanding of what parts of the implementation would benefit most
| > from hardware support.
| 
| Another thing that I would imagine it's good for is that by running
| the implementation on something like a UNIX box, you can profile it
| and actually look at where the time goes.  This is critical for finding
| out what will actually make the most difference when you speed it up.

There are far too many other first order effects.  I've spent a lot of
time trying to get a handle on where PS spends its processing time and
I do have some hard data (was it Dick Dunn that wanted numbers?).  First,
the processor makes a big difference.  I work for AMD so I really understand
the Am29000 much better than others, but it is clear that an external
shifter might not help the Am29000 as much as say the 68000.  Case in point,
there are a lot of bit-blt accelerators available for the 68000 (like the Cirrus
chip), but our bit-blt code for the Am29000 is completely memory bound; the only
external hardware that would help is a faster memory system.

The memory system is another key factor.  One example:  On one particular
board, the Am29000 running the Phoenix clone with the Am29027 FPU is 46x a
LaserWriter Plus, while without the FPU it is 30x the Plus.  But it would be all
wrong to say that the FPU gives a 50% boost to performance, because we have
other boards where the boost is much larger and others where it is much smaller.

Choice of software is also a key contributor.  Another clone, Pipeline
Associates, goes from 5.9x the NTX without the FPU to 10.2x the NTX with the
FPU on the same board as the previous Phoenix numbers.  So it would appear that
Pipeline is more FP dependent than Phoenix.  I'm told by people who probably 
do not know that Adobe is very FP independent, so maybe they'd see less of
a hit.  Real soon now I'll be able to quote similar numbers for Bauer/uSoft.
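
As a rough illustration of the arithmetic behind those figures, here is a
minimal Python sketch.  The function and dictionary names are made up for this
example; only the four speed multiples come from the numbers quoted above.

```python
# Relative speeds quoted above, as multiples of each reference printer.
# Names and layout are illustrative only.
results = {
    "Phoenix (vs. LaserWriter Plus)": {"no_fpu": 30.0, "fpu": 46.0},
    "Pipeline (vs. NTX)": {"no_fpu": 5.9, "fpu": 10.2},
}


def fpu_boost(no_fpu, fpu):
    """Fractional speedup from adding the FPU (0.5 means 50% faster)."""
    return fpu / no_fpu - 1.0


for name, r in results.items():
    boost = fpu_boost(r["no_fpu"], r["fpu"])
    print(f"{name}: {boost:.0%} faster with the FPU")
```

Each ratio is taken between two runs measured against the same reference
printer, so the two boosts can be compared even though the baselines differ.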

Further evidence that the problem is SW vendor dependent is that the
Pipeline people worked long and hard to improve performance with and
without the FPU and were able to make very large differences and to
close the gap significantly.  In particular we found that basic add,
sub, mul, div were not nearly the culprits that a certain few transcendentals
were.  Hand coded transcendental routines from Pipeline made huge
performance differences in the non-FPU case for some files.  Pipeline
already had the advantage of a pure integer font rendering mechanism
(Nimbus-Q); they changed their Bezier solver to pure integer as well
as a few other key routines.  It was a lot of work and many less caring clone
vendors haven't done the exercise.  Being older and bigger, it stands to
reason that Adobe has worked pretty hard on this.

So profiling on a LaserWriter, or worse yet a UNIX box, which might have
a memory system very unlike a printer's, isn't really going to give data
applicable to PostScript printers as a whole.
| 
| >   (1) Low-level raster manipulations, principally painting character
| > bitmaps and filling trapezoids located at arbitrary bit boundaries.
| > For typical pages, this activity dominates everything else if all
| > characters are already in the font cache.
| 
| This sounds like a good candidate for hardware.  The experience of some
| of the PostScript clone controllers seems to show that a TI 34010 graphics
| processor or an AMD 29000 can significantly improve the interpreter's
| ability to lug bits around.

Thanks for the plug (all others flame me when the advertising content
exceeds the hard data).  I'm not so sure that the low-level raster
stuff dominates.  I've been told that the split is nearly 50/50.  I tend to
agree with the earlier poster who said that it varies greatly for different 
pages.  But I have hard evidence that the interpretation share isn't all that
low even in the case cited, where the page is all text and all hits in the font
cache.  We have
a 9 page pure text document that we have run on all sorts of configurations.
I believe that both Pipeline and Phoenix use the same blit code (supplied
by AMD) and yet they get very different results.  The first few pages show
large differences due to differences in character rendering time
for font cache misses, and you can see the time per page curve down
exponentially to an asymptote at about the fourth page.  At the asymptote,
the Pipeline runs at roughly 0.5 seconds per page on the same exact
hardware where the Phoenix code runs at about 0.75 seconds per page.  I
can't see where any of the difference is anything but "interpretation"
(as opposed to raster file manipulation).
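
The steady-state figures above put a floor on the non-raster share.  A minimal
Python sketch, assuming (as the post believes but does not confirm) that both
clones spend identical time in the shared blit code on these pages; the
function name is hypothetical:

```python
def min_non_raster_share(slow_s, fast_s):
    """Lower bound on the slower run's non-raster share of page time.

    If both runs do identical raster work, raster time cannot exceed the
    faster run's total, so the whole gap must be non-raster time.
    """
    return (slow_s - fast_s) / slow_s


# Seconds per page at the asymptote, from the figures above.
share = min_non_raster_share(slow_s=0.75, fast_s=0.5)
print(f"Phoenix spends at least {share:.0%} of each page outside the blit code")
```

This is only a lower bound: the faster run also spends some unknown part of
its 0.5 seconds on interpretation.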

I have a plan and would like some input on its validity.  We'll take the
same exact hardware except we'll change the serializer crystal so we can
run at 400 dpi and we'll tell the code to run at 400 dpi.  We'll run
a variety of pages at both resolutions.  It seems like some simple algebra
will then give us the interpretation/raster split.  Going from 300 dpi, we'll
have about 78% more pixels, so if we take 20% longer to run a file then raster
processing time must be 20/78 or about 26% of the total task (assuming raster
time scales with pixel count).  If enough of you say that the
experiment is valid, I'll run it, otherwise I'll run it and just not tell
anybody.
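
The "simple algebra" can be sketched in Python as follows.  This is an
illustration under stated assumptions, not a measured result: the 300 dpi
baseline is assumed (the post names only the 400 dpi target), raster time is
taken to scale linearly with pixel count, and interpretation time is taken to
be independent of resolution.  The function name is hypothetical.

```python
def raster_fraction(dpi_low, dpi_high, slowdown):
    """Estimate the raster share of total run time at the lower resolution.

    Model: T = I + R at dpi_low, and T * (1 + slowdown) = I + k * R at
    dpi_high, where k = (dpi_high / dpi_low) ** 2 is the pixel-count ratio.
    Subtracting gives (k - 1) * R = slowdown * T, so R / T = slowdown / (k - 1).
    """
    pixel_growth = (dpi_high / dpi_low) ** 2 - 1.0
    return slowdown / pixel_growth


# Hypothetical outcome: a file takes 20% longer at 400 dpi than at 300 dpi.
frac = raster_fraction(dpi_low=300, dpi_high=400, slowdown=0.20)
print(f"raster share of total time: {frac:.0%}")
```

Under this model (400/300)^2 works out to roughly 78% more pixels.  If
interpretation cost also rises with resolution, for example through larger
cached character bitmaps, the estimate would overstate the raster share.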


Please blame all gross spelling errors on a noisy line...

mikep
Mike Parker
mikep@amdcad.AMD.COM