Path: utzoo!utgpu!jarvis.csri.toronto.edu!cs.utexas.edu!usc!brutus.cs.uiuc.edu!uakari.primate.wisc.edu!ames!amdcad!crackle!mikep
From: mikep@crackle.amd.com (Mike Parker)
Newsgroups: comp.lang.postscript
Subject: Re: FPU for PostScript
Message-ID: <28739@amdcad.AMD.COM>
Date: 11 Jan 90 03:39:53 GMT
References: <1666@intercon.com> <17582@rpp386.cactus.org> <1990Jan9.044252.18617@ico.isc.com> <1990Jan9.170357.25101@ux1.cso.uiuc.edu> <1990Jan9.182332.8554@cs.rochester.edu> <1703@intercon.com>
Sender: news@amdcad.AMD.COM
Reply-To: mikep@crackle.AMD.COM (Mike Parker)
Organization: Advanced Micro Devices, Inc. Sunnyvale CA
Lines: 94

I'll just remove all attributions for fear of getting them all wrong...

|
| > Experience with a pure software implementation
| > of PostScript (of which the LaserWriter is a good example) gives us an
| > understanding of what parts of the implementation would benefit most
| > from hardware support.
|
| Another thing that I would imagine it's good for is that by running
| the implementation on something like a UNIX box, you can profile it
| and actually look at where the time goes. This is critical for finding
| out what will actually make the most difference when you speed it up.

There are far too many other first order effects. I've spent a lot of
time trying to get a handle on where PS spends its processing time and
I do have some hard data (was it Dick Dunn that wanted numbers?).

First, the processor makes a big difference. I work for AMD so I really
understand the Am29000 much better than others, but it is clear that an
external shifter might not help the Am29000 as much as say the 68000.
Case in point, there are a lot of bit-blt accelerators available for
68000 (like Cirrus chip) but our bit-blt code for Am29000 is completely
memory bound, the only external hardware that would help is a faster
memory system.

The memory system is another key factor.
One example: On one particular board, the Am29000 running the Phoenix
clone with the Am29027 FPU is 46x a LaserWriter Plus, while without the
FPU it is 30x the Plus. But it would be all wrong to say that the FPU
gives a 50% boost to performance, because we have other boards where
the boost is much larger and others where it is much smaller.

Choice of software is also a key contributor. Another clone, Pipeline
Associates, goes from 5.9x the NTX without the FPU to 10.2x the NTX
with the FPU on the same board as the previous Phoenix numbers. So it
would appear that Pipeline is more FP dependent than Phoenix. I'm told
by people who probably do not know that Adobe is very FP independent,
so maybe they'd see less of a hit. Real soon now I'll be able to quote
similar numbers for Bauer/uSoft.

Further evidence that the problem is SW vendor dependent is that the
Pipeline people worked long and hard to improve performance with and
without the FPU and were able to make very large differences and to
close the gap significantly. In particular we found that basic add,
sub, mul, div were not nearly the culprits that a certain few
transcendentals were. Hand coded transcendental routines from Pipeline
made huge performance differences in the non-FPU case for some files.
Pipeline already had the advantage of a pure integer font rendering
mechanism (Nimbus-Q); they changed their Bezier solver to pure integer
as well as a few other key routines. It was a lot of work and many less
caring clone vendors haven't done the exercise. Being older and bigger,
it stands to reason that Adobe has worked pretty hard on this.

So profiling on a LaserWriter, or worse yet a UNIX box which might have
a memory system very unlike a printer, isn't really going to give data
applicable to PostScript printers as a whole.

|
| > (1) Low-level raster manipulations, principally painting character
| > bitmaps and filling trapezoids located at arbitrary bit boundaries.
| > For typical pages, this activity dominates everything else if all
| > characters are already in the font cache.
|
| This sounds like a good candidate for hardware. The experience of some
| of the PostScript clone controllers seems to show that a TI 34010 graphics
| processor or an AMD 29000 can significantly improve the interpreter's
| ability to lug bits around.

Thanks for the plug (all others flame me when the advertising content
exceeds the hard data).

I'm not so sure that the low-level raster stuff dominates. I've been
told that the split is nearly 50/50. I tend to agree with the earlier
poster who said that it varies greatly for different pages. But I have
hard evidence that it isn't all that low in the case cited where the
page is all text and all hits in font cache.

We have a 9 page pure text document that we have run on all sorts of
configurations. I believe that both Pipeline and Phoenix use the same
blit code (supplied by AMD) and yet they get very different results.
The first few pages show large differences due to differences in
character rendering time for font cache misses, and you can see the
time per page curve down exponentially to an asymptote at about the
fourth page. At the asymptote, the Pipeline code runs at roughly 0.5
seconds per page on the same exact hardware where the Phoenix code runs
at about 0.75 seconds per page. I can't see where any of the difference
is anything but "interpretation" (as opposed to raster file
manipulation).

I have a plan and would like some input on its validity. We'll take the
same exact hardware except we'll change the serializer crystal so we
can run at 400 dpi and we'll tell the code to run at 400 dpi. We'll run
a variety of pages at both resolutions. It seems like some simple
algebra will then give us the interpretation/raster split. We'll have
87% more pixels, so if we take 20% longer to run a file then raster
processing time must be 20/87 or 23% of the total task.
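
A minimal sketch of the "simple algebra" in the plan, under the
assumption that total time is interpretation time (independent of
resolution) plus raster time (scaling linearly with pixel count). The
function names are hypothetical and not from the post. If the baseline
run takes T = I + R and the higher-resolution run takes I + R*(1 + p),
then a fractional slowdown t implies R/T = t/p. The 300 dpi baseline in
the second call is also an assumption; the post does not state the
starting resolution.

```python
def raster_fraction(extra_time, extra_pixels):
    """Share of the baseline run time spent on raster work.

    extra_time   -- fractional slowdown at the higher resolution (0.20 = 20%)
    extra_pixels -- fractional increase in pixel count (0.87 = 87%)

    Assumes interpretation time does not change with resolution and
    raster time grows in proportion to the number of pixels.
    """
    if extra_pixels <= 0:
        raise ValueError("extra_pixels must be positive")
    return extra_time / extra_pixels


def pixel_increase(dpi_low, dpi_high):
    """Fractional increase in pixels per page when both axes scale."""
    return (dpi_high / dpi_low) ** 2 - 1.0


# Figures quoted in the post: 20% longer with 87% more pixels.
print(round(raster_fraction(0.20, 0.87), 2))

# Assumed 300 dpi baseline going to 400 dpi.
print(round(pixel_increase(300, 400), 3))
```

With the post's figures the raster share comes out near 0.23. Under the
assumed 300 dpi baseline the pixel increase is about 0.778, which would
put the same 20% slowdown at a raster share of roughly 0.26. The model
ignores anything that scales with resolution but is not raster work,
and any raster cost that grows faster than linearly (memory bandwidth,
for example), so the result is an estimate at best.
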
If enough of you say that the experiment is valid, I'll run it,
otherwise I'll run it and just not tell anybody.

Please blame all gross spelling errors on a noisy line...

mikep

Mike Parker
mikep@amdcad.AMD.COM