Xref: utzoo comp.graphics:3936 comp.windows.x:6832
Path: utzoo!utgpu!watmath!onfcanim!dave
From: dave@onfcanim.UUCP (Dave Martindale)
Newsgroups: comp.graphics,comp.windows.x
Subject: Re: Luminance from RGB
Message-ID: <16960@onfcanim.UUCP>
Date: 20 Dec 88 16:00:32 GMT
References: <8241@pasteur.Berkeley.EDU> <518@midgard.Midgard.MN.ORG> <16929@onfcanim.UUCP> <7086@venera.isi.edu>
Reply-To: dave@onfcanim.UUCP (Dave Martindale)
Organization: National Film Board / Office national du film, Montreal
Lines: 46

In article <7086@venera.isi.edu> raveling@vaxb.isi.edu (Paul Raveling) writes:
>
>>But it's faster to *implement* almost any function using table lookup.
>
>	This is true for relatively complex functions, but not usually
>	for those that break down easily to simple operations such as
>	shifts and adds.  I've measured speed improvements up to a
>	factor of 14 over ordinary C code in the most extreme case
>	by moving a critical algorithm to assembly language and using
>	this sort of shifty logic.

Is that a factor of 14 speed improvement over table lookup?  Or a
factor of 14 improvement over code that contained multiply or divide in
the inner loop?  A comparison against table lookup is what matters.
What processor was this on?

For shift/add to be faster than table lookup, the time required to do
the shifts and adds must be less than that required to do the table
addressing calculations and the extra memory fetch.  (Note that the
table word will usually be in the cache on machines that have a cache.)

On machines like the VAX, where the table addressing calculations can
be done as part of a "move" instruction but the shifts and adds are
done as separate instructions (and shift is very slow on some models),
table lookup is going to be faster.  On another machine where the table
addressing requires several instructions but a barrel shifter is
available, the shift method will likely be faster.  You have to try
both, or have a good knowledge of the particular model of the
particular architecture of processor you are using, to determine which
is faster.

However, table lookup has some additional benefits.  It is simple
enough that carefully-written C code is likely to generate the best
possible assembler code, so there is no need to use assembler.  A
single copy of the lookup code functions for all possible input and
output widths, as long as the pixels are always stored in words of a
constant width.

In contrast, the shift/add code requires different sequences of
instructions for different input and output pixel bit widths, since the
output may require adding 1 or 2 or 3 or more copies of the input,
shifted by various amounts.  Either you must pre-compile all possible
variations that you could ever need, or compile code during execution,
or just have a general-purpose algorithm that needs tests and branches
within the pixel lookup loop (bye-bye performance).  How do you deal
with this?