Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!usc!rutgers!cmcl2!lanl!cochiti.lanl.gov!jlg From: jlg@cochiti.lanl.gov (Jim Giles) Newsgroups: comp.arch Subject: Re: new instructions Message-ID: <24216@lanl.gov> Date: 22 May 91 16:38:46 GMT References: <1991May22.001620.751@craycos.com> <1991May23.084258.5062@kithrup.COM> Sender: news@lanl.gov Organization: Los Alamos National Laboratory Lines: 32 In article <1991May23.084258.5062@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes: |> [...] |> You can do pop count quite quickly with table lookup. Yes, you have to |> break it up, but, on a byte-addressable machine, it's not a major hassle. |> You will need a 256-byte table, of course, but that, also, isn't a major |> hassle. CLZ I can also do somewhat quickly with table lookup on a |> byte-addressable machine. Now, let's see. A pop count instruction is 4 clocks on an X/MP. _ONE_ memory access is 14 (assuming no bank conflicts). So, to do a pop count on just one word (64 bits) using a 256 element table would take 8 memory references (start them together so they pipeline - still 16 clocks to issue all those references), plus the time taken to break the word into bytes, plus the time to add the results of the loads. I'd guess a minimum of 40 clocks total. Does this count as "quite quickly"? Leading zero count is a 3 clock instruction. I doubt that could be implemented in fewer clocks than the pop count if you did it with a table. Now, slower CPUs (for which the memory would seem _relatively_ faster) would take fewer clocks to load the table entries. If the table _happened_ to be in the cache (for machines that _have_ a cache) the memory fetches are sometimes only a couple of clocks. Even so, on such a slower machine, the hardware pop count and lead zero instructions could also be implemented as _relatively_ faster instructions - one clock each wouldn't be a surprise. It would amaze me to find any machine (on which the test could be done) where a table lookup came within an order of magnitude of a hardware instruction on these functions. J. Giles