Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uflorida!novavax!twwells!bill From: bill@twwells.com (T. William Wells) Newsgroups: comp.lang.c Subject: Re: Storing substrings: some real C code in comp.lang.c! Message-ID: <1989Oct29.175536.9344@twwells.com> Date: 29 Oct 89 17:55:36 GMT References: <1989Oct25.091519.10718@twwells.com> <6871@hubcap.clemson.edu> Organization: None, Ft. Lauderdale, FL Lines: 59 In article <6871@hubcap.clemson.edu> hutch%citron.cs.clemson.edu@hubcap.clemson.edu writes: : From article <1989Oct25.091519.10718@twwells.com>, by bill@twwells.com (T. William Wells): : > I have been beating my head against the wall, looking for speed, : > *speed*, and MORE SPEED in this program I'm working on. At only : > 800 WPM (on my 16MHz '386) it is not exactly a speed demon. : > Especially since it has to be run on large lists of words, none of : > which are smaller than 80,000 words. : > : > Bill { uunet | novavax | ankh | sunvice } !twwells!bill : > bill@twwells.com : : First, forget the tree, and the hash. What you need is a Finite State : Automaton. The basic idea for this can be found in the May 1988 : issue of CACM in an article on an implementation of SCRABBLE(tm). My first implementation (a trie) is exactly what an FSA would generate as output. The major cost in it is searching each level of the tree. Knuth shows a way around that; it has been implemented in the patgen program that came (comes?) with TeX. There was also that Programming Pearls article in CACM. (Sorry, exact references are not handy.) However, those have the drawback that a test of each level of the tree is a bit more complex than in mine. Also, the code to insert into those trees is nontrivial. And slow. Yes, I've used these algorithms. : I've implemented a version of this and I can build the FSA in under : a minute (for about 50000 words) on a sun-3 and under 5 minutes on : a 80286. In a recent test, my current implementation has the routine called 454,094 times, 19,194 of which actually add new strings to the table. It takes 24 seconds on my '386, which is about .9 VAX MIPS. I think that beats your FSA. BTW, for those interested, the hashing line in my original code should be replaced with: hash += (hash << 6) + *ptr; The hit rate goes down by a factor of 6! : If possible, you should build the FSA for the words in : a separate processing step since it does take a bit of memory. Not possible, since the strings have to be generated on the fly. To do it otherwise would require generating a *huge* temp file. : PS: Since you are doing something confidential, I assume it is : for *business*. Not a good assumption, since it could be research someone doesn't want yet bruited about, or for the government, etc., but in this case, you are right. --- Bill { uunet | novavax | ankh | sunvice } !twwells!bill bill@twwells.com