Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!uakari.primate.wisc.edu!ginosko!cs.utexas.edu!uunet!twwells!bill From: bill@twwells.com (T. William Wells) Newsgroups: comp.lang.c Subject: Storing substrings: some real C code in comp.lang.c! Message-ID: <1989Oct25.091519.10718@twwells.com> Date: 25 Oct 89 09:15:19 GMT Organization: None, Ft. Lauderdale, FL Lines: 150 I have been beating my head against the wall, looking for speed, *speed*, and MORE SPEED in this program I'm working on. At only 800 WPM (on my 16MHz '386) it is not exactly a speed demon. Especially since it has to be run on large lists of words, none of which are smaller than 80,000 words. I'm not going to say exactly what the code is for (confidentiality and all that), but the immediate problem I'm trying to solve is storing all the substrings of the input words, up to some run time specified length limit, so that I can get to them *very* quickly. I also don't have lots of memory. Well, actually, I do, but I'm spending it on a *much* bigger data structure: I'm storing information on *pairs* of these substrings.... Originally, I had a trie (urk, what a word!), but that proved too slow. What I have now, and am presenting here, is a variation on it. Here's what I did. I still have a tree, of sorts, but the only pointers in it are parent pointers. The tree is embedded in a hash table. The elements of the hash table contain a parent pointer and a character. To find a string, hash the string and check each element of the table from that hash point, looking up the parent pointers to find the other characters in the string. This code, BTW, takes 60% of the time of the old trie routine. Since it, now, takes a third of my program's time, you can see that I'm still likely to profit by speeding it up. You might be wondering why I don't just store the strings in the table and be done with it. Well, first, storing the strings means that each hash entry either needs some dynamic allocation (expensive!) or a fixed string area, imposing a fixed upper bound on the string length (undesirable). Moreoever, the nature of the problem is such that, if a string is stored in the table, so is every prefix of the string. And the rest of the program depends on being able to find a prefix of a string very quickly. This code is a bit rough, since I'm likely to tinker further with it to get the speed I need, and it shows some of its ancestry, but I thought some people might be interested. Also, I've just picked it out of the program; that means it ought to work, but, on the other hand, I could have picked it incorrectly. And, maybe, we'll get to talk about some *C*, instead of the fever dreams that have been infesting this group lately. :-( Speaking of which, if anyone has suggestions for speeding this up, besides loop unrolling which is what I'm going to try next, I'd appreciate hearing about them. Better algorithms would especially be appreciated. And, oh yes, assembly is not an option. It has to run on my '386 and three different Sun platforms. And whatever else the company might buy next year. #define ARCSIZE 16411 /* size of hash table for arcs, prime */ typedef struct ARC { struct ARC *ar_parent; /* parent of this arc */ short ar_label; /* character labeling the arc */ } ARC; ARC Arc[ARCSIZE + 1]; /* hash table for arcs */ long Arc_call; /* number of calls to add to the arcs */ long Arc_cnt; /* number of arcs assigned */ long Arc_miss; /* number of skipped arc entries */ /* This routine stores a string in the arc table. It returns the arc of the last character in the string. */ ARC * store_string(str, ep) char *str; /* start of the string to add */ char *ep; /* end of the string */ { register ARC *ap1; register ARC *ap2; register char *ptr; unsigned long hash; /* A null string is represented by a null pointer. */ if (str == ep) { return (0); } /* Compute the hash code for the string. */ hash = 0; for (ptr = str; ptr < ep; ++ptr) { hash += (hash << 5) + *ptr + 1; } /* Search till we have a null entry or have found a match. We scan first from the hash position to the end of the table; note the use of the nul character at the end of the table as a sentinel. */ ++Arc_call; for (ap1 = &Arc[hash % ARCSIZE]; ap1->ar_label; ++ap1) { ap2 = ap1; ptr = ep; while (ap2->ar_label == *--ptr) { if (!(ap2 = ap2->ar_parent)) { if (ptr == str) { return (ap1); } break; } else if (ptr == str) { break; } } ++Arc_miss; } /* If we hit the sentinel, scan from the start of the table. */ if (ap1 == Arc + ARCSIZE) { for (ap1 = Arc; ap1->ar_label; ++ap1) { ap2 = ap1; ptr = ep; while (ap2->ar_label == *--ptr) { if (!(ap2 = ap2->ar_parent)) { if (ptr == str) { return (ap1); } break; } else if (ptr == str) { break; } } ++Arc_miss; } } /* If we got here, the string is not in the table. We also have an empty entry which we can use. So, we use that and then call ourselves recursively to put the prefix of this string into the table. Note the kludge with the two assignments of ar_label. The point is to assign the entry but to make it have an illegal value so that the matching code doesn't accidentally succeed. */ if (Arc_cnt++ == ARCSIZE) { fflush(stdout); fprintf(stderr, "arc table overflow\n"); exit(1); } ap1->ar_label = -1; ap1->ar_parent = store_string(str, --ep); ap1->ar_label = *ep; return (ap1); } --- Bill { uunet | novavax | ankh | sunvice } !twwells!bill bill@twwells.com