Path: utzoo!attcan!utgpu!jarvis.csri.toronto.edu!mailrus!uwm.edu!uakari.primate.wisc.edu!ginosko!cs.utexas.edu!uunet!twwells!bill
From: bill@twwells.com (T. William Wells)
Newsgroups: comp.lang.c
Subject: Storing substrings: some real C code in comp.lang.c!
Message-ID: <1989Oct25.091519.10718@twwells.com>
Date: 25 Oct 89 09:15:19 GMT
Organization: None, Ft. Lauderdale, FL
Lines: 150

I have been beating my head against the wall, looking for speed,
*speed*, and MORE SPEED in this program I'm working on. At only
800 WPM (on my 16MHz '386) it is not exactly a speed demon.
Especially since it has to be run on large lists of words, none of
which are smaller than 80,000 words.

I'm not going to say exactly what the code is for (confidentiality
and all that), but the immediate problem I'm trying to solve is
storing all the substrings of the input words, up to some run time
specified length limit, so that I can get to them *very* quickly.
I also don't have lots of memory. Well, actually, I do, but I'm
spending it on a *much* bigger data structure: I'm storing
information on *pairs* of these substrings....

Originally, I had a trie (urk, what a word!), but that proved too
slow. What I have now, and am presenting here, is a variation on
it.

Here's what I did. I still have a tree, of sorts, but the only
pointers in it are parent pointers. The tree is embedded in a hash
table. The elements of the hash table contain a parent pointer and
a character. To find a string, hash the string and check each
element of the table from that hash point, looking up the parent
pointers to find the other characters in the string.

This code, BTW, takes 60% of the time of the old trie routine.
Since it, now, takes a third of my program's time, you can see
that I'm still likely to profit by speeding it up.

You might be wondering why I don't just store the strings in the
table and be done with it. Well, first, storing the strings means
that each hash entry either needs some dynamic allocation
(expensive!) or a fixed string area, imposing a fixed upper bound
on the string length (undesirable). Moreoever, the nature of the
problem is such that, if a string is stored in the table, so is
every prefix of the string. And the rest of the program depends on
being able to find a prefix of a string very quickly.

This code is a bit rough, since I'm likely to tinker further with
it to get the speed I need, and it shows some of its ancestry, but
I thought some people might be interested. Also, I've just picked
it out of the program; that means it ought to work, but, on the
other hand, I could have picked it incorrectly.

And, maybe, we'll get to talk about some *C*, instead of the fever
dreams that have been infesting this group lately. :-(

Speaking of which, if anyone has suggestions for speeding this up,
besides loop unrolling which is what I'm going to try next, I'd
appreciate hearing about them. Better algorithms would especially
be appreciated. And, oh yes, assembly is not an option. It has to
run on my '386 and three different Sun platforms. And whatever
else the company might buy next year.

#define ARCSIZE         16411   /* size of hash table for arcs, prime */

typedef struct ARC {
	struct ARC *ar_parent;  /* parent of this arc */
	short   ar_label;       /* character labeling the arc */
} ARC;

ARC     Arc[ARCSIZE + 1];       /* hash table for arcs */
long    Arc_call;               /* number of calls to add to the arcs */
long    Arc_cnt;                /* number of arcs assigned */
long    Arc_miss;               /* number of skipped arc entries */

/* This routine stores a string in the arc table. It returns the arc of the
   last character in the string. */

ARC *
store_string(str, ep)
char    *str;                   /* start of the string to add */
char    *ep;                    /* end of the string */
{
	register ARC *ap1;
	register ARC *ap2;
	register char *ptr;
	unsigned long hash;

	/* A null string is represented by a null pointer. */

	if (str == ep) {
		return (0);
	}
	/* Compute the hash code for the string. */

	hash = 0;
	for (ptr = str; ptr < ep; ++ptr) {
		hash += (hash << 5) + *ptr + 1;
	}
	/* Search till we have a null entry or have found a match. We scan
	   first from the hash position to the end of the table; note the use
	   of the nul character at the end of the table as a sentinel. */

	++Arc_call;
	for (ap1 = &Arc[hash % ARCSIZE]; ap1->ar_label; ++ap1) {
		ap2 = ap1;
		ptr = ep;
		while (ap2->ar_label == *--ptr) {
			if (!(ap2 = ap2->ar_parent)) {
				if (ptr == str) {
					return (ap1);
				}
				break;
			} else if (ptr == str) {
				break;
			}
		}
		++Arc_miss;
	}
	/* If we hit the sentinel, scan from the start of the table. */

	if (ap1 == Arc + ARCSIZE) {
		for (ap1 = Arc; ap1->ar_label; ++ap1) {
			ap2 = ap1;
			ptr = ep;
			while (ap2->ar_label == *--ptr) {
				if (!(ap2 = ap2->ar_parent)) {
					if (ptr == str) {
						return (ap1);
					}
					break;
				} else if (ptr == str) {
					break;
				}
			}
			++Arc_miss;
		}
	}
	/* If we got here, the string is not in the table. We also have an
	   empty entry which we can use. So, we use that and then call
	   ourselves recursively to put the prefix of this string into the
	   table. Note the kludge with the two assignments of ar_label. The
	   point is to assign the entry but to make it have an illegal value
	   so that the matching code doesn't accidentally succeed. */

	if (Arc_cnt++ == ARCSIZE) {
		fflush(stdout);
		fprintf(stderr, "arc table overflow\n");
		exit(1);
	}
	ap1->ar_label = -1;
	ap1->ar_parent = store_string(str, --ep);
	ap1->ar_label = *ep;
	return (ap1);
}

---
Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com