Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!bu.edu!m2c!jjmhome!smds!rh From: rh@smds.UUCP (Richard Harter) Newsgroups: comp.lang.c Subject: Re: Efficient STRing CoMPares? Summary: String objects for fun and profit Message-ID: <352@smds.UUCP> Date: 21 Mar 91 08:11:03 GMT References: <1193@caslon.cs.arizona.edu> <15496@smoke.brl.mil> <11111@dog.ee.lbl.gov> Organization: SMDS Inc., Concord, MA Lines: 47 In article <11111@dog.ee.lbl.gov>, torek@elf.ee.lbl.gov (Chris Torek) writes: > In article <15510@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes: > >... in the vast majority of applications the strings being > >compared, even in cases where they match, are contained in different > >storage locations. E.g. > > if ( StrEq( buffer, "EOF" ) ) ... > Of course, you can build yourself a `string pool' system, in which each > distinct string appears exactly once, and then two strings match iff their > pointers match . . . but this merely offloads the `compare for truly > equal' into the string pool system. Just so. In one of our programs where there was a great deal of manipulation of strings we did a string "object" package (C not C++). For each unique string under management there is a struct which looks something like struct string_object { char *c; /* The string itself */ int n; /* Its length */ int rc; /* Reference count */ int sz; /* Allocated space for string */ struct string_object *link; /* Data access */ }; The pool used a hash table with short buckets (the link field). In the case in question there are many more references to existing strings (and multiple instances) then there are references to new strings so using pointers into the pool is a big win. The only object primitives are add_string and delete_string. Delete_string decrements a reference count and does the appropriate when the count goes to zero. Add_string has to do the compare, which is more expensive than a simple compare because it has to compute the hash index (which only needs to be very quick and dirty). If the string exists the reference count is upped; otherwise a new object is created (actually taken off a free list.) The size field is there because the strings are copied into the struct. This is more overhead, but it is worth it in this case, because the strings in question are often substrings. The upshot is that there is very little allocation/deallocation which is much more expensive than string compares. It would be interesting to hear how other people have handled this sort of problem. -- Richard Harter, Software Maintenance and Development Systems, Inc. Net address: jjmhome!smds!rh Phone: 508-369-7398 US Mail: SMDS Inc., PO Box 555, Concord MA 01742 This sentence no verb. This sentence short. This signature done.