Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!cs.uoregon.edu!ogicse!intelhf!ichips!inews!hopi!bhoughto From: bhoughto@hopi.intel.com (Blair P. Houghton) Newsgroups: comp.lang.c Subject: Re: Efficient dynamic storage question Message-ID: <3928@inews.intel.com> Date: 21 Apr 91 04:48:56 GMT References: <32121@usc> <1991Apr20.230021.25248@Think.COM> Sender: news@inews.intel.com Organization: Intel Corp, Chandler, AZ Lines: 93 In article <1991Apr20.230021.25248@Think.COM> barmar@think.com (Barry Margolin) writes: >In article <32121@usc> ajayshah@alhena.usc.edu (Ajay Shah) writes: >>Question: is it much more efficient to realloc in groups of (say) >>100 observations at a time instead of calling realloc once for each Tons. If you need to use an array (though not a declared one) instead of, e.g., a linked-list, use one that acts like an array. If you're worried about empty slots past the last filled slot, you can always use realloc(3) to shrink the array to fit your data (this is probably what it was invented for...) >Yes, it's generally better to minimize the reallocs. Consider an >implementation of malloc that allocates a new object right where the >previous object ended (perhaps with some header information interspersed). >The reallocs would have to copy because the previous malloc would prevent >the realloc from simply extended the array. i.e. Realloc can't extend an array into a space that is used already. >Thus, at worst you would probably get about the same performance whether or >not you grow by larger chunks, but at best you may speed things up quite a Not larger, always; sometimes _smaller_ chunks are better. If you know what kind of data you're reading, you should be able to get a distribution of the sizes of arrays you can expect to use for that data. If the data often reach 10.00Mb, but rarely 10.02Mb, it's very wasteful to allocate another 10Mb of heap just because you've reached 10Mb. I was playing around with this sort of thing a couple of months ago, but productivity got in the way and I never implemented it on a well-characterized database. The basic idea was to keep an array of the likely array sizes, and reallocate to the next size when one was reached. The first step in the development (can you say "bottom-up design"? I knew that you could (-: ) was generating some distributions. So what I have is a filter (called 'loghist') that takes a stream of numbers (actually, lines that start with numbers, for you du(1)-output fans) and generates a cumulative histogram[*] with a logarithmic ordinate axis. It's not nearly clean enough for comp.sources.*, but I'd be glad to mail it to anyone who wants it. The next step is to translate the c.h. (or chuck it out entirely, if necessary; I don't know, because this is about where the deadline on another project kicked-in...) into the set of numbers which divide the data into intervals that have some rational probabilities (i.e., to reshape the histogram by choosing uneven increments in the ordinate). Whether to use a probability sequence of {.5, .75, .875, ...} or {.9, .99., .999, ...} I haven't determined. That may best be determined by the kind of data involved and the economics of memory usage vs. reallocation-execution time on the system. Thus it should be possible to give the optimal size of an incremental reallocation. Running loghist on my own files (the output of 'du -a ~') to see what sort of space "general user-file bytes" data would need, I found that my files are fairly normal-distribution distributed in the 0-5k range (there's a lot of saved news articles in there, so it peaks around 2k; it's also more accurately a Student's-t distribution, for you statistics fans), then fairly evenly (and low) up to about 20k, then sparse until a few clusters (of similar files, usually multiple, small-difference versions of specialized data) show up in the 100k-10Mb range. Creating an array to read this, I'd start with 5k (about 90% of the files), then if a realloc was needed, I'd add enough space to reach 20k, and then enough to reach each of the clusters, as necessary. Your mileage may vary. --Blair "It took longer to explain it than it did to do it..." [*] A cumulative histogram is to a histogram what the cumulative distribution function is to the probability distribution function: the probability that an event will have its result in a certain interval is the difference between the values of the c.d.f. at the endpoints of that interval. A cumulative histogram is, therefore, the statistical analogue of the cumulative distribution function (which is deduced, rather than statistical, and is normalized so that F(oo) = 1).