Path: utzoo!utgpu!news-server.csri.toronto.edu!rpi!zaphod.mps.ohio-state.edu!mips!cs.uoregon.edu!ogicse!intelhf!ichips!inews!hopi!bhoughto
From: bhoughto@hopi.intel.com (Blair P. Houghton)
Newsgroups: comp.lang.c
Subject: Re: Efficient dynamic storage question
Message-ID: <3928@inews.intel.com>
Date: 21 Apr 91 04:48:56 GMT
References: <32121@usc> <1991Apr20.230021.25248@Think.COM>
Sender: news@inews.intel.com
Organization: Intel Corp, Chandler, AZ
Lines: 93

In article <1991Apr20.230021.25248@Think.COM> barmar@think.com (Barry Margolin) writes:
>In article <32121@usc> ajayshah@alhena.usc.edu (Ajay Shah) writes:
>>Question: is it much more efficient to realloc in groups of (say)
>>100 observations at a time instead of calling realloc once for each

Tons.  If you need to use an array (though not a declared
one) instead of, e.g., a linked-list, use one that acts
like an array.  If you're worried about empty slots past
the last filled slot, you can always use realloc(3) to
shrink the array to fit your data (this is probably what
it was invented for...)

>Yes, it's generally better to minimize the reallocs.  Consider an
>implementation of malloc that allocates a new object right where the
>previous object ended (perhaps with some header information interspersed).
>The reallocs would have to copy because the previous malloc would prevent
>the realloc from simply extended the array.

i.e.  Realloc can't extend an array into a space that is used already.

>Thus, at worst you would probably get about the same performance whether or
>not you grow by larger chunks, but at best you may speed things up quite a

Not larger, always; sometimes _smaller_ chunks are better.

If you know what kind of data you're reading, you should be
able to get a distribution of the sizes of arrays you can
expect to use for that data.  If the data often reach
10.00Mb, but rarely 10.02Mb, it's very wasteful to allocate
another 10Mb of heap just because you've reached 10Mb.

I was playing around with this sort of thing a couple of
months ago, but productivity got in the way and I never
implemented it on a well-characterized database.

The basic idea was to keep an array of the likely array
sizes, and reallocate to the next size when one was reached.

The first step in the development (can you say "bottom-up
design"?  I knew that you could (-: ) was generating some
distributions.  So what I have is a filter (called
'loghist') that takes a stream of numbers (actually, lines
that start with numbers, for you du(1)-output fans) and
generates a cumulative histogram[*] with a logarithmic
ordinate axis.

It's not nearly clean enough for comp.sources.*, but I'd be
glad to mail it to anyone who wants it.

The next step is to translate the c.h. (or chuck it out
entirely, if necessary; I don't know, because this is about
where the deadline on another project kicked-in...) into
the set of numbers which divide the data into intervals
that have some rational probabilities (i.e., to reshape the
histogram by choosing uneven increments in the ordinate).
Whether to use a probability sequence of {.5, .75, .875, ...}
or {.9, .99., .999, ...} I haven't determined.  That
may best be determined by the kind of data involved and
the economics of memory usage vs. reallocation-execution
time on the system.

Thus it should be possible to give the optimal size of an
incremental reallocation.  Running loghist on my own files
(the output of 'du -a ~') to see what sort of space
"general user-file bytes" data would need, I found that
my files are fairly normal-distribution distributed in
the 0-5k range (there's a lot of saved news articles in there,
so it peaks around 2k; it's also more accurately a Student's-t
distribution, for you statistics fans), then fairly evenly
(and low) up to about 20k, then sparse until a few clusters
(of similar files, usually multiple, small-difference versions
of specialized data) show up in the 100k-10Mb range.

Creating an array to read this, I'd start with 5k (about 90%
of the files), then if a realloc was needed, I'd add enough
space to reach 20k, and then enough to reach each of the
clusters, as necessary.

Your mileage may vary.

				--Blair
				  "It took longer to explain
				   it than it did to do it..."

[*] A cumulative histogram is to a histogram what the
cumulative distribution function is to the probability
distribution function:  the probability that an event will
have its result in a certain interval is the difference
between the values of the c.d.f. at the endpoints of that
interval.  A cumulative histogram is, therefore, the
statistical analogue of the cumulative distribution
function (which is deduced, rather than statistical, and
is normalized so that F(oo) = 1).