Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!cs.utexas.edu!usc!elroy.jpl.nasa.gov!jpl-devvax!lwall From: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) Newsgroups: comp.lang.perl Subject: Re: Reducing data space of a perl program Message-ID: <6782@jpl-devvax.JPL.NASA.GOV> Date: 13 Jan 90 01:29:10 GMT References: <1220@zip.eecs.umich.edu> Reply-To: lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) Organization: Jet Propulsion Laboratory, Pasadena, CA Lines: 104 In article <1220@zip.eecs.umich.edu> don@zip.eecs.umich.edu (Don Winsor) writes: : We have been working with a large and complicated perl program that : uses lots of memory (the resident set is up to 8 megabytes). To try : to optimize it, we would like to find out which data structures are : using the most space. Has anyone done this? Does anyone know of a : good way to determine how much storage a perl array or associative : array is consuming? The ideas that come to my mind are to either : write some kind of perl utilities to traverse arrays and find their : total size, or to compile perl with -g and use dbx to dig through : it's memory. Any advice or suggestions would be appreciated. Which way you want to do it depends on how accurate you want to be. The overhead for a normal array @foo (assuming non-sparse, unshrunk): sizeof(STAB)+sizeof(STBP)+sizeof(ARRAY)+ elems * sizeof(STR*) * Aslop * Mslop + elems * sizeof(STR) * Mslop + $#foo sum maxlen($foo[$i]) * Mslop $i = 0 where elems is the largest number of elements the array has held, maxlen($foo[$i]) is either the length of the longest string value ever stored there plus a few bytes, with a minimum of 24 if the element was assigned a floating point value and was subsequently referenced as a string, Aslop is the amount wasted at the end of the array of pointers to STR, in the range of (1.00 .. 1.20) (unless perl detects that you are shifting off the front and pushing onto the end, in which case Aslop gets pushed up to about 10.0!!!), Mslop is the amount wasted at the end of all malloced strings, in the range of (1.00 .. 2.00) on BSD systems. From within the language it's possible (knowing how the elements have been used) to estimate maxlen for each string, but not possible to determine it precisely. It's not possible to determine the precise value of Aslop unless you know the history of which subscripts were created when. It's possible to estimate individual Mslop's if you know your Aslop of the array and the maxlen's of the individual strings, since it's a matter of rounding up to the next power of 2 (on BSD). Statistically, your average Mslop will be about 1.5, but it would be easy to construct pathological cases. The situation for associative arrays is similar. The differences are that it's more difficult to tell how many strings are actually in use, and there's an extra structure for each element to hold the key pointer and hash value, and the extra string for the key itself. The overhead for an associative array %bar: sizeof(STAB)+sizeof(STBP)+sizeof(HASH)+ tblsize * sizeof(STR*) * Mslop + elems * sizeof(HENT) * Mslop + elems * sizeof(STR) * Mslop + sum maxlen($bar{$key}) * Mslop (over the keys of the array) + sum maxlen($key) * Mslop (over the keys of the array) where tblsize is the size of the hash table, elems is the number of elements in the array, and the other fields are as before. The tblsize can be determined by saying ($buckets_used, $tblsize) = split(/\//, %bar); elems will be a little larger than $buckets_used, because most buckets will only contain one element, but to get an exact figure you'd have to say @foo = values(%bar); $elems = $#foo + 1; The Mslop for the hash table size should be fairly constant since the hash tables are always a power of two in size. If you have a memory allocator that allocates in 2^n-k chunks, where k is greater than 0, you could waste 2^n-2k space for each table of size 2^n. As you can see, determining memory usage from within a script is an exercise in estimation. As for using dbx, I wouldn't want to try it. Just locating the symbol table entry for an array is a non-trivial operation. One tires of typing... Now, if you ran it under Sabre-C, it'd be easier to follow all those pointers around. (Wish I had Saber-C, sigh...) Your best bet might be to try to instrument the perl code. There's already instrumentation code in there when you say -DLEAKTEST. Unfortunately, it just counts the number of allocations, not their sizes. But it would be fairly easy to add in the sizes. Then you could find out the size of various things that were allocated in various places. (Unfortunately, it wouldn't give you a collective figure for the size of an array.) But you could write a routine to go through and find that out. If you knew what your malloc headers looked like, you could probably even account for Mslop. One could do something silly like making the defined operator return the size (used and allocated) of an object if, say -D64 was given. One could even define a new operator for it, if it turned out to be generally useful. Well, that's enough for the moment. Larry