Path: utzoo!utgpu!jarvis.csri.toronto.edu!rutgers!mcnc!duke!bet
From: bet@orion.mc.duke.edu (Bennett Todd)
Newsgroups: comp.unix.wizards
Subject: Filename length statistics
Message-ID: <14749@duke.cs.duke.edu>
Date: 14 Jun 89 17:18:59 GMT
References: <19976@adm.BRL.MIL> <4530@ficc.uu.net>
Sender: news@duke.cs.duke.edu
Reply-To: bet@orion.mc.duke.edu (Bennett Todd)
Organization: Diagnostic Physics, Radiology, DUMC
Lines: 87
In-reply-to: peter@ficc.uu.net (Peter da Silva)

In article <4530@ficc.uu.net>, peter@ficc (Peter da Silva) writes:
>I've added a cumulative total
>
> 392  1   392   0.57%
>   ...
>1533 14 65848  94.96%   <-- Covers most bases.
>   ...
>  23 30 69284  99.92%   <-- Covers virtually all bases.
>   ...
>   1 51 69340 100.00%
>
>14 corresponds to SysV. 30 corresponds to SysV with DIRSIZ doubled. There were
>56 files, or 0.08%, that were longer than this.

Out of curiosity I ran this over the 2187 files under my home directory;
some of the statistics came out a little differently. Specifically, the
ones shown above come out like so for me:

 1   237 10.84%   237  10.84%
  ...
14    55  2.51%  1805  82.53%
  ...
30     4  0.18%  2166  99.04%
  ...
53     1  0.05%  2187 100.00%

(I just noticed my column ordering is different: length, count, percent
of total, cumulative count, cumulative percent. I used the awk program
someone posted, which I append at the end.)

A 14-character limit covers only ~83% of my filenames (this includes
directory names, and in particular includes "." and ".." for every
directory, so there is some structural weighting toward short names in
my statistics here).  Further, a 30-character limit would still chop
nearly 1% of my choices, 21 out of 2187. Some of our users would show
distributions skewed toward much longer filenames, others toward
shorter ones. Having a shell with filename completion certainly removes
much of the incentive for short, cryptic filenames.
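
The "21 out of 2187" figure is just the total minus the cumulative count
at length 30 (2187 - 2166 = 21, or 0.96%). As a rough sketch, the same
count can be taken directly for any limit; the function name over_limit
is made up for illustration and is not part of the script appended
below, and it assumes an awk that accepts -v:

```shell
# over_limit LIMIT
# Read pathnames on stdin and report how many final path components
# are longer than LIMIT characters. (Hypothetical helper.)
over_limit() {
	awk -F/ -v limit="$1" '
		length($NF) > limit { n++ }
		END { if (NR) printf("%d of %d (%.2f%%)\n", n, NR, n / NR * 100) }
	'
}
```

Something like "find . -print | over_limit 14" would then give the
number of names a SysV-style directory entry would chop.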

Also, I personally think that statistics like this should be collected
over home directories, not over everything below root, since many of the
filenames in the root and /usr filesystems are inherited from the
original UNIX system rather than chosen since. Further, the most useful
place I've seen for really long filenames is in organizing personal
archives, where you can make a name descriptive enough to make the file
easier to find later.
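
One way to restrict the survey to home directories, sketched under the
assumption that they are listed in the sixth colon-separated field of a
passwd-format file (the function name home_dirs is invented here):

```shell
# home_dirs [PASSWD_FILE]
# Print the distinct home directories named in a passwd-format file,
# skipping empty entries and "/". Defaults to /etc/passwd.
# (Hypothetical helper; system accounts are not filtered out.)
home_dirs() {
	awk -F: '$6 != "" && $6 != "/" { print $6 }' "${1:-/etc/passwd}" | sort -u
}
```

The resulting list could be handed to find as its starting points.
System accounts whose "home" is something like /bin would still be
included, so the list may want pruning by hand first.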

For completeness, here's the program I used (a shell script I wrapped
around an awk program someone else posted):

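#!/bin/sh
# Print a histogram of filename lengths for everything under the given
# directories (default: the current directory).

progname=`basename "$0"`
awkprg=/tmp/$progname$$

# Remove the temporary awk program on exit or on HUP, INT, or QUIT.
trap "rm -f $awkprg;exit 1" 0 1 2 3

cat >$awkprg <<'EOF'
# Split each path on "/" so that $NF is the final component.
BEGIN {FS = "/"}
{
	l = length($NF)
	c[l]++
	if(l>max) max=l
}
# Columns: length, count, percent, cumulative count, cumulative percent.
END {
	for(i=1; i<=max; i++) {
		s += c[i]
		printf("%2d %5d %5.2f%% %5d %6.2f%%\n", i, c[i], c[i]/NR*100, s, s/NR*100)
	}
}
EOF

# With no arguments, start from the current directory.
if test $# -eq 0
then
	set '.'
fi

find "$@" -print | awk -f $awkprg

# Normal exit: clean up and disarm the trap so the exit status stays 0.
rm -f $awkprg
trap "" 0 1 2 3
exit 0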
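
The temporary file is not essential. As a sketch of an alternative (not
the form used above), the same awk program can be given inline in a
shell function that reads pathnames on stdin; the name fnlen_hist is
made up for this example:

```shell
# fnlen_hist
# Read pathnames on stdin; for each length from 1 to the longest final
# component seen, print: length, count, percent, cumulative count,
# cumulative percent. (Hypothetical name; same awk logic as the script.)
fnlen_hist() {
	awk -F/ '
	{
		l = length($NF)
		c[l]++
		if (l > max) max = l
	}
	END {
		for (i = 1; i <= max; i++) {
			s += c[i]
			printf("%2d %5d %5.2f%% %5d %6.2f%%\n", i, c[i], c[i]/NR*100, s, s/NR*100)
		}
	}'
}
```

It would be used as "find . -print | fnlen_hist". This avoids both the
trap handling and the predictable filename in /tmp, at the cost of
requiring a shell that supports functions.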

-Bennett
bet@orion.mc.duke.edu
P.S. Tonight I'm going to run the same thing over everyone's home
directories on our system, as well as over everything from the root
down; I'll post the results tomorrow if all goes well.