Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83 v7 ucbtopaz-1.8; site ucbtopaz.CC.Berkeley.ARPA
Path: utzoo!watmath!clyde!burl!ulysses!allegra!mit-eddie!genrad!decvax!ucbvax!ucbtopaz!gbergman
From: gbergman@ucbtopaz.CC.Berkeley.ARPA
Newsgroups: net.nlang,net.news,net.math.stat
Subject: Positive thinking
Message-ID: <601@ucbtopaz.CC.Berkeley.ARPA>
Date: Tue, 13-Nov-84 18:12:21 EST
Article-I.D.: ucbtopaz.601
Posted: Tue Nov 13 18:12:21 1984
Date-Received: Thu, 15-Nov-84 02:28:37 EST
Organization: Univ. of Calif., Berkeley CA USA
Lines: 69
Xref: genrad net.nlang:2317 net.news:2443 net.math.stat:68

In the 489 text-lines in the net.jobs files currently on our
machine, there are only 2 occurrences of the word "not"!  Here
is a comparison with results for some other net groups:

		     LINES   LINES
GROUP		      WITH	OF
		     "not"    TEXT    RATIO
net.flame		 84   2394    3.51%
net.general		  8    672    1.19%
net.jobs		  2    489    0.41%
net.math		 10    426    2.34%
net.news		 83   2345    3.54%
net.nlang		101   2168    4.65%
net.unix-wizards	136   3633    3.77%

     What started all this was that I was posting a job announcement
on someone's behalf, and I wanted to see what phrase was used on
the net corresponding to the local "Please do not reply to this
account".  So I went into the directory with the net.jobs files, did
"grep not *", and was surprised to find only two occurrences (neither
of which was what I wanted).  To get some good comparisons, I eventually
set up the alias

alias not 'sed -n -f file1 /usr/spool/news/net/\!^/[1-9]*[0-9]|sed -n -f file2|sed -n -f file3'

(file1, file2 and file3 are given below).
Then the command "not groupname" gave me my raw statistics.  In the
alias, the file-listing .../[1-9]*[0-9] is to exclude files like
uucp and 4bsd under "bugs", which are actually subdirectories and
confuse sed.  The sed-script file1 removes all header-material (lines
beginning Xxxxx:) and all empty lines; the sed-script file2 prints only
lines containing the word "not", but ends by giving the total number
of lines it has seen; and file3 prints the number of lines it has seen,
followed by the last line it sees (the count-line from file2).  The first
number will be one more than the number of lines containing "not", since
file3 also counts the count-line from the previous script.  So one
subtracts 1 from the first of the two numbers one gets, and takes its
ratio to the second.
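
     The three-stage pipeline and the subtract-one step can be folded
into a single pass.  What follows is only a sketch, not the alias that
produced the table: the function name not_ratio is hypothetical, it is
written for a POSIX sh rather than as a csh alias, awk stands in for
file2 and file3, and the header pattern is the one from the afterthought
in file1, so its counts may differ slightly from the figures above.

```shell
# not_ratio FILE... (hypothetical helper, not part of the original alias)
# Prints: <lines containing "not"> <text lines> <percentage>
not_ratio() {
    # As in file1: drop header lines and empty lines.
    # As in file2: pad each line with blanks so "not" at either end is seen.
    sed -e '/^[A-Z][^ ]*:/d' -e '/^$/d' -e 's/.*/ & /' "$@" |
    awk '
        # Same pattern as file2; awk counts matches instead of printing them,
        # so there is no count-line to subtract afterwards.
        /[^A-Za-z]not[^a-z]/ { hits++ }
        END {
            if (NR == 0) { print "0 0 0.00"; exit }
            printf "%d %d %.2f\n", hits, NR, 100 * hits / NR
        }'
}
```

Called as "not_ratio FILE...", it prints the two counts and the
percentage on one line.  Note that printf rounds the percentage to two
places.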
     The method is primitive -- one should really count words rather
than lines, since files with short lines will naturally yield a lower
frequency of any word, and on the other hand multiple occurrences in
one line are not counted.  It is curious that in net.general, which had
the second lowest ratio, more than half the "not"s came from one
file, the MT XINU bug report announcement with its disclaimers.
Without that file it might approach net.jobs in sparsity of "not"s.  I'm
really surprised at the low frequency of this very basic word.  I'm not
going to pursue this any further -- newsstats@seismo, want to start a
new project?  (Featuring a different word each month?)
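
     Counting words rather than lines, as suggested above, might look
roughly like the following.  This is a sketch under assumptions, not
something that was run for the table: the name not_per_word is
hypothetical, a "word" is taken to be any maximal run of letters (so
"don't" splits into "don" and "t"), and only the exact lowercase word
"not" is counted, so "Not", "cannot" and "nothing" are left out.

```shell
# not_per_word FILE... (hypothetical helper)
# Prints: <occurrences of "not"> <total words> <percentage>
not_per_word() {
    # Drop header lines (afterthought pattern from file1), then put each
    # run of letters on its own line.
    sed -e '/^[A-Z][^ ]*:/d' "$@" |
    tr -cs 'A-Za-z' '\n' |
    awk '
        $0 == ""    { next }      # skip the empty line tr may emit first
                    { words++ }
        $0 == "not" { hits++ }    # exact lowercase match only
        END {
            if (words == 0) { print "0 0 0.00"; exit }
            printf "%d %d %.2f\n", hits, words, 100 * hits / words
        }'
}
```

Because every occurrence is its own line after tr, several "not"s in
one line of text are each counted, and short lines no longer lower the
ratio.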
     Here are the sed scripts used:

::::::::::::::
# file1
/^[A-Z][a-z]*:/d
# afterthought -- this should have been /^[A-Z][^ ]*:/d
/^$/d
p
::::::::::::::
# file2
s/.*/ & /
# above command pads all lines, so next won't miss "not" at either end
/[^A-Za-z]not[^a-z]/p
$=
::::::::::::::
# file3
$=
$p
::::::::::::::
			George Bergman
			Math, UC Berkeley 94720 USA
			...!ucbvax!gbergman%cartan