Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83 v7 ucbtopaz-1.8; site ucbtopaz.CC.Berkeley.ARPA
Path: utzoo!watmath!clyde!burl!ulysses!allegra!mit-eddie!genrad!decvax!ucbvax!ucbtopaz!gbergman
From: gbergman@ucbtopaz.CC.Berkeley.ARPA
Newsgroups: net.nlang,net.news,net.math.stat
Subject: Positive thinking
Message-ID: <601@ucbtopaz.CC.Berkeley.ARPA>
Date: Tue, 13-Nov-84 18:12:21 EST
Article-I.D.: ucbtopaz.601
Posted: Tue Nov 13 18:12:21 1984
Date-Received: Thu, 15-Nov-84 02:28:37 EST
Organization: Univ. of Calif., Berkeley CA USA
Lines: 69
Xref: genrad net.nlang:2317 net.news:2443 net.math.stat:68

In the 489 text-lines in the net.jobs files currently on our
machine, there are only 2 occurrences of the word "not"!
Here is a comparison with results for some other net groups:

                      LINES     LINES
    GROUP             WITH      OF
                      "not"     TEXT

    net.flame           84      2394     3.51%
    net.general          8       672     1.19%
    net.jobs             2       489     0.41%
    net.math            10       426     2.34%
    net.news            83      2345     3.54%
    net.nlang          101      2168     4.65%
    net.unix-wizards   136      3633     3.77%

What started all this was that I was posting a job announcement
on someone's behalf, and I wanted to see what phrase was used on the
net corresponding to the local "Please do not reply to this account".
So I went into the directory with the net.jobs files, did "grep not *",
and was surprised to find only two occurrences (neither of which was
what I wanted).

To get some good comparisons, I eventually set up the alias

    alias not 'sed -n -f file1 /usr/spool/news/net/\!^/[1-9]*[0-9]|sed -n -f file2|sed -n -f file3'

(file1 through file3 are given below). Then the command "not groupname"
gave me my raw statistics. In the alias, the file-listing
.../[1-9]*[0-9] is there to exclude entries like uucp and 4bsd under
"bugs", which are actually subdirectories and confuse sed.
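For anyone who would rather not rebuild the csh alias, here is a rough
Python sketch of the same line-based count. It is not what produced the
table above, only an approximation of the three sed stages. The names
count_not, count_not_in_group and percent are made up for illustration,
the directory argument is an assumed spool path, and the header pattern
is the corrected one noted as an afterthought in the file1 listing.

```python
import re
from pathlib import Path

# Header lines look like "Xxxxx:" (the corrected pattern from file1).
HEADER = re.compile(r"^[A-Z][^ ]*:")
# Same test as file2: "not" with no letter before it and no
# lowercase letter after it.
NOT_WORD = re.compile(r"[^A-Za-z]not[^a-z]")


def count_not(lines):
    """Return (lines containing "not", text lines) for an iterable of lines."""
    hits = total = 0
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "" or HEADER.match(line):
            continue  # file1: drop empty lines and header material
        total += 1
        # file2: pad with blanks so "not" at either end of a line is caught
        if NOT_WORD.search(" " + line + " "):
            hits += 1
    return hits, total


def count_not_in_group(group_dir):
    """Apply count_not to the article files in one newsgroup directory.

    group_dir is an assumed path such as /usr/spool/news/net/jobs.
    """
    hits = total = 0
    for path in sorted(Path(group_dir).glob("[1-9]*[0-9]")):
        if path.is_file():  # skip subdirectories that match the pattern
            with open(path, errors="replace") as f:
                h, t = count_not(f)
            hits, total = hits + h, total + t
    return hits, total


def percent(hits, total):
    """Percentage of text lines containing "not" (0.0 if there are none)."""
    return 100.0 * hits / total if total else 0.0
```

With the net.jobs figures, percent(2, 489) comes to about 0.41. Because
the sketch returns the count of matching lines directly, there is no
extra count-line to subtract afterwards.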
The sed-script file1 removes all header-material (lines beginning
Xxxxx:) and all empty lines. The sed-script file2 prints only lines
containing the word "not", but ends by giving the total number of
lines it has seen. Finally, file3 prints the number of lines it has
seen, followed by the last line it sees (which is the total from
file2). The first of these two numbers will be one more than the
number of lines containing "not", since file3 also sees the count-line
from the previous script. So one subtracts 1 from the first of the two
numbers one gets, and takes the ratio.

The method is primitive -- one should really count words rather than
lines, since files with short lines will naturally yield a lower
frequency of any word, and on the other hand multiple occurrences in
one line are not counted.

It is curious that in net.general, which had the second lowest ratio,
more than half the "not"s came from one file, the MT XINU bug report
announcement with its disclaimers. Without that it might approach
net.jobs in sparsity of "not"s. I'm really surprised at the low
frequency of this very basic word.

I'm not going to pursue this any further -- newsstats@seismo, want to
start a new project? (Featuring a different word each month?)

Here are the sed scripts used:

::::::::::::::
# file1
/^[A-Z][a-z]*:/d
# afterthought -- this should have been /^[A-Z][^ ]*:/d
/^$/d
p
::::::::::::::
# file2
s/.*/ & /
# above command pads all lines, so next won't miss "not" at either end
/[^A-Za-z]not[^a-z]/p
$=
::::::::::::::
# file3
$=
$p
::::::::::::::

			George Bergman
			Math, UC Berkeley 94720 USA
			...!ucbvax!gbergman%cartan