Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!cbosgd!ihnp4!hplabs!glacier!reid
From: reid@glacier.ARPA (Brian Reid)
Newsgroups: net.news,net.news.group
Subject: Re: polling and statistics
Message-ID: <5476@glacier.ARPA>
Date: Wed, 19-Mar-86 03:00:41 EST
Article-I.D.: glacier.5476
Posted: Wed Mar 19 03:00:41 1986
Date-Received: Fri, 21-Mar-86 04:58:19 EST
References: <1953@saber.UUCP> <896@vortex.UUCP>
Reply-To: reid@glacier.UUCP (Brian Reid)
Organization: Stanford University, Computer Systems Lab
Lines: 57
Xref: watmath net.news:4683 net.news.group:5260


Several people have brought up the issue of self-selected samples in private
mail to me; I've answered individually. Since Lauren posted this note I feel
it's time to post an explanation.

Summary: this is not a self-selected poll. It has a certain self-selected
flavor to it, though an indirect one, but the classic problems of
self-selected respondents do not apply here. Unfortunately, there is no way
to tell how much the self-selection factor is affecting things, but I am
confident that its effect is small, and I am certain that its effect is not
dominating the data.

The reason for this is that I don't need a response from every user; I only
need a response from one user per site. Naturally the poll selects in favor
of sites that have users who are willing to respond, but if the average
population of a site is high enough, then the results are reasonably
unbiased. The poll is therefore biased in favor of larger sites, but that
turns out to be OK because the larger sites are where most of the readers
are.
Sites "well", "ritcv", "gitpyr", and "cod" have 700 netnews readers among
the four of them (1 service bureau, 2 university machines, 1 government lab).
Those sites (and dozens of others like them) totally dominate the 5-user
Unisoft 68000 machines whose 5 users are all too busy to respond.
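
A minimal sketch of the one-response-per-site argument, assuming purely for
illustration that every reader at a site responds independently with the same
probability. The function name and the 5% response rate are hypothetical and
do not come from the survey; 175 is simply 700 readers divided among four
sites.

```python
def site_response_probability(readers: int, p_respond: float) -> float:
    """Chance that at least one of `readers` independent users responds.

    Assumption: each reader responds independently with probability
    `p_respond`. Real sites will not behave this neatly.
    """
    return 1.0 - (1.0 - p_respond) ** readers


# 0.05 is a made-up response rate, used only to show the shape of the effect.
for readers in (5, 50, 175):
    chance = site_response_probability(readers, 0.05)
    print(f"{readers:4d} readers -> site represented with probability {chance:.3f}")
```

Under that assumption a small site is usually missing from the sample while a
large one is almost always present, which is the size bias described above.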

I've been discussing this issue at great length with various statisticians
around Stanford, and although everybody agrees that USENET is far too
complex and amorphous for quantitative analysis of the quality done for,
say, television or presidential elections, they point out that I am polling
for ratios and not for absolute counts, which tends to compensate for a
number of different kinds of bias.

My own calculations (I'm reasonably well trained in statistics) lead me to
believe that the "what percentage of the population reads this group"
columns are accurate to within about 25% (i.e. a share of 10% could be
12.5% or it could be 8%), and that the "how many people read this group,
worldwide" column is accurate to within 100% (i.e. a figure of 5000 people
reading it could be 2500 or it could be 10000). The ratios are a lot more
accurate than the absolute numbers, and the ratios between absolute numbers
are probably even more accurate.
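
The two worked examples above (10% becoming 8% or 12.5%, and 5000 becoming
2500 or 10000) are consistent with reading "accurate to within N%" as a
multiplicative factor of 1 + N/100 in either direction. The sketch below
assumes that reading; the function name is hypothetical.

```python
def multiplicative_bounds(estimate: float, pct_error: float) -> tuple[float, float]:
    """Return (low, high) for an estimate that may be off by a factor of
    (1 + pct_error/100) in either direction.

    Assumption: the error is multiplicative, which matches the worked
    examples in the text (divide for the low end, multiply for the high end).
    """
    factor = 1.0 + pct_error / 100.0
    return estimate / factor, estimate * factor


share_low, share_high = multiplicative_bounds(10.0, 25)        # percent share
readers_low, readers_high = multiplicative_bounds(5000, 100)   # worldwide readers
print(share_low, share_high)
print(readers_low, readers_high)
```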

The "Dollars per reader per month" column should perhaps have been labeled
"cost per reader per month". It's true that there is no guarantee that those
numbers are anything resembling dollars, but it is also true that whatever
units they are in, they are the same for all newsgroups and therefore can be
compared in ratio. In other words, if net.religion.christian (the most
expensive group per reader) costs 30.00 units per reader, and net.cooks
costs 1.00 unit per reader, then it is quite true that
net.religion.christian is 30 times more expensive than net.cooks FOR THE
SAMPLED POPULATION. Whether it is 30 times more expensive for the whole
network, or only 25 times more expensive, or 35 times more expensive, is
determined with the same accuracy as the readership ratios, which I believe
to be about 25%.
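
A short sketch of why the unit of the cost column does not matter for
comparisons: any unknown conversion factor applied to every newsgroup cancels
out of the ratio. The 30.00 and 1.00 figures are the illustrative ones from
the paragraph above; the conversion factor and the names are hypothetical.

```python
import math


def cost_ratio(cost_a: float, cost_b: float) -> float:
    """How many times more expensive per reader group A is than group B."""
    return cost_a / cost_b


# Illustrative per-reader costs in survey units, taken from the text.
christian_units = 30.00
cooks_units = 1.00

# Hypothetical, unknown conversion from survey units to real dollars.
unknown_scale = 0.37

ratio_in_units = cost_ratio(christian_units, cooks_units)
ratio_in_dollars = cost_ratio(christian_units * unknown_scale,
                              cooks_units * unknown_scale)

# The conversion factor cancels, so both ratios agree up to rounding.
print(ratio_in_units, math.isclose(ratio_in_units, ratio_in_dollars))
```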

The way to improve the data, of course, is to take more of it. Send in your
arbitron results to netsurvey@glacier.
-- 
	Brian Reid	decwrl!glacier!reid
	Stanford	reid@SU-Glacier.ARPA