Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!watmath!clyde!cbosgd!ihnp4!hplabs!glacier!reid
From: reid@glacier.ARPA (Brian Reid)
Newsgroups: net.news,net.news.group
Subject: Re: polling and statistics
Message-ID: <5476@glacier.ARPA>
Date: Wed, 19-Mar-86 03:00:41 EST
Article-I.D.: glacier.5476
Posted: Wed Mar 19 03:00:41 1986
Date-Received: Fri, 21-Mar-86 04:58:19 EST
References: <1953@saber.UUCP> <896@vortex.UUCP>
Reply-To: reid@glacier.UUCP (Brian Reid)
Organization: Stanford University, Computer Systems Lab
Lines: 57
Xref: watmath net.news:4683 net.news.group:5260

Several people have brought up the issue of self-selected samples in private mail to me; I've answered individually. Since Lauren posted this note I feel it's time to post an explanation.

Summary: this is not a self-selected poll. It has a certain self-selected flavor to it, though an indirect one, but the classic problems of self-selected respondents do not apply here. Unfortunately, there is no way to tell how much the self-select factor is affecting things, but I am confident that its effect is small, and I am certain that its effect is not dominating the data.

The reason for this is that I don't need a response from every user, I only need a response from one user per site. Naturally the poll selects in favor of sites that have users who are willing to respond, but if the average population of a site is high enough, then the results are reasonably unbiased. It is therefore biased in favor of larger sites, but that turns out to be OK because the larger sites are where most of the readers are. Sites "well", "ritcv", "gitpyr", and "cod" have 700 netnews readers among the four of them (1 service bureau, 2 university machines, 1 government lab). That (and dozens of other sites like them) totally dominate the 5-user Unisoft 68000 machines whose 5 users are all too busy to respond.
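The site-size argument above can be put into numbers with a minimal sketch. Only the total of 700 readers across four large sites and the 5-reader size of the small machines come from the post; the even split across the four sites, the count of 40 silent small machines, and the function name `reader_coverage` are assumptions made purely for illustration.

```python
def reader_coverage(responding_sites, silent_sites):
    """Fraction of all readers who sit at a site that sent in a report."""
    covered = sum(responding_sites)
    total = covered + sum(silent_sites)
    return covered / total

# 700 readers across four large sites is from the post; the even split
# (175 each) is an assumption, since only the total is given.
large_sites = [175, 175, 175, 175]

# Hypothetical: 40 small machines with 5 readers each, none of them responding.
small_sites = [5] * 40

coverage = reader_coverage(large_sites, small_sites)
print(f"{coverage:.1%} of readers are at responding sites")

# How many silent 5-reader machines it takes to match the four large sites.
break_even = sum(large_sites) // 5
print(f"{break_even} silent 5-reader machines equal the four large sites")
```

Under these assumed numbers the silent machines would have to outnumber the large sites many times over before they held as many readers, which is the sense in which the large sites dominate.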

I've been discussing this issue at great length with various statisticians around Stanford, and although everybody agrees that USENET is far too complex and amorphous for quantitative analysis of the quality done for, say, television or presidential elections, they point out that I am polling for ratios and not for absolute counts, which tends to compensate for a number of different kinds of bias.

My own calculations (I'm reasonably well trained in statistics) lead me to believe that the "what percentage of the population reads this group" columns are accurate to within about 25% (i.e. a share of 10% could be 12.5% or it could be 8%), and that the "how many people read this group, worldwide" column is accurate to within 100% (i.e. a figure of 5000 people reading it could be 2500 or it could be 10000). The ratios are a lot more accurate than the absolute numbers, and the ratios between absolute numbers are probably even more accurate.

The "Dollars per reader per month" column should perhaps have been labeled "cost per reader per month". It's true that there is no guarantee that those numbers are anything resembling dollars, but it is also true that whatever units they are in, they are the same for all newsgroups and therefore can be compared in ratio. In other words, if net.religion.christian (the most expensive group per reader) costs 30.00 units per reader, and net.cooks costs 1.00 unit per reader, then it is quite true that net.religion.christian is 30 times more expensive than net.cooks FOR THE SAMPLED POPULATION. Whether it is 30 times more expensive for the whole network, or only 25 times more expensive, or 35 times more expensive, is determined with the same accuracy as the readership ratios, which I believe to be 25%.

The way to improve the data, of course, is to take more of it. Send in your arbitron results to netsurvey@glacier.
-- 
	Brian Reid	decwrl!glacier!reid
	Stanford	reid@SU-Glacier.ARPA
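
The accuracy figures quoted in the post (10% could be 12.5% or 8%; 5000 could be 2500 or 10000) are both consistent with reading "accurate to within p%" as a multiplicative factor of 1 + p/100 in either direction. The sketch below works through that reading; the interpretation and the helper name `factor_bounds` are assumptions, not something the post states explicitly.

```python
def factor_bounds(estimate, pct):
    """Return (low, high) when 'within pct%' means a factor of 1 + pct/100."""
    factor = 1 + pct / 100
    return estimate / factor, estimate * factor

# Readership share: a 10% share, accurate to within 25%.
share_low, share_high = factor_bounds(10, 25)
print(share_low, share_high)

# Worldwide readers: 5000 people, accurate to within 100%.
count_low, count_high = factor_bounds(5000, 100)
print(count_low, count_high)

# Cost ratio: 30.00 units vs 1.00 unit per reader, accurate to within 25%.
ratio_low, ratio_high = factor_bounds(30.00 / 1.00, 25)
print(ratio_low, ratio_high)
```

Under this reading the 30-to-1 cost ratio would fall between 24 and 37.5, which brackets the "25 times" and "35 times" figures used as examples in the post.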