Xref: utzoo news.config:1236 news.admin:5791
Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!decwrl!hplabs!pyramid!epimass!jbuck
From: jbuck@epimass.EPI.COM (Joe Buck)
Newsgroups: news.config,news.admin
Subject: Re: new survey to supplement arbitron. Please run this program.
Message-ID: <3219@epimass.EPI.COM>
Date: 22 May 89 20:23:37 GMT
References: <80@jove.dec.com> <3215@epimass.EPI.COM> <85@jove.dec.com>
Reply-To: jbuck@epimass.EPI.COM (Joe Buck)
Followup-To: news.config
Organization: Entropic Processing, Inc., Cupertino, CA
Lines: 61

In article <85@jove.dec.com> reid@decwrl.dec.com (Brian Reid) writes:
>Joe,
>  It's been my experience that there's too much variation in history file
>formats out there for programs based on the history file to be very
>universal.

As far as I know, both the 2.11 news and C news history file formats give
the article file name(s) in the third tab-separated field, and the 2.11
format looks the same as it has since 2.10.2 at least.  I do not know
what format TMNN uses.  I know of no other formats.

>  Also I claim it is not an error to doublecount a crossposted message.

Depends on what you intend to accomplish.  If all news categories had
the same propagation, and if cross-posting were equally prevalent in
all categories, it wouldn't make a difference in any attempt to measure
the news topology.  Unfortunately, neither assumption holds.  Many sites
don't get the talk and soc groups, yet crossposting seems more common in
exactly those groups.  Result: distribution links that send talk groups
will be emphasized more than those that send only comp and news.

Actually, I think the arbitron statistics are getting increasingly
distorted due to this phenomenon.  I know there are a large number of
"comp"-only sites out there, and I submit that such sites are far less
likely to run arbitron.  Result: comp is more popular, and rec, soc, and
talk less popular, than the extrapolated statistics show.  Can't prove
this, of course.

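To make the history-file point concrete, here is a rough sketch (in Python, purely for illustration) of counting each article once by reading only the third tab-separated field.  It assumes that field holds space-separated names of the form group/number and that an expired or cancelled entry has an empty or missing third field; neither detail is guaranteed by anything above.  The function name count_articles is made up for this example, and "first name listed" is only a stand-in for "first group posted to".

```python
from collections import Counter


def count_articles(history_lines):
    """Count each article once, under the first name in its history entry.

    Assumed layout (not a documented format):
        <message-id> TAB <date info> TAB group/number group/number ...
    """
    counts = Counter()
    for line in history_lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: fewer than three fields, or an empty third field,
        # means the article has expired or was cancelled.
        if len(fields) < 3 or not fields[2].strip():
            continue
        first_name = fields[2].split()[0]
        # Assumption: names look like "group/number"; keep the group part.
        group = first_name.rpartition("/")[0]
        if group:
            counts[group] += 1
    return counts
```

Feeding it the lines of a history file (for example, an open file object) returns a per-group tally in which a crossposted article contributes to one group only.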
>  Also perl is not universal, though it is certainly nice.

Perhaps my communication wasn't clear here.  I just intended that program
as an example of how to process all news articles by scanning the (2.11)
history file.  I wasn't necessarily saying you should use it instead.

>  If anybody
>is willing to massage that perl script into a format that my automated
>processing programs can use, I'd be happy, but I don't want to encourage
>people to run software that takes advantage of undocumented features of the
>news system (e.g. the history file format). That will make it much more
>difficult to make future changes to that format.

I see your point, but exactly what features of the news system can we say
are documented?  That the news articles are each in separate files and
have all-numeric names (an assumption your program makes, which is not
true of notes, for example)?  There's a discussion over on
news.software.nntp on precisely this point, over a proposal to extend
the LIST command to display more files.  About the only absolutely
standard program you could write would be one that assumes nothing but
NNTP and accesses all articles that way (slow, slow, slow).

Another way of eliminating double-counted crossposts using your method
is to check the Newsgroups: header and not count the article if the
first newsgroup listed doesn't match the group the file is stored under
(so articles are only counted in the first group posted to).

-- 
-- Joe Buck	jbuck@epimass.epi.com, uunet!epimass.epi.com!jbuck