Newsgroups: comp.lang.apl Path: utzoo!utgpu!news-server.csri.toronto.edu!torsqnt!geac!itcyyz!yrloc!hui From: hui@yrloc.ipsa.reuter.COM (Roger Hui) Subject: Statistical Functions in J Message-ID: <1991May12.145907.19563@yrloc.ipsa.reuter.COM> Reply-To: hui@yrloc.ipsa.reuter.COM (Roger Hui) Organization: Iverson Software Inc. Date: Sun, 12 May 91 14:59:07 GMT The definitions of mean (m.), normalize (n.), and spread (s.) have changed in response to a detailed critique by Professor Fraser Jackson of the Victoria University of Wellington (uunet!matai.vuw.ac.nz!jackson). Professor Jackson's comments are as follows (minor editing by me): -------------------------------------------------------------------- The concept of expectation is central to both probability and statistics. Indeed it is quite possible to axiomatise probability theory in terms of expectations rather than probabilities. One didactic advantage of doing so is that the whole apparatus of measure theory essential for an axiomatisation based on probabilities can be introduced much later. From the classical viewpoint, P. Whittle, Probability, Penguin, 1970? (To appear in a revised edition by Springer Verlag) gives an illustration of this alternative approach in a text designed to cover the subject from first principles. He illustrates numerous results from quantum theory to investment analysis where the expectation is the natural function to use. From the subjective probability perspective, De Finneti's classic published by Wiley also provides an axiomatisation based on expectation. m. I believe therefore that the function m. should be the expectation. This would give m. y as the usual mean x m. y as the mean formed with weights given by the vector x%+/x Hence the monadic case would just be 1 m. y This definitions permits a very simple treatment of something which appears at the beginning of virtually all statistical texts - frequency distributions. For a frequency distribution with frequencies f and values y this gives (in expressions which I am sure can be written more simply) the mean as f m. y deviations from the mean y - f m. y the mean squared deviation f m. *: y - f m. y In doing calculations with density functions the weighted mean is always required to find statistics of the distribution, and if there is any transformation of the variable then that also requires use of a weighted mean calculation to find parameters of the transformed distribution. Since variable transformation is often required the function proves very useful in numerical studies. The expectation of a function g of y under a density p y becomes just (p y) m. g y or better (p m. g) y Using an obvious notation, the two commonest price index numbers become laspeyres (p0*q0) m. p1%p0 paasche (p0*q1) m. p1%p0 In the Dictionary, the concept of using the left argument for the degrees of freedom is introduced. The degrees of freedom are generally only important in second moment calculations with unweighted data. It seems to me that is giving a special case unwarranted importance. The degrees of freedom are simply corrected for using a multiplier. In the simplest case it is just (#%(_1&+&#)) y , but a whole range of other cases are equally simply dealt with. I note that Table 1 uses moment as the description of the dyadic case of m. If that represents an earlier (or later idea) it again seems to me inappropriate. The moments are highly specialised means and best dealt with using the mean of some transformation of the variable, e.g. in the monadic case m. (y - m. y)^4 with the dyadic case f m. ( y - f m. y)^4 n. I would define n. to be consistent with m. Hence the monadic case would be with divisor #y, and the dyadic case would give normalised deviations from a weighted mean. Again if there is a need to adjust degrees of freedom for a particular case I would prefer to see a multiplier used since in many cases items have non-integer weights. Alternatively, a similar procedure could be adopted as with the fit conjunction, this time however using a degrees of freedom conjunction. That solution could give I believe the features you want with the added flexibility and power permitted by generalising to weighted means. The DF conjunction would then only be needed in cases where the data were individual observations and the adjustment was required for a particular statistic or calculation. The dictionary definition of the dyadic case seems to me to give an expression almost never used since apart from squared deviations we seldom use a mean with divisor less than #y. In the squared deviations case when it is used it is not then common to normalise the squared values in any calculations I have had occasion to use. s. I believe a case can be made here for two functions. The first would give a measure of spread for the variables considered individually, and be defined across arbitrary arrays. For that I would prefer either the sum of squared deviations or the mean squared deviation on the grounds noted below. The second is a function which generates information about the covariation of the variables in each item of the data object. If there is to be just one function, I would favour this, in view of the importance of many areas of multi-variate analysis. The standard deviation is a useful measure but very specialised to the univariate case. Further the standard deviation is not an additive measure like the variance (using it now with divisor #y) and the sum of squares. Variances can be simply combined, using the m. function above. (For a population the variance is just the weighted mean of the variances if the populations consists of subsets with different variances.) The standard deviation needs to be squared before it can be combined. As soon as you move to multivariate statistics it is the covariance matrix which is the basic tool, not the correlation matrix or the variable standard deviations. In multivariate analyses it has been my experience that the speed of formation of a cross product matrix X'X is crucial. This is so important that it deserves to be a special function and programmed for speed of execution. So my primitive function would be to form (|:y) +/ . * y in the monadic case. A case can be made for using the mean value of the outer products of the items of y for consistency with the previous definitions though I do not find that compelling. The avoidance of centering within the function is deliberate. Nearly all of the multivariate texts and the works on regression express the theory in terms of the original observation units. The case where the data are deviations from the mean then becomes a special case. With the m. primitive function it is easily treated as just that. The conventional covariance matrix then is (#y)%~(|:yd) +/ . * yd =. y - m. y but for the occasions where that is required, if s. is the cross product verb it is simple enough to enter (#y)%~ s. y - m. y Using the normalised values we obtain the correlation matrix as (#y)%~ s. n. y The choice of dyadic form is more difficult. I believe that consistency with the previous definitions would lead to choice of the dyadic form x s. y as weighting the items in y by %:x where x is a vector of weights. In this way each item in y is given a weight in the crossproduct matrix equal to the corresponding item in x. The weights are not normalised to sum to 1 since in the applications a range of normalisations will be used. This provides a very powerful primitive function, encompassing all of the weighted regression models, and the powerful tools of generalised least squares as outlined by McCullough and Nelder in 'Generalised Linear Models' (2nd ed) Chapman and Hall, 1989. It would also provide a primitive function for many methods based on weighted observations in Econometrics, in particular methods due to H. White of constructing consistent estimators in the common case when there are heteroscedastic errors. It would also be a core function in those methods of non-linear optimisation designed to find the extreme of a likelihood function. An alternative would be to define it as forming the expression (|:x)+/ . * y since this is also often required. This would be of value if the code for it was much faster than the using the general expression above. I prefer the definition which preserves consistency with the concept of weighted values. It is a more natural extension consistent with the usage in the definitions of m. and n. proposed. By introducing a weighting function it is signficantly generalising the monadic operation. Second, and less important from a language design persepective (but perhaps as important for potential users), it is rather messy to program otherwise, and exploiting the special structure of the function could lead to important speed gains in execution. I recognise that speed of execution has not been a primary factor in your design, but it is an important consideration, especially with the data intensive calculations likely in multi-variate studies. If the functions selected contribute both to ease of statement of the problem, and fast execution then they seem to me to have a strong case for inclusion. Looking at these three functions now as a group, and endeavouring to put them into Dictionary style -- m. MEAN (_) m. is the mean value, and m. y is 1 m. y MEAN (_ _) x m. y is the weighted mean of the values of y with weights given by x. The weights will be normalised to sum to one in forming the weighted mean. (It may be desirable to restrict the dimensionality of the arguments to (1 _) though the general case is often useful.) n. NORMALIZE (_) The result of n. y is y translated to have a mean of zero and a mean squared deviation of 1. It is defined as 1 n. y . NORMALIZE (_ _) x n. y provides normalised values of y which when applied with weights x sum to zero and have mean squared deviation equal to 1. The weights x will be normalised to sum to 1. The result of x n. y is therefore (y-x m. y)% %:x m.*:y-x m. y Thus x m. x n. y is 0 and x m. *: x n. y is 1. In the case where the weights are frequencies and it is desired to standardise so that variance is 1 the values must be multiplied by %:(+/x)%(+/x)-1 (Again a restriction on the arguments to (1 _) could be considered.) s. SPREAD (2) s. y is the crossproduct matrix of the values in y and equal to (|:y)+/ . * y . Consistent with the usage of the other statistical functions, s. y is 1 s. y . The covariance matrix is just (#y)%~ s. y - m. y and the correlation matrix is (#y)%~ s. n. y SPREAD (1 2) x s. y is the crossproduct matrix s. (%:x)*"0 1 y . Note that the weights are not normalised to sum to one in this case. The dyadic form weights the outer product of each item of y by the corresponding item in x. ---------------------------------------------------------- The new definitions have been implemented in J 3.1. They are modelled thus: rx =. ($@$@[ <@}. $@]) $&> [ ry =. ($@$@] <@}. $@[) $&> ] m2 =. (rx * ry) %&(+/) rx m1 =. +/ % # dev =. ry - m. rms =. [ %:@m. *:@dev n2 =. dev % rms n1 =. 1&n2 s1 =. |: +/ . * ] s2 =. s1@(%:@[ *"0 1 ]) m. is m1 : m2 n. is n1 : n2 s. is s1 : s2 " 2 1 2 ----------------------------------------------------------------- Roger Hui Iverson Software Inc., 33 Major Street, Toronto, Ontario M5S 2K9 (416) 925 6096