# Re: [Rd] [R] computing the variance

From: Ted Harding <Ted.Harding_at_nessie.mcc.ac.uk>
Date: Mon 05 Dec 2005 - 19:25:42 GMT

My view of how to calculate a "variance" is based not directly on the "unbiased" issue, but on the following.

Suppose you define an RV X as a single value sampled from a finite population of values X1,...,XN.

The variance of X is (or damn well should be) defined as

Var(X) = E(X^2) - (E(X))^2

So this is the variance of the set of values {X1,...,XN} from that point of view. Similarly, I like to preserve the analogy by calling the "variance" of a set of sampled values {x1,...,xn} the quantity calculated by dividing by (n-1).

This of course links with "unbiased" in that when you use the "1/(n-1)" definition you do have an unbiased estimator of Var(X).
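A quick sketch of the two conventions (in Python with made-up numbers, since the thread itself is about R/S-PLUS; the `statistics` module provides both divisors directly):

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)
mean = sum(x) / n

# "1/n" (population) form, equivalently E(X^2) - (E(X))^2
var_pop = sum(v * v for v in x) / n - mean ** 2

# "1/(n-1)" form: the unbiased estimator of Var(X)
var_sample = sum((v - mean) ** 2 for v in x) / (n - 1)

# These agree with the stdlib's two definitions
assert abs(var_pop - statistics.pvariance(x)) < 1e-12
assert abs(var_sample - statistics.variance(x)) < 1e-12
```

(R's `var()` uses the 1/(n-1) form, corresponding to `statistics.variance` here.)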

And it ties in nicely with what I call "The Fundamental Formula of the Analysis of Variance" (coincidentally mentioned recently on R-help):

Var[X](X) = E[J](Var[X|J](X|J)) + Var[J](E[X|J](X|J))

for two random variables X, J (where "E[J](...)", "E[X|J](...)" mean "expectation using the distribution of J or the conditional distribution of X|J, respectively").

Now, for instance, take a set of samples of sizes n1,...,nk indexed by j=1,...,k and pick a value X at random from the N = n1+...+nk sample values. This gives rise also to a random value of J (sample index) with P(J=j) = nj/N. Now apply the FFAOV and you get the usual breakdown of the sum of squares. And so on.
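The numerical breakdown can be checked directly. A sketch (Python, with hypothetical group data): since X is drawn uniformly from the N pooled values, all variances here are of the "1/n" (population) form, and the FFAOV identity holds exactly:

```python
import statistics

# Hypothetical samples indexed by j = 1,...,k; sizes n_j need not be equal
groups = [[1.0, 3.0, 5.0], [2.0, 6.0], [4.0, 4.0, 8.0, 8.0]]
pooled = [v for g in groups for v in g]
N = len(pooled)
weights = [len(g) / N for g in groups]          # P(J = j) = n_j / N

# Left-hand side: Var(X) for X drawn uniformly from the pooled values
total = statistics.pvariance(pooled)

# E[J](Var(X|J)): weighted mean of the within-group variances
within = sum(w * statistics.pvariance(g) for w, g in zip(weights, groups))

# Var[J](E(X|J)): variance of the group means under P(J = j)
means = [statistics.fmean(g) for g in groups]
grand = sum(w * m for w, m in zip(weights, means))
between = sum(w * (m - grand) ** 2 for w, m in zip(weights, means))

# The Fundamental Formula of the Analysis of Variance
assert abs(total - (within + between)) < 1e-12
```

Multiplying through by N recovers the usual total = within + between sum-of-squares decomposition.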

All that sort of thing is too fundamental to be undermined by making more than the minimum necessary concession (i.e. convenient for those who must) to the "1/n" view of the matter.

I think the confusion (and it does exist) arises (a) from the historical notion that "variance" = the mean squared deviation from the mean (which I prefer to name by its description); together with (b) the fact that the mean, variance and such occur very early on in even the least mathematical of introductory statistics courses, and participants in these are invariably puzzled by the somewhat mysterious appearance on stage of "1/(n-1)" -- it can be easier to sweep this under the carpet by saying "it doesn't make any difference in large samples" etc.

> BTW: S+ even has the 'unbiased' argument for cor() where of course
> it really doesn't make any difference (!), and actually I think is
> rather misleading, since the sample correlation is not unbiased
> in almost all cases AFAICS.

Agreed. I wasn't aware of this -- what is the S-PLUS default? (not that it matters ... ). A simply silly distinction, and possibly a carry-over from the "confusion" described above (e.g. "Since the sample correlation is calculated as Cov(X,Y)/sqrt(Var(X)*Var(Y)), should we use the unbiased or the biased versions of the Cov and the Vars?").
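That the divisor cancels is easy to see numerically. A sketch (Python, invented data): the same factor 1/divisor appears once in the covariance and once under the square root of each variance, so it drops out of the ratio:

```python
import statistics

x = [1.0, 2.0, 4.0, 7.0]
y = [3.0, 1.0, 6.0, 8.0]
n = len(x)
mx, my = statistics.fmean(x), statistics.fmean(y)

def corr(divisor):
    # divisor is n ("biased") or n - 1 ("unbiased"); it cancels in the ratio
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / divisor
    vx = sum((a - mx) ** 2 for a in x) / divisor
    vy = sum((b - my) ** 2 for b in y) / divisor
    return cov / (vx * vy) ** 0.5

assert abs(corr(n) - corr(n - 1)) < 1e-12
```

So an 'unbiased' argument to cor() cannot change the answer, quite apart from the sample correlation not being an unbiased estimator in the first place.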

Hmmm.
Best wishes,
Ted.

E-Mail: (Ted Harding) <Ted.Harding@nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861
Date: 05-Dec-05                                       Time: 19:19:20
------------------------------ XFMail ------------------------------

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Tue Dec 06 06:30:07 2005

This archive was generated by hypermail 2.1.8 : Mon 20 Feb 2006 - 03:21:34 GMT