
Re: [Phys-l] sample variance versus population variance



On 10/09/2006 07:21 PM, Krishna Chowdary wrote:

Thanks for your response, John. I don't think that's the question I was
asking. I tried to ask my question more explicitly in the rest of the
message you refer to, and tried to define my notation. Please let me know
what I might need to clarify or restate.

My question is not about sample variance vs. population variance. In a
nutshell, it is about whether the standard deviation or the standard
deviation of the mean (of a data set consisting of time measurements) should
be used when propagating uncertainty through a calculation.

Well, I still think that the "nutshell" question is isomorphic to the
sample_variance versus population_variance question. Rereading the
previous message tells me the same thing.

At some very abstract level, asking whether you should propagate the
standard deviation (i.e. sqrt of population variance) or propagate
the standard deviation of the mean (i.e. sqrt of sample variance)
is sorta like asking whether you should buy cat food or dog food.
It all depends. If you have a cat, buy cat food. If you have a
dog, buy dog food.

At the not-so-abstract level, I'm pretty sure that for present
purposes you want to model the population, not the sample. You
can decide for yourself, based on the following simple test:
Suppose you make only one measurement, i.e. the sample size is
N=1. This necessarily means that the sample variance is zero.
Every element of the sample-set sits exactly at the sample mean.
a) Do you really want to report the time as "t ± 0"? That correctly
describes the sample, but is highly misleading as to the population.
b) If you want to describe the population instead, you will need
to divide by sqrt(N-1) rather than sqrt(N), which leads to
reporting the time as "t ± 0/0" in which the uncertainty is an
indeterminate form, correctly reflecting the impossibility of
estimating the population variance from a single sample.
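
The N=1 test above can be tried numerically. What follows is only a minimal sketch: the function name spread_about_mean, the ddof argument, and the single reading of 2.37 seconds are all made up for illustration. It computes the root-mean-square deviation from the sample mean with either divisor, and returns NaN where the arithmetic would be 0/0.

```python
import math

def spread_about_mean(data, ddof):
    """RMS deviation from the sample mean, dividing by (N - ddof).

    ddof=0 divides by N (describes the sample itself);
    ddof=1 divides by N-1 (estimates the population).
    Returns NaN when the divisor is zero, i.e. the 0/0 case.
    """
    n = len(data)
    mean = sum(data) / n
    sum_sq = sum((x - mean) ** 2 for x in data)
    divisor = n - ddof
    if divisor == 0:
        return math.nan  # indeterminate: nothing can be estimated
    return math.sqrt(sum_sq / divisor)

single = [2.37]  # hypothetical single time measurement, in seconds
print(spread_about_mean(single, ddof=0))  # case (a): "t ± 0"
print(spread_about_mean(single, ddof=1))  # case (b): indeterminate
```

With a single reading, the ddof=0 call gives zero and the ddof=1 call gives NaN, which matches cases (a) and (b) respectively.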

If you decide that modeling the population is what you want, then
sqrt(N-1) is the way to go, for all N >= 1.

For N on the order of 50, the ratio of sqrt(N) to sqrt(N-1) is about
1.01, i.e. a difference of roughly 1%, which is not terribly
significant. That is why you've never been forced to face the issue
before.
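
To see how quickly the two divisors converge, here is a short sketch that tabulates sqrt(N)/sqrt(N-1) for a few sample sizes. The helper name divisor_ratio and the particular values of N are chosen only for illustration.

```python
import math

def divisor_ratio(n):
    """Ratio sqrt(N) / sqrt(N-1): the factor by which the two
    conventions for the standard deviation differ. Requires n >= 2."""
    return math.sqrt(n / (n - 1))

for n in (2, 5, 10, 50, 1000):
    ratio = divisor_ratio(n)
    print(f"N={n:5d}  ratio={ratio:.4f}  difference={100 * (ratio - 1):.2f}%")
```

For small N the choice of divisor matters a great deal (a factor of sqrt(2) at N=2), while at N=50 it is down to about one percent.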

In general, in statistics, you can get away with a lot of dirty tricks
when N is huge compared to the number of adjustable parameters, such
as when N=50 and there is only one adjustable parameter (the sample
mean). In contrast, statistics can be quite interesting and quite
challenging when there are huge numbers of adjustable parameters,
and samples don't grow on trees.

Note for nitpickers: I am quite aware that for nonlinear systems,
the familiar notion of counting adjustable parameters must
be replaced by more sophisticated notions, such as entropy
and/or Vapnik-Chervonenkis dimensionality.