
[Phys-L] Re: Average earlier or average later?



ludwik kowalski wrote:
1) Suppose the true relation is strongly nonlinear, such as y=A*x^10.
2) Suppose the distribution of experimentally measured x is
Gaussian (due to random errors of measurements).
3) The distribution of the corresponding y values (calculated from
individual x) will not be Gaussian; it will be skewed.
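Ludwik's point is easy to check numerically. Here is a minimal sketch, using made-up numbers (A = 1, true x = 2, noise sigma = 0.1 — none of these come from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers for illustration only.
A, x_true, sigma = 1.0, 2.0, 0.1
x = rng.normal(x_true, sigma, size=100_000)   # Gaussian-distributed measurements
y = A * x**10                                 # strongly nonlinear mapping

# The y distribution is skewed: its mean sits above the "true" y,
# while the median (which survives monotone transformations) does not.
y_true = A * x_true**10
print(y_true)         # 1024.0
print(np.mean(y))     # noticeably larger than 1024
print(np.median(y))   # much closer to 1024
```

The mean of y overshoots because the upward tail of x^10 is fatter than the downward tail; the median is immune because x -> x^10 is monotone here.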

That's asking the right sort of question. It is important to
consider scenarios of that sort.

For that reason averaging at the level of what is experimentally
measured seems to be preferable.

Beware: it is easier to prove that late averaging does not make
sense, and relatively hard to prove that early averaging does make
sense. It may be that neither makes sense.

If I change Ludwik's scenario just a little bit, with a different
nonlinearity and a slightly different noise model, averaging the
raw data is a disaster, as can be seen in the following example.

Suppose you have some gas in a pressure cell. The amount of gas
is decaying exponentially as a function of time. You observe the
decay using a pressure transducer. Alas, the pressure is very, very
small, so there is quite a bit of noise in each pressure reading.
The raw data looks like this:
http://www.av8n.com/physics/img48/noisy-decay-raw.png
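A simulation along these lines reproduces the qualitative behavior. The parameters here are invented for illustration, not the ones behind the linked plots:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented parameters: p(t) = p0 * exp(-t/tau) plus additive Gaussian noise.
p0, tau, noise = 0.1, 100.0, 0.03
t = np.arange(400.0)                          # 400 samples, as in the text
p = p0 * np.exp(-t / tau) + rng.normal(0.0, noise, size=t.size)

# Because the signal decays toward the noise floor, many late-time
# readings come out negative.
print((p < 0).sum())
```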

The naive student would like to reduce this to a problem previously
solved. Since the expected functional form is exponential, he gets
the bright idea of plotting the data on semi-log paper, and fitting
a straight line to that representation of the data. When he tries
it, the plot looks like this:
http://www.av8n.com/physics/img48/noisy-decay-log.png

Alas, this has all sorts of problems. For starters, because of the
noise in the raw data, he is required to take the logarithm of some
numbers that are very small or even negative. It's hard to plot
imaginary numbers on a plot like this ... but I didn't want to just
throw away any of the data, so clustered in the lower-right corner
are a bunch of downward-pointing triangles. These are stand-ins for
some points that couldn't be properly plotted, either because they
were off-scale, or because they were outright imaginary.

There is no sane way to fit a straight line to data like this. If
you disregard the off-scale data, the fit will have systematic errors,
and if you try to account for the off-scale data, it is insanely
laborious. For what it's worth, the line corresponding to the
right answer is shown here:
http://www.av8n.com/physics/img48/noisy-decay-log-line.png
which doesn't really "look" like a good fit ... which just goes
to show that this whole approach is a lose/lose proposition.
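The breakdown is easy to reproduce. In the same invented simulation (parameters are mine, not from the plots), the logarithm simply fails on the negative readings:

```python
import numpy as np

rng = np.random.default_rng(1)

# Same invented noisy-decay simulation as before.
p0, tau, noise = 0.1, 100.0, 0.03
t = np.arange(400.0)
p = p0 * np.exp(-t / tau) + rng.normal(0.0, noise, size=t.size)

with np.errstate(invalid="ignore", divide="ignore"):
    log_p = np.log(p)        # NaN wherever the reading is negative

bad = np.isnan(log_p).sum()
print(bad)                   # these points cannot appear on semi-log axes
```

Every NaN is a data point the semi-log plot silently drops, and dropping them is exactly what biases the straight-line fit.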

Next our hero tries a naive version of "early averaging" that is
partially in accord with Ludwik's recommendation. Since there are
400 data points, he groups them into four groups. He produces
four cooked data points, each of which is the average of 100 raw
data points. Even better, he can take a moving average (boxcar
average) to produce a larger number of cooked data points, each
of which is the average of 100 raw data points. (There will now
be some correlation between the cooked data points, but that's
not fatal.) The data looks like this:
http://www.av8n.com/physics/img48/noisy-decay-early-avg.png
where I have shown you something the student doesn't know, namely
the right answer (the black curve). You can see that the data
points are systematically high, especially at early times. That's
because averaging is tantamount to fitting a straight line to the
data ... but alas the data is not supposed to be straight. The
data has upward curvature everywhere, especially at early times,
and therefore any straight line fitted to the data will (at the
midpoint of the line) lie above the data. The error is substantial;
the standard deviation of the cooked points is about 0.01, and at
early times they are high by more than 0.03 units.
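The bias needs no noise at all; it follows purely from the curvature of the exponential (Jensen's inequality). A quick check, using the same invented time constant and a 100-point boxcar:

```python
import numpy as np

# Boxcar-averaging a noiseless convex curve already biases it high.
tau, window = 100.0, 100
t = np.arange(400.0)
y = np.exp(-t / tau)                          # noiseless exponential

kernel = np.ones(window) / window
y_avg = np.convolve(y, kernel, mode="valid")  # 100-point moving average
t_mid = t[: y_avg.size] + (window - 1) / 2.0  # midpoint of each window

bias = y_avg - np.exp(-t_mid / tau)
print(bias.min())   # strictly positive: every averaged point sits high
```

The bias is largest where the curvature is largest, i.e. at early times, matching what the plot shows.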

This systematic error is the disaster I advertised in the introduction.
It is totally unacceptable.

Ah, you might say he's just doing the wrong sort of averaging.
He should just re-take the data over and over again, so that he
has multiple ordinates for each abscissa along the time axis.
Well, that's more easily said than done. Imagine that it takes
four months and four million dollars for one data-taking run.
It is important to do the best possible job of analyzing *this*
data, and wishing for more data is not practical.

Anyway, finally, here is one good way of analyzing the data (not
necessarily the only good way). First I fit an exponential to
the raw data. In this case, the widely used "nonlinear least
squares" fitting procedure is good enough (although beware, there
are innumerable cases where people use it even though it isn't
appropriate). Then I calculate the residuals, as shown by the
blue symbols in
http://www.av8n.com/physics/img48/noisy-decay-residuals.png

Since the residuals are not "supposed" to have any curvature, and
in any case they have vastly less curvature than the raw data,
I can average them. I can then plot the cooked residuals by
themselves, or attach them to the fitted curve like this:
http://www.av8n.com/physics/img48/noisy-decay-cooked-residuals.png

It is important to make this plot, and scrutinize it, to see if
there is anything systematically wrong with the fit.
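The recipe above can be sketched as follows, using scipy's nonlinear least squares on the same invented synthetic data (the parameters, starting guess, and window size are my assumptions, not taken from the plots):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)

# Invented synthetic data standing in for the plots in the text.
p0_true, tau_true, noise = 0.1, 100.0, 0.03
t = np.arange(400.0)
p = p0_true * np.exp(-t / tau_true) + rng.normal(0.0, noise, size=t.size)

def model(t, p0, tau):
    return p0 * np.exp(-t / tau)

# Step 1: nonlinear least squares on the *raw* data -- no logs, no
# pre-averaging, so negative readings cause no trouble at all.
popt, _ = curve_fit(model, t, p, p0=(0.05, 50.0))

# Step 2: residuals have (ideally) no curvature, so averaging them is safe.
resid = p - model(t, *popt)
kernel = np.ones(100) / 100
cooked = np.convolve(resid, kernel, mode="valid")   # boxcar-averaged residuals

# Step 3: scrutinize the cooked residuals for systematic structure.
print(popt)                  # recovered (p0, tau), close to the true values
print(np.abs(cooked).max())  # roughly noise/sqrt(100) if the fit is good
```

Note the order of operations: fit first, average second. The averaging happens on the residuals, which are (nearly) curvature-free, so the Jensen-type bias that wrecked the early-averaging attempt never arises.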

=================

Bottom line: As I said earlier: Averaging is a good way of estimating
the mean. Sometimes that's exactly what you want ... but sometimes
it isn't. It's hard to come up with a concise statement of when
averaging makes sense and when it doesn't ... I strongly suspect that
any concise statement would be unreliable.

Data analysis is hard. In general it requires having a good mental
model of what the relevant interesting physics is doing, *plus* a
good mental model of what the noise is doing.