
Re: [Phys-L] Question on data analysis



On 05/21/2015 01:50 AM, Savinainen Antti wrote:

> what could be a benefit for the forcing of the trendline through the
> origin?

OK, good question. Here's the short answer:

Have you ever taken a simple average? In particular,
suppose we want to know the density of some substance.
We measure it three times, and take the mean.

That is a /one parameter/ fit. There is only one
number that we ascertain from the data, namely the
average.

Now it turns out that averaging the density is
entirely equivalent to plotting the mass versus
volume and doing a /one parameter/ fit, that is,
fitting a straight line through the origin.
There is only one number that we ascertain from
the data, namely the slope, which is the density.

One must take care to perform a properly
/weighted/ fit, but that is always required
(even for a simple average) and should go
without saying.
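
To make the equivalence concrete, here is a minimal sketch in
Python. The numbers are made up for illustration. It assumes that
each individual density value has the same uncertainty, which means
the uncertainty on the mass is proportional to the volume, so the
weights for the straight-line fit are 1/V^2. Under that assumption
the weighted slope through the origin is the same number as the
simple average of the densities. Different assumptions about the
uncertainties would call for different weights, in the average as
well as in the fit.

```python
import numpy as np

# Hypothetical measurements (made-up numbers, for illustration only).
volume = np.array([1.0, 2.0, 4.0])   # cm^3
mass = np.array([2.7, 5.3, 10.9])    # g

# Simple average of the three individual densities.
mean_density = np.mean(mass / volume)

# Weighted least-squares line through the origin, mass = slope * volume.
# Assumption: sigma_mass is proportional to volume, so weight = 1 / volume**2.
weights = 1.0 / volume**2
slope = np.sum(weights * volume * mass) / np.sum(weights * volume**2)

# The two one-parameter estimates agree (up to floating-point rounding).
print(mean_density, slope)
```

With these weights the slope formula reduces algebraically to
sum(mass/volume)/N, which is why the two numbers coincide.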

This is discussed in more detail, with pictures,
at
https://www.av8n.com/physics/linear-least-squares.htm#sec-density
or equivalently (with less security)
http://www.av8n.com/physics/linear-least-squares.htm#sec-density

See also the longer answer that I gave earlier today.

On 05/21/2015 09:26 AM, Dan Beeker wrote:

> Why would you force it to go through zero?

That's a good question. I know it was intended as a rhetorical
question, but I'm going to answer it anyway.

> One should always honor their data.

Love, honor, cherish, respect.... but that does *NOT*
include overfitting.

There is no such thing as data analysis based solely on
statistics, based solely on the data. It is formally
provably necessary to start with what you know about
the physics of the situation, and then add statistics.

> Why would you force it to go through zero?

Good question.

> That's backwards.

Wrong answer.

> If you have a model that predicts your data should go
> through zero, then one should look for reasons why it doesn't.

In a density measurement, the mass versus volume curve goes
through zero. There is no reason why it shouldn't. This
is only one example among many. Infinitely many.

Of course, there are also infinitely many examples of
the other kind, where you need more than one fitting
parameter.

There's a lot more that could be said about this, but the
bottom line is that more parameters is *not* always better.
Overfitting is a bad thing. The bias/variance tradeoff is
a tradeoff, and dogmatically favoring one extreme or the
other is a guaranteed losing strategy.
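
As a rough illustration of the variance side of that tradeoff, here
is a sketch using the textbook formulas for the variance of a
least-squares slope. The volumes and the uniform mass uncertainty
sigma are hypothetical values chosen for the example. It compares
the one-parameter model (line through the origin) with the
two-parameter model (slope plus free intercept).

```python
import numpy as np

# Hypothetical volumes and a hypothetical uniform uncertainty on each mass.
volume = np.array([1.0, 2.0, 4.0])   # cm^3
sigma = 0.1                          # g

# One-parameter model, mass = slope * volume:
#   var(slope) = sigma^2 / sum(V^2)
var_slope_origin = sigma**2 / np.sum(volume**2)

# Two-parameter model, mass = intercept + slope * volume:
#   var(slope) = sigma^2 / sum((V - mean(V))^2)
var_slope_free = sigma**2 / np.sum((volume - volume.mean())**2)

# Ratio of the variances; it is >= 1 for any choice of volumes.
variance_ratio = var_slope_free / var_slope_origin
print(variance_ratio)
```

Since sum((V - mean(V))^2) can never exceed sum(V^2), the extra
parameter always costs something in the uncertainty of the slope.
This sketch shows only the variance half. If the true intercept is
not zero, the one-parameter fit is biased, which is the other half
of the tradeoff.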

> Otherwise, why bother acquiring the data in the first place?

Maybe because we wanted to know the density.