
Re: [Phys-L] Question on data analysis



On 05/21/2015 01:50 AM, Savinainen Antti wrote:
> a non-zero y-axis intercept may have a physical interpretation or it
> may indicate systematic error in measurements or (more realistically)
> may be a combination of both.

That's all true.

Also note that the right answer has to model the observed
data, including the noise and systematic errors, not just
the underlying ideal noiseless distribution.

> Given these, what could be the benefit of forcing the trendline
> through the origin?

It's a tradeoff. There are always advantages and
disadvantages. Sometimes the net effect of adding
a parameter is beneficial, and sometimes not.

I hate to give a complicated answer to a seemingly
simple question ... but really this is not simple.
The question that was asked is the tip of an iceberg.
If you look at the whole iceberg, you wind up looking
at all of science and engineering.

If we force the fit to go through zero we are modeling
the data with a relatively simple model. If we allow
a nonzero intercept, that's a slightly more complicated
model.

Deciding how many contributions to include in the model
... and deciding /which/ contributions to include ...
is utterly nontrivial. Such questions come up in data
analysis, statistical inference, machine learning,
communication theory, etc. etc. etc.

Suppose we start with a model that is not a straight
line but rather a more general polynomial:

y(x) = a0 + a1 x + a2 x^2 + a3 x^3 [1]

Forcing the model to go through zero is equivalent
to locking a0 to zero. Sometimes that makes sense.
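
Here is a minimal sketch of what that means in practice
(plain numpy; the data and variable names are mine, made
up for illustration). Locking a0 to zero just means leaving
the column of ones out of the least-squares design matrix:

  import numpy as np

  rng = np.random.default_rng(0)
  x = np.linspace(0.0, 10.0, 20)
  y = 2.0 * x + 0.5 + rng.normal(scale=0.3, size=x.size)  # true intercept 0.5

  # Locked through the origin: design matrix has only the x column.
  (a1_locked,), *_ = np.linalg.lstsq(x[:, None], y, rcond=None)

  # Free intercept: add a column of ones to carry a0.
  A = np.column_stack([np.ones_like(x), x])
  (a0, a1_free), *_ = np.linalg.lstsq(A, y, rcond=None)

  print(f"locked: y = {a1_locked:.3f} x")
  print(f"free:   y = {a0:.3f} + {a1_free:.3f} x")

With the intercept locked out, the fitted slope has to
absorb whatever offset is actually present in the data.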

If you know a priori that the right answer is linear,
then you should lock out a2 and all higher coefficients.

If you know a priori that the right answer is an odd
function, then you should lock down all the even-
numbered coefficients, including both a0 and a2.
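
Continuing the same sketch (again with invented data): if
you know the function is odd, the even columns simply never
appear in the design matrix.

  import numpy as np

  rng = np.random.default_rng(1)
  x = np.linspace(-2.0, 2.0, 25)
  y = 1.5 * x - 0.4 * x**3 + rng.normal(scale=0.05, size=x.size)

  # Odd model: columns for a1 and a3 only; a0 and a2 are locked to zero.
  A_odd = np.column_stack([x, x**3])
  (a1, a3), *_ = np.linalg.lstsq(A_odd, y, rcond=None)
  print(f"fit: y = {a1:.3f} x + {a3:.3f} x^3")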

On the other hand, sometimes you really need to include
all the terms in equation [1] ... or perhaps construct
an even more complicated model.

Similar considerations apply when y(x) is a Fourier
series. For example, if you know the right answer is
an odd function, you should lock all the cosine
coefficients to zero.
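
The same idea in sketch form (numpy again, invented data):
an odd function fitted with sine terms only, so every cosine
coefficient is locked to zero by construction.

  import numpy as np

  rng = np.random.default_rng(2)
  t = np.linspace(-np.pi, np.pi, 50)
  y = np.sin(t) + 0.3 * np.sin(3 * t) + rng.normal(scale=0.05, size=t.size)

  # Sine-only design matrix: omitting the cosine columns locks
  # all cosine coefficients to zero.
  A = np.column_stack([np.sin(k * t) for k in range(1, 4)])
  coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
  print("sine coefficients:", np.round(coeffs, 3))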

So the real question is not just how many contributions to
include, but /which/ contributions to include.

Answering such questions requires making a _bias/variance tradeoff_.
In particular, note the contrast:
*) In the limit where the data is not very noisy and
you have tons and tons of training data points to be
fitted, you can afford to have a large-ish number of
fittable parameters. This is the win/win scenario:
low bias *and* low variance.
*) If there are only a few nasty noisy training data
points, you need to be super-careful.
-- Using too few parameters results in bias.
Forcing the fit to go through zero when it
shouldn't is an example of bias.
-- Using too many parameters results in variance.
That is, the results of the fit become insanely
sensitive to noise.
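
Here is a toy numerical illustration of that contrast (my
construction, with invented data): refit the same underlying
line many times with fresh noise and only eight points, once
with a slope-only model (biased, since the true intercept is
nonzero) and once with a degree-6 polynomial (high variance).

  import numpy as np

  rng = np.random.default_rng(3)
  x = np.linspace(0.1, 1.0, 8)            # only a few data points
  slopes_locked, slopes_poly = [], []
  for _ in range(200):
      y = 2.0 * x + 0.5 + rng.normal(scale=0.2, size=x.size)
      # Too few parameters: slope only, intercept forced to zero.
      s, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
      slopes_locked.append(s[0])
      # Too many parameters: degree-6 polynomial, 7 coefficients.
      c = np.polyfit(x, y, 6)
      slopes_poly.append(c[-2])           # coefficient of the x term

  print(f"slope, locked through zero: mean {np.mean(slopes_locked):.2f},"
        f" std {np.std(slopes_locked):.3f}")
  print(f"slope, degree-6 poly:       mean {np.mean(slopes_poly):.2f},"
        f" std {np.std(slopes_poly):.3f}")

The locked fit comes back with a consistently wrong slope
(bias); the over-parameterized fit comes back with a wildly
different slope on every draw (variance).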

This is discussed in more detail, with examples and
diagrams, at
https://www.av8n.com/physics/data-analysis.htm#sec-bias-variance
or equivalently (with less security)
http://www.av8n.com/physics/data-analysis.htm#sec-bias-variance