
Re: [Phys-L] Model Selection Statistics



On 4/15/19 8:16 AM, Paul Nord wrote:

> In fact, one should probably try to reproduce
> these spectra with Monte Carlo methods.

Agreed! Yes!

In particular, if you have two models, check all four possibilities:
-- Use the two models to generate MC data. (One set
apiece, i.e. two datasets.)
-- Test each model by fitting against each dataset. (Four
tests.) This will let you see whether data from one model
can masquerade as data from the other model.
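
Here is a minimal sketch of that four-way check in Python. The
two models, the parameter values, the noise level, and the number
of points are all placeholders made up for illustration:

import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)       # assumed: 50 data points
sigma = 0.5                          # assumed noise level

def model_a(x, m, b):                # hypothetical model A: straight line
    return m * x + b

def model_b(x, a, c):                # hypothetical model B: sqrt plus offset
    return a * np.sqrt(x) + c

# Step 1: use each model to generate one Monte Carlo dataset.
data = {"A-data": model_a(x, 1.2, 5.7) + rng.normal(0, sigma, x.size),
        "B-data": model_b(x, 4.0, 2.0) + rng.normal(0, sigma, x.size)}

# Step 2: fit each model against each dataset -- four tests in all.
for dname, y in data.items():
    for fname, f in [("fit A", model_a), ("fit B", model_b)]:
        popt, _ = curve_fit(f, x, y)
        chi2 = np.sum(((y - f(x, *popt)) / sigma) ** 2)
        print(f"{dname} vs {fname}: chi2/dof = {chi2 / (x.size - 2):.2f}")

If model B fits the A-data about as well as model A does, the
two models cannot be told apart at this sample size and noise level.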

The answers will depend strongly on how many data points
there are, and on how noisy the data are.

Insofar as these consistency checks can be performed without
relying on the actual laboratory data, they don't run the
risk of p-hacking disease.

> But even that has a lot of pitfalls because one is really
> tuning the Monte Carlo to fit the data. And we say that it
> agrees when it agrees because it agrees.

That sounds right, but it's somewhat vague. Let me lay out
a numerical example that shows how I understand such things;
if this is off the mark please clarify: Suppose there is a
two-parameter model, y = m x + b, with adjustable parameters
m and b. Using plain old high-school fitting techniques,
we find m=1.234 and b=5.67. So far so good.

Now, p-hacking disease arises if you use "model selection" to
fit to the *ONE*-parameter model y = m x + 5.67, where that
model was "selected" from a large set of similar models. I
mention this to make the point that model selection is just
a continuation of tuning (aka fitting) by other means.
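
To make the example concrete, here's a minimal sketch; the
synthetic data and the noise level are made-up assumptions:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 20)
y = 1.234 * x + 5.67 + rng.normal(0.0, 0.5, x.size)  # assumed synthetic data

# Plain high-school fit of the two-parameter model y = m*x + b.
m2, b2 = np.polyfit(x, y, 1)
print(f"two-parameter fit: m = {m2:.3f}, b = {b2:.3f}")

# The suspect move: "select" the one-parameter model y = m*x + 5.67,
# where the frozen intercept was itself obtained from the data above.
m1 = np.sum((y - 5.67) * x) / np.sum(x * x)   # least squares, b held fixed
print(f"one-parameter fit: m = {m1:.3f}  (intercept was tuned beforehand)")

The one-parameter fit looks better constrained than it really is,
because the "frozen" intercept was itself tuned to the data.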

This can be quantified in terms of entropy and information.
If you have only two models to choose from, that contributes
only one bit of entropy to the difficulty of the problem,
which is probably not noticeable unless the number of data
points is very small. Conversely, if you start rummaging
through some enormous number (N) of models, that contributes
something like log(N) to the entropy and could significantly
drive up the amount of data (i.e. information) required to
obtain meaningful results.
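
As back-of-the-envelope arithmetic (the model counts here are
chosen arbitrarily):

import math

# Entropy cost of selecting among N candidate models, in bits.
for N in [2, 1000, 10**6]:
    print(f"N = {N:>9} models -> log2(N) = {math.log2(N):5.1f} bits")

One bit disappears into the noise; twenty bits is a real tax on
the information the data must supply.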

Also beware that the high-school notion of counting the
number of parameters is not reliable. Given a sufficiently
tricky model, you can get one parameter to do the work of two;
recall that Cantor showed you can map a 2D space onto a 1D
space. Messy, but doable. For that matter, you can use
a single long decimal number to encode a bit-string, and
interpret it as a program for a Turing-complete computer,
and produce absolutely any computable result from a single
parameter!
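
Here is a minimal sketch of that trick -- my own toy construction,
nothing rigorous -- using digit interleaving to make one parameter
carry two:

# Making one parameter do the work of two, by interleaving
# decimal digits. The "parameters" here are fixed-precision
# fractions in [0, 1), represented as digit strings.

def interleave(a: str, b: str) -> str:
    """Merge two equal-length digit strings into one."""
    return "".join(x + y for x, y in zip(a, b))

def deinterleave(c: str) -> tuple[str, str]:
    """Recover the two digit strings from the merged one."""
    return c[0::2], c[1::2]

m_digits = "1234"          # stands in for m = 0.1234
b_digits = "5670"          # stands in for b = 0.5670
theta = interleave(m_digits, b_digits)
print(theta)               # '15263740' -- one "parameter" encoding both
print(deinterleave(theta)) # ('1234', '5670')

The single "parameter" theta carries exactly as many digits as
m and b combined, which foreshadows the digit-counting heuristic
discussed below.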

This can be quantified in terms of the Vapnik-Chervonenkis
dimensionality. A simple polynomial with N parameters has
a VC dimensionality of N, which agrees with high-school
intuition ... but there are other things that look almost
as simple, such as a sine function with one parameter, that
have /infinite/ VC dimensionality. So a high-frequency
sine wave is, in principle, just as badly unconstrained as
a computer program would be.
https://www.av8n.com/physics/data-analysis.htm#sec-vc-dimension
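
The construction behind that claim is standard (the points
x_j = 2^-j and the formula for the frequency are the usual
textbook choices, not mine); a few lines of Python verify it:

import math

# For points x_j = 2**-j, ANY labeling y_j in {0,1} is realized
# by sign(sin(w*x)) with w = pi * (1 + sum (1 - y_i) * 2**i).
labels = [1, 0, 0, 1, 0, 1, 1, 0]          # arbitrary target labels
K = 1 + sum((1 - y) * 2 ** (i + 1) for i, y in enumerate(labels))

for j, y in enumerate(labels, start=1):
    phase = (K * 2.0 ** -j) % 2.0          # w*x_j / pi, reduced mod 2
    predicted = 1 if math.sin(math.pi * phase) > 0 else 0
    print(j, y, predicted)                 # predicted matches y every time

Any labeling whatsoever is reproduced by a single frequency
parameter; n points cost about n bits' worth of digits in that
frequency, consistent with the digit-counting rule below.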

As an easy-to-understand approximation to this idea: rather
than just counting the number of parameters, count the number
of digits needed to specify the parameters. If one or more
parameters require a huge number of finicky digits, watch out.

> This is an attempt to find the best value for a physical
> parameter based on spectra from a particle physics detector.

OK.

> Actually, it's not about prediction.

I'm surprised, and not 100% convinced. Are we perhaps using
the word "prediction" in different ways? In particular, what
are you going to do with the fitted parameter? If you publish
it, what are your readers going to do with it? AFAICT
real-world science is all about making predictions. If the parameter
is not going to be used for something, what's the point?

This idea sometimes gets lost in cookbook classroom
experiments, where the only goal is to get a number
that agrees with the expected number, but IMHO that
is not good science and not good pedagogy.