
Re: [Phys-L] Model Selection Statistics



On 4/12/19 8:58 AM, Paul Nord asked:

Suppose I have some physics data binned into 80 bins. And I have two
models which propose to fit some characteristic of interest and a
non-linear background. One is a simple polynomial. The other is
more complicated. I can calculate a reduced chi-square for each
data fit.

What's the best way to compare the two models?

It's complicated. Tremendously complicated. I can pretty
much guarantee that all the concepts and all the techniques
you were taught in school are seriously deficient.

Three of the smartest guys I know were just awarded the Turing
Award for work on this topic.
https://www.wired.com/story/godfathers-ai-boom-win-computings-highest-honor/
https://www.washingtonpost.com/technology/2019/03/27/artificial-intelligence-pioneers-win-turing-award/
https://www.vox.com/future-perfect/2019/4/4/18294978/ai-turing-award-neural-networks

Here is a stub of a discussion, with some examples that serve
as a warning of what can go wrong. By itself, this is not
very constructive, but it may serve to motivate learning the
advanced techniques:
https://www.av8n.com/physics/data-analysis.htm

Here is a relatively gentle introduction, showing that under
favorable conditions the question *does* have a provably
correct answer:
http://web.cs.iastate.edu/~honavar/pac.pdf


A great deal depends on how much data you have. Note the
contrast:
-- In the real world, data is often expensive and hard to
come by, and truly professional-grade techniques use the
data very efficiently.
-- Sometimes if you are lucky, and/or in a contrived
classroom situation, there is abundant data, and you
can avail yourself of helpful shortcuts.

NOTE: This discussion assumes the purpose of modeling is to
*predict* future data. (If this is not your purpose, please
explain in detail.)

So, if you have tons of data:
Use half of it as the /training set/ (to tune the adjustable
parameters of the models) and then use the other half as the
/testing set/ (to see how well they predict). That gives you
a pretty good handle on what's a good model and what's not
(assuming the testing data is /representative/ of the as-yet-
unseen future data).
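
For instance, here is a minimal sketch of that half-and-half
split, in Python. The data, the uncertainties, and the two
candidate model functions (poly_model and fancy_model) are
made-up stand-ins, not anything from Paul's actual analysis;
the point is only the bookkeeping: fit on one half of the
data, score on the other half.

import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

def poly_model(x, a, b, c):
    # simple polynomial candidate
    return a + b*x + c*x**2

def fancy_model(x, a, b, c, x0, w):
    # "more complicated" candidate: linear background plus a peak
    return a + b*x + c*np.exp(-(x - x0)**2 / (2*w**2))

def chi2(model, params, x, y, sigma):
    # chi-square of a fitted model on a given subset of the data
    resid = (y - model(x, *params)) / sigma
    return np.sum(resid**2)

# Fabricated data (80 bins) so the sketch runs end to end.
x = np.linspace(0.0, 10.0, 80)
y = (2.0 + 0.3*x + 3.0*np.exp(-(x - 5.0)**2 / 2.0)
     + rng.normal(0.0, 0.5, x.size))
sigma = np.full_like(x, 0.5)

# Random half/half split: training set and testing set.
idx = rng.permutation(x.size)
train, test = idx[:40], idx[40:]

for name, model, p0 in [("poly",  poly_model,  (1.0, 1.0, 1.0)),
                        ("fancy", fancy_model, (1.0, 0.5, 1.0, 5.0, 1.0))]:
    # tune the adjustable parameters on the training half only
    popt, _ = curve_fit(model, x[train], y[train], p0=p0,
                        sigma=sigma[train], absolute_sigma=True)
    # score the tuned model on the testing half it has never seen
    print(name, "testing chi^2 =",
          chi2(model, popt, x[test], y[test], sigma[test]))

The model with the smaller testing chi-square is the one that
predicts better, which is the criterion that matters here.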

If you have a huge number of models to choose from, you can
train them using some data and then choose among them using
more data. This requires three data sets: a training set,
a choosing set, and then a testing set. Let's be clear:
choosing may look like testing, but really it is just a
continuation of training by other means; treating the
choosing-set score as if it were a test result is p-hacking.
To avoid fooling yourself, you need another independent
data set to validate the chosen model.
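
Again a minimal sketch, with fabricated data and a stand-in
family of models (polynomials of increasing degree); nothing
here is from Paul's analysis. The point is the bookkeeping:
each subset is looked at for one purpose only, and only the
testing-set score is quoted at the end.

import numpy as np

rng = np.random.default_rng(1)

# Fabricated data so the sketch runs end to end.
x = np.linspace(0.0, 1.0, 300)
y = np.sin(2*np.pi*x) + rng.normal(0.0, 0.3, x.size)

# Three disjoint subsets: training, choosing, testing.
idx = rng.permutation(x.size)
train, choose, test = idx[:100], idx[100:200], idx[200:]

def mse(coeffs, xs, ys):
    # mean squared prediction error of a fitted polynomial
    return np.mean((np.polyval(coeffs, xs) - ys)**2)

# Train every candidate (here, polynomial degrees 0..9) on the
# training set only.
candidates = {deg: np.polyfit(x[train], y[train], deg)
              for deg in range(10)}

# Choose among them using the choosing set.  This is still a
# form of training, so this score must not be quoted as the
# test result.
best = min(candidates,
           key=lambda d: mse(candidates[d], x[choose], y[choose]))

# Only the untouched testing set gives an honest estimate of
# how well the chosen model will predict future data.
print("chosen degree:", best,
      " testing MSE:", mse(candidates[best], x[test], y[test]))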

As Yann likes to say:
données utilisées = données usées
that is,
used data = used-up data

In other words, you can do whatever you want with the
training data, but don't try to re-use testing data, lest
you fall into the p-hacking trap.

Bottom line: there are relatively straightforward ways of
not fooling yourself, if you have abundant data.

If data is scarce, more advanced techniques are required.