
Re: [Phys-L] Model Selection Statistics



John,

Actually, it's not about prediction. This is an attempt to find the best
value for a physical parameter based on spectra from a particle physics
detector. The histogram shows a clear peak. But the background changes
with beam energy. The underlying physical phenomena that produce the
background are complex. In fact, one should probably try to reproduce
these spectra with Monte Carlo methods. Then the background events and the
"real" events would be easily tagged and we would know how to model the
background. But even that has a lot of pitfalls because one is really
tuning the Monte Carlo to fit the data. And we say that it agrees when it
agrees because it agrees. (Probably approximately correct, as they say in
AI.)

Let's try a less complex question and see if the answer is less complex.
Suppose I have a spectrum with a peak, and I'd like to find the area under
the peak and the value of the center channel. It is clear that the peak
sits on some background. How would I choose between a linear background
and a background described by a polynomial with N terms?
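
For concreteness, here is a rough sketch of the two candidate fits I have
in mind. The Gaussian peak shape, the synthetic counts, and the use of
scipy's curve_fit are all just illustrative assumptions, not the actual
analysis:

import numpy as np
from scipy.optimize import curve_fit

def peak_plus_linear(x, area, center, sigma, b0, b1):
    # Gaussian peak (parameterized by its area) on a linear background.
    gauss = area * np.exp(-0.5 * ((x - center) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return gauss + b0 + b1 * x

def peak_plus_poly(x, area, center, sigma, *coeffs):
    # Same peak on a polynomial background with len(coeffs) terms.
    gauss = area * np.exp(-0.5 * ((x - center) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return gauss + np.polyval(coeffs[::-1], x)

# Synthetic 80-channel spectrum with Poisson counts, standing in for real data.
rng = np.random.default_rng(0)
x = np.arange(80.0)
y = rng.poisson(peak_plus_poly(x, 500.0, 40.0, 3.0, 20.0, 0.5, -0.005)).astype(float)
errs = np.sqrt(np.maximum(y, 1.0))

for name, model, p0 in [("linear bkg   ", peak_plus_linear, [400, 40, 3, 10, 0]),
                        ("quadratic bkg", peak_plus_poly, [400, 40, 3, 10, 0, 0])]:
    popt, _ = curve_fit(model, x, y, p0=p0, sigma=errs)
    chi2 = np.sum(((y - model(x, *popt)) / errs) ** 2)
    print(f"{name}: area={popt[0]:7.1f}  center={popt[1]:6.2f}  "
          f"chi2/ndf={chi2 / (len(x) - len(popt)):.2f}")

The script prints a reduced chi-square for each model; the question is what
principled rule should turn those two numbers into a choice.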

Paul


On Fri, Apr 12, 2019 at 11:54 AM John Denker via Phys-l <
phys-l@mail.phys-l.org> wrote:

On 4/12/19 8:58 AM, Paul Nord asked:

Suppose I have some physics data binned into 80 bins. And I have two
models which propose to fit some characteristic of interest and a
non-linear background. One is a simple polynomial. The other is more
complicated. I can calculate a reduced chi-square for each fit.

What's the best way to compare the two models?

It's complicated. Tremendously complicated. I can pretty
much guarantee that all the concepts and all the techniques
you were taught in school are seriously deficient.

Three of the smartest guys I know were just awarded the Turing
Award for work on this topic.

https://www.wired.com/story/godfathers-ai-boom-win-computings-highest-honor/

https://www.washingtonpost.com/technology/2019/03/27/artificial-intelligence-pioneers-win-turing-award/

https://www.vox.com/future-perfect/2019/4/4/18294978/ai-turing-award-neural-networks

Here is a stub of a discussion, with some examples that serve
as a warning of what can go wrong. By itself, this is not
very constructive, but it may serve to motivate learning the
advanced techniques:
https://www.av8n.com/physics/data-analysis.htm

Here is a relatively gentle introduction, showing that under
favorable conditions the question *does* have a provably
correct answer:
http://web.cs.iastate.edu/~honavar/pac.pdf


A great deal depends on how much data you have. Note the
contrast:
-- In the real world, data is often expensive and hard to
come by, and truly professional-grade techniques use the
data very efficiently.
-- Sometimes, if you are lucky and/or in a contrived
classroom situation, there is abundant data, and you
can avail yourself of helpful shortcuts.

NOTE: This discussion assumes the purpose of modeling is to
*predict* future data. (If this is not your purpose, please
explain in detail.)

So, if you have tons of data:
Use half of it as the /training set/ (to tune the adjustable
parameters of the models) and then use the other half as the
/testing set/ (to see how well they predict). That gives you
a pretty good handle on what's a good model and what's not
(assuming the testing data is /representative/ of the as-yet-
unseen future data).
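
For instance, here is a minimal, self-contained sketch of that half/half
procedure; the toy data and the two candidate polynomial models are
assumptions purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)   # truth is linear

# Randomly assign half the points to the training set, half to the testing set.
mask = rng.random(x.size) < 0.5
x_tr, y_tr = x[mask], y[mask]
x_te, y_te = x[~mask], y[~mask]

for degree in (1, 5):
    coeffs = np.polyfit(x_tr, y_tr, degree)      # tune on the training half
    resid = y_te - np.polyval(coeffs, x_te)      # predict the testing half
    print(f"degree {degree}: test mean-square error = {np.mean(resid ** 2):.3f}")

The higher-degree model always fits its own training half at least as well,
but the testing half is what tells you whether the extra terms are real.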

If you have a huge number of models to choose from, you can
train them using some data and then choose among them using
more data. This requires three data sets: a training set,
a choosing set, and then a testing set. Let's be clear:
choosing may look like testing, but really it is just a
continuation of training by other means. If you quote the
score from the choosing step as if it were a test, that is
p-hacking. To avoid fooling yourself, you need another
independent data set to validate the chosen model.
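
Here is the same sort of sketch extended to the three-way split (again,
the synthetic data and the polynomial candidates are just illustrative
assumptions):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 300)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Three disjoint subsets: train on one, choose on another, test on the last.
idx = rng.permutation(x.size)
train, choose, test = np.split(idx, [100, 200])

def mse(coeffs, sel):
    return np.mean((y[sel] - np.polyval(coeffs, x[sel])) ** 2)

# Tune every candidate on the training set, then pick a winner on the choosing set.
fits = {d: np.polyfit(x[train], y[train], d) for d in range(1, 10)}
best = min(fits, key=lambda d: mse(fits[d], choose))

# The choosing data is now used up; only the testing set gives an honest score.
print("chosen degree:", best, "   test MSE:", round(mse(fits[best], test), 3))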

As Yann likes to say:
données utilisées = données usées
that is,
used data = used-up data

In other words, you can do whatever you want with the
training data, but don't try to re-use testing data, lest
you fall into the p-hacking trap.

Bottom line: there are relatively straightforward ways of
not fooling yourself, if you have abundant data.

If data is scarce, more advanced techniques are required.