
Re: [Phys-L] From a Math Prof (physics BS major) at my institution ( math challenge)



On 02/22/2014 03:40 PM, Derek McKenzie wrote:

d) Decide on the statistical test you are going to do BEFORE peeking at the data.

That's really good advice.

At the next level of detail: The "test you are going to do" can include
any finite number of subtests. In contrast, if you allow an endless
search through the space of all possible tests and subtests, you wind
up with a witch hunt.
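
Here is a quick numerical sketch of that point (mine, not Derek's; the
sample size and the 5% threshold are just illustrative choices). It runs
honest t-tests on pure noise and shows how the chance of stumbling onto
at least one "significant" result climbs with the number of subtests you
allow yourself; for M independent tests at the 5% level the false-alarm
rate is roughly 1 - 0.95^M.

  # Multiple-comparisons demo: on pure noise, an unrestricted search
  # through subtests almost always "finds" something.
  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(1)
  n_trials = 2000            # repeat the whole experiment many times
  n_samples = 20             # data points per subtest

  for n_subtests in (1, 5, 20, 100):
      false_alarms = 0
      for _ in range(n_trials):
          # The null hypothesis (mean = 0) is true for every subtest.
          data = rng.normal(size=(n_subtests, n_samples))
          pvals = stats.ttest_1samp(data, 0.0, axis=1).pvalue
          if np.any(pvals < 0.05):
              false_alarms += 1
      print(f"{n_subtests:4d} subtests -> chance of a bogus 'discovery' "
            f"~ {false_alarms / n_trials:.2f}")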

This can be connected to your intuition about fitting data with polynomials.
If there are too few data points and too many adjustable parameters, the
fit is guaranteed to be unstable. Similarly, if there are too few data
points and too many subtests, the results are guaranteed to be unstable.
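
To see the instability directly, here is another sketch (mine; the
degree, the 1% noise level, and the underlying sine curve are all
arbitrary choices): fit a polynomial with as many parameters as data
points, nudge the data slightly, and compare the resulting curves
*between* the data points.

  # Overfitting demo: 10 parameters, 10 data points, tiny perturbations,
  # noticeably different fitted curves.
  import numpy as np

  rng = np.random.default_rng(2)
  x = np.linspace(0.0, 1.0, 10)          # only 10 data points
  y = np.sin(2 * np.pi * x)              # smooth underlying "truth"
  xx = np.linspace(0.0, 1.0, 201)        # dense grid for comparing fits

  fits = []
  for _ in range(5):
      y_noisy = y + rng.normal(scale=0.01, size=y.shape)   # ~1% perturbation
      coeffs = np.polyfit(x, y_noisy, deg=9)               # 10 parameters
      fits.append(np.polyval(coeffs, xx))

  spread = np.ptp(np.array(fits), axis=0)   # pointwise spread of the 5 fits
  print("perturbation size        :", 0.01)
  print("max spread between fits  :", round(spread.max(), 3))

The spread between the fitted curves comes out far larger than the
perturbation that produced it, and it gets worse as the number of
parameters grows relative to the number of points.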

All this can be quantified in terms of the Vapnik-Chervonenkis (VC) dimensionality.
This generalizes the notion of "number of fitting parameters" insofar as a
polynomial with N parameters has VC dimension equal to N. A list of M
simple subtests has VC dimension at most log2(M).
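
Where the log comes from (this gloss is mine, not part of the original
argument): shattering d points requires realizing all 2^d possible
labelings of those points, while a fixed list of M subtests can realize
at most M distinct labelings. Hence

  \[
      2^{d} \le M
      \qquad\Longrightarrow\qquad
      d \;=\; \mathrm{VCdim} \;\le\; \log_{2} M .
  \]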

Beware that just counting the number of parameters does not generally
give the right answer, for things other than polynomials. In particular,
a sine wave with just *one* adjustable parameter has an *infinite* VC
dimensionality. I kid you not. You can get some clue as to how this
comes about here:
http://www.av8n.com/physics/thinking.htm#sec-omit

The basic theory goes like this: the sine is periodic, so sin(w*x)
in effect computes w*x modulo 2 pi. Evaluating at x = 10^-k then
picks out the k-th digit of the decimal expansion of the parameter
w, including digits waaaaay to the right of the decimal point.
There are an unlimited number of such digits, so an unlimited
amount of information can be packed into that one parameter.
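
For the morbidly curious, here is a self-contained sketch of the
standard construction (mine, not from the page above; the points
x_i = 10^-i and the digit-encoding of w are the usual textbook trick,
with the digits sitting to the left of the decimal point of w/pi and
getting shifted down by the small x_i — same idea). Any labeling of
the m points can be realized by a suitable w, for any m, which is
exactly what infinite VC dimension means.

  # One-parameter sine classifier f_w(x) = sign(sin(w*x)) shattering
  # the points x_i = 10^-i: the desired labels are stored as decimal
  # digits of w/pi, and evaluating at x_i shifts digit i into the
  # range where the modulo-2*pi behavior of the sine can see it.
  import numpy as np

  def shattering_parameter(labels):
      """Given labels y_1..y_m in {0,1}, return w such that
      sign(sin(w * 10**-i)) is +1 when y_i == 0 and -1 when y_i == 1."""
      return np.pi * (1 + sum(y * 10 ** (j + 1) for j, y in enumerate(labels)))

  def classify(w, m):
      x = 10.0 ** -np.arange(1, m + 1)
      return np.sign(np.sin(w * x))

  rng = np.random.default_rng(0)
  m = 6
  for _ in range(5):
      labels = rng.integers(0, 2, size=m)
      w = shattering_parameter(labels)
      wanted = np.where(labels == 0, 1.0, -1.0)
      achieved = classify(w, m)
      assert np.array_equal(achieved, wanted)
      print("labels", labels, "-> classifier output", achieved)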

The VC dimension is related to another quantity you may have heard of,
namely the /entropy/.

=============

This is also related to the seersucker law, which states that for every
seer there is a sucker.

Suppose I "predict" that the Dow Jones will go up on Monday. I send
out 25,000 emails documenting my prediction ... and another 25,000
documenting the exact opposite prediction. The next day I repeat
the trick on whichever half of the cohort saw a correct prediction
on Monday, sending each half of *them* the opposite call. After a
couple of weeks of halving, I am left with about 50 people who have
seen 10 correct
"predictions" in a row. They are convinced this cannot possibly be
a coincidence. They think I am a seer who can predict the market.
They send me their money so I can "invest" it for them.........
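
A back-of-the-envelope check on the arithmetic (mine): each round of
opposite mailings halves the pool of people who have seen nothing but
correct calls, no matter what the market actually does.

  # Seersucker arithmetic: 50,000 recipients, halved every trading day.
  pool = 50_000            # 25,000 "up" emails + 25,000 "down" emails
  for day in range(1, 11):
      pool //= 2           # only half saw a correct prediction today
      print(f"after day {day:2d}: {pool:6d} people with a perfect record")

Each survivor has witnessed something that, viewed in isolation, had
roughly a 1-in-1000 chance of happening; viewed against the 50,000
people I started with, it was a certainty.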

Bottom line: There is an art and even a science of making statistical
inferences. Most people are really bad at it. Typical statistics
books present a lot of nitty-gritty techniques and weird terminology
... but do not do a good job of explaining the big picture.