
Re: [Phys-L] From a Math Prof (physics BS major) at my institution ( math challenge)



Correction: with 105 numbers in 35 bins, the expected frequency per bin is,
of course, 3, not 1. I used 3 in the analysis but incorrectly wrote 1 in the
email. Senior moment!

-----Original Message-----
From: Phys-l [mailto:phys-l-bounces@phys-l.org] On Behalf Of Donald Polvani
Sent: Tuesday, February 18, 2014 8:17 PM
To: Phys-L@Phys-L.org
Subject: Re: [Phys-L] From a Math Prof (physics BS major) at my institution
( math challenge)

On 2/18/14 John Denker at 6:16 pm wrote:

"So far, the only statistic that looks out of whack to me is the scarcity of
numbers ending in 0 (i.e. numbers equal to 0 mod 10) in the first set. OTOH
people have looked at a lot of statistics, and if you look at enough, sooner
or later you will find /something/ that is out of whack ...
even if the data is truly random."

I've now been able to load the data into Excel and do a little better
analysis than my first attempt. I did a chi-squared test for goodness of
fit under the hypothesis that the two lists came from a uniform random
distribution between the numbers 1 and 35. I divided each list into 5, 35,
and finally 10 histogram bins. Since each list has 105 numbers, for 5 bins
the expected frequency in each bin is 21, for 35 bins it is 1, and for 10
bins it is 10.5. I computed the 5 bin results by hand and, once Excel
confirmed my results, switched to Excel's CHISQ.TEST function to get the
chi-square probability, i.e. the probability of seeing a deviation at least
this large if the data did come from a uniform random distribution. Here
is what I found:

1) 5 Bin Results: Chi-square probability for List 1 is 0.86; for List 2 it
is 0.15 (ratio of probabilities is 5.73)
2) 35 Bin Results: Chi-square probability for List 1 is 0.53; for List 2 it
is 0.38 (ratio of probabilities is 1.39)
3) 10 Bin Results: Chi-square probability for List 1 is 0.51; for List 2 it
is 0.046 (ratio of probabilities is 11.1)
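
For anyone who wants to repeat this kind of test outside Excel, here is a
minimal Python sketch of a chi-squared goodness-of-fit test against a uniform
distribution. It is not the spreadsheet used above: the function names and
the bin counts are hypothetical (the two lists are not reproduced in this
message), and it assumes every bin has the same expected frequency. The tail
probability is computed from a standard series for the incomplete gamma
function so that only the standard library is needed.

```python
import math


def chi_square_statistic(observed, expected):
    """Pearson's statistic: sum over bins of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_sf(x, dof):
    """Upper-tail probability P(X >= x) for a chi-squared variable."""
    if x <= 0:
        return 1.0
    a, z = dof / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))


def uniform_fit_probability(counts):
    """Return (statistic, tail probability) assuming equal expected counts."""
    total = sum(counts)
    expected = [total / len(counts)] * len(counts)
    stat = chi_square_statistic(counts, expected)
    # Degrees of freedom: number of bins minus one.
    return stat, chi_square_sf(stat, len(counts) - 1)


# Hypothetical 5-bin counts for a list of 105 numbers (NOT the real data).
hypothetical_counts = [21, 19, 24, 20, 21]
stat, p = uniform_fit_probability(hypothetical_counts)
print(f"chi-square = {stat:.3f}, probability = {p:.3f}")
```

A small probability means the bin counts deviate from uniform more than
chance would usually produce; a large one means the counts are consistent
with uniform, not that uniformity is proven.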

As I mentioned earlier, I'm not that familiar with tests for the randomness
of distributions, but a little Wikipedia research revealed that the
chi-squared test is not accurate if the expected frequencies are below 10
(some say 5 is OK). So the 35 bin results (with expected frequencies of 1)
are suspect. I reasoned that I would like to increase the number of degrees
of freedom as much as possible before the expected frequency fell below 10.
Therefore, I tried the 10 bin analysis (with an expected frequency of 10.5),
which did lead to the highest chi-square probability ratio for List 1
compared to List 2.
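
The bin-count reasoning above is simple arithmetic and can be sketched in a
few lines of Python. The helper name max_bins is hypothetical, and the sketch
assumes equal-probability bins, so the expected frequency is just the sample
size divided by the number of bins.

```python
def max_bins(sample_size, min_expected):
    """Largest bin count that keeps sample_size / bins >= min_expected."""
    return int(sample_size // min_expected)


n = 105  # numbers in each list
for threshold in (10, 5):
    k = max_bins(n, threshold)
    print(f"threshold {threshold}: up to {k} bins, expected {n / k} per bin")

# Expected frequencies for the three bin counts discussed in this thread.
for k in (5, 10, 35):
    print(f"{k} bins: expected {n / k} per bin")
```

Under the stricter threshold of 10 this allows at most 10 bins; under the
looser threshold of 5 it allows 21, which is still fewer than 35.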

I have to admit that if I were still working and this problem came across
my desk (with limited time to do analysis, as was always the case), I would
have chosen List 1 as the random data (with the caveat that this was only
"probably so").

I'm wondering whether the math professor "cherry-picked" some student data
by using a set (List 1) which he knew was going to look a lot like random
data. I also realize that extreme events, even if highly unlikely, do occur
(witness the 2009 financial crisis, or the highly unlikely poker hands
mentioned here recently). On a personal note, last Thursday, out of the
hundreds of thousands of people living in Maryland, we were among the lucky
3700 to have our power go out (fortunately, only for 2 hours).



_______________________________________________
Forum for Physics Educators
Phys-l@phys-l.org
http://www.phys-l.org/mailman/listinfo/phys-l