
Re: [Phys-L] From a Math Prof (physics BS major) at my institution ( math challenge)




In the matrix math package MATLAB we can quickly compute correlations (among many other operations). Specifically:
R = corrcoef(X) returns a matrix R of correlation coefficients calculated from an input matrix X whose rows are observations and whose columns are variables.
[R,P]=corrcoef(...) also returns P, a matrix of p-values for testing the hypothesis of no correlation. Each p-value is the probability of getting a correlation as large as the observed value by random chance, when the true correlation is zero. If P(i,j) is small, say less than 0.05, then the correlation R(i,j) is significant.

The p-value is computed by transforming the correlation to create a t statistic having n-2 degrees of freedom, where n is the number of rows of X. The confidence bounds are based on an asymptotic normal distribution of 0.5*log((1+R)/(1-R)), with an approximate variance equal to 1/(n-3). These bounds are accurate for large samples when X has a multivariate normal distribution. The 'pairwise' option can produce an R matrix that is not positive definite.
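
The two calculations described above can be sketched in base MATLAB for a single pair of columns. This is only an illustration under stated assumptions, not corrcoef's actual source: the values r = 0.6782 and n = 21 are taken from the first data set further down, the variable names are made up for the example, and the 95% level for the bounds is an assumed choice. The incomplete beta function stands in for the t distribution so that no toolbox is needed.

```matlab
% Sketch only: p-value and confidence bounds for one correlation coefficient.
% r and n are taken from the first data set in this post (r(1,2), 21 rows).
r  = 0.6782;         % sample correlation
n  = 21;             % number of observations (rows of X)
df = n - 2;          % degrees of freedom of the t statistic

% Transform the correlation into a t statistic.
t = r * sqrt(df / (1 - r^2));

% Two-sided p-value of t with df degrees of freedom, written with the
% regularized incomplete beta function (equivalent to 2*tcdf(-abs(t),df)).
pval = betainc(df / (df + t^2), df/2, 0.5);

% Confidence bounds from the Fisher transform z = 0.5*log((1+r)/(1-r)),
% treated as normal with variance 1/(n-3). alpha = 0.05 is an assumed level.
alpha = 0.05;
z     = 0.5 * log((1 + r) / (1 - r));
zcrit = sqrt(2) * erfinv(1 - alpha);   % standard normal quantile, about 1.96
half  = zcrit / sqrt(n - 3);
rlo   = tanh(z - half);                % back-transform to the r scale
rup   = tanh(z + half);

fprintf('t = %.4f, p = %.4f, bounds = [%.4f, %.4f]\n', t, pval, rlo, rup);
```

The printed p-value can be compared with the entry p(1,2) in the corrcoef output for the data array.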

As an example, we generate random data in which column 4 is correlated with the other columns:
x = randn(30,4); % Uncorrelated data
x(:,4) = sum(x,2); % Introduce correlation.
[r,p] = corrcoef(x) % Compute sample correlation and p-values.
[i,j] = find(p<0.05); % Find significant correlations.
[i,j] % Display their (row,col) indices.

r =
1.0000 -0.3566 0.1929 0.3457
-0.3566 1.0000 -0.1429 0.4461
0.1929 -0.1429 1.0000 0.5183
0.3457 0.4461 0.5183 1.0000

p =
1.0000 0.0531 0.3072 0.0613
0.0531 1.0000 0.4511 0.0135
0.3072 0.4511 1.0000 0.0033
0.0613 0.0135 0.0033 1.0000

ans =
4 2
4 3
2 4
3 4
***************************

Now in the particular data sets of interest, we have
data =

2 6 7 25 34
3 9 12 15 34
6 16 21 28 32
6 10 13 21 23
4 18 26 27 34
3 6 17 27 32
3 11 21 22 35
1 2 8 17 27
7 12 14 24 31
3 7 14 18 27
7 13 22 25 31
7 12 23 31 32
4 17 18 22 35
8 15 17 20 25
12 16 18 29 34
2 7 11 16 21
8 23 24 32 35
17 19 23 29 31
9 16 27 28 32
6 15 19 26 32
6 13 15 23 31

and data2 =

11 17 19 28 31
3 11 29 32 35
14 21 24 28 33
9 14 22 23 31
3 21 26 30 31
5 15 20 27 29
2 23 24 25 26
7 13 20 24 25
3 23 26 27 28
6 20 21 26 29
1 10 14 19 35
12 18 27 32 35
2 6 24 27 28
3 8 11 21 30
9 14 20 25 31
4 13 19 21 28
10 11 12 21 31
2 7 11 20 24
6 17 25 29 30
13 23 24 26 34
9 17 21 25 26



We then compute the correlations and p-values for the first data set:

[r,p]=corrcoef(data)

r =

1.0000 0.6782 0.5378 0.5900 0.1331
0.6782 1.0000 0.7823 0.6477 0.4311
0.5378 0.7823 1.0000 0.7075 0.4356
0.5900 0.6477 0.7075 1.0000 0.5493
0.1331 0.4311 0.4356 0.5493 1.0000


p =

1.0000 0.0007 0.0119 0.0049 0.5652
0.0007 1.0000 0.0000 0.0015 0.0510
0.0119 0.0000 1.0000 0.0003 0.0484
0.0049 0.0015 0.0003 1.0000 0.0099
0.5652 0.0510 0.0484 0.0099 1.0000


[i,j] = find(p<0.005);
[i,j]

ans =

2 1
4 1
1 2
3 2
4 2
2 3
4 3
1 4
2 4
3 4


We repeat the process for the second data set, data2:

[r2,p2]=corrcoef(data2)
r2 =

1.0000 0.3869 0.1578 0.2600 0.3635
0.3869 1.0000 0.5874 0.4726 0.1181
0.1578 0.5874 1.0000 0.8598 0.2710
0.2600 0.4726 0.8598 1.0000 0.3576
0.3635 0.1181 0.2710 0.3576 1.0000


p2 =

1.0000 0.0832 0.4944 0.2550 0.1053
0.0832 1.0000 0.0051 0.0305 0.6102
0.4944 0.0051 1.0000 0.0000 0.2348
0.2550 0.0305 0.0000 1.0000 0.1115
0.1053 0.6102 0.2348 0.1115 1.0000

Next we check for significant correlations at p < 0.005:

>> [i,j] = find(p2<0.005);
>> [i,j]

ans =

4 3
3 4
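
Because the p-value matrices are symmetric, find lists every pair twice. A small sketch that counts each pair once is given here. It uses the p-values as printed above, rounded to four decimals, so a value sitting almost exactly on the threshold could in principle be misclassified. The names alpha, nSig and nSig2 are illustrative.

```matlab
% p-value matrices copied (to 4 decimals) from the corrcoef output above.
p  = [1.0000 0.0007 0.0119 0.0049 0.5652
      0.0007 1.0000 0.0000 0.0015 0.0510
      0.0119 0.0000 1.0000 0.0003 0.0484
      0.0049 0.0015 0.0003 1.0000 0.0099
      0.5652 0.0510 0.0484 0.0099 1.0000];

p2 = [1.0000 0.0832 0.4944 0.2550 0.1053
      0.0832 1.0000 0.0051 0.0305 0.6102
      0.4944 0.0051 1.0000 0.0000 0.2348
      0.2550 0.0305 0.0000 1.0000 0.1115
      0.1053 0.6102 0.2348 0.1115 1.0000];

alpha = 0.005;   % same threshold as in the find calls above

% triu(...,1) keeps only the entries above the diagonal,
% so each pair of columns is counted once.
nSig  = nnz(triu(p  < alpha, 1))   % data:  5 pairs
nSig2 = nnz(triu(p2 < alpha, 1))   % data2: 1 pair

% List the distinct pairs for the first data set as (row, col) indices.
[i,j] = find(triu(p < alpha, 1));
[i,j]
```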


You will notice that MATLAB flags more significant correlations in the data array (five distinct column pairs at p < 0.005) than in the data2 array (one pair), so you might conclude that the process that generated data2 was closer to random than the process that generated data. This agrees with Joel's identification of data as the student-generated list.

Brian Whatcott Altus OK


On 2/18/2014 9:46 AM, Rauber, Joel wrote:
The second list was the random list. As noted, one cannot prove which one was the random list, you can only make a probabilistic guess.

I looked at two factors:
1. The number of times consecutive numbers appear, which suggests the 2nd list is random.
2. The number of times numbers in the range [30-35] appeared compared to the other decade ranges, which also lends evidence that the second list was the random one.
/snip/