
Re: [Phys-L] From a Math Prof (physics BS major) at my institution ( math challenge)



On 2/18/14 at 10:20 pm, Brian Whatcott wrote:

"In a particular matrix math package called Matlab we can quickly compute
correlations (among many other operations) : specifically - R = corrcoef(X)
returns a matrix R of correlation coefficients calculated from an input
matrix X whose rows are observations and whose columns are variables.
[R,P]=corrcoef(...) also returns P, a matrix of p-values for testing the
hypothesis of no correlation. Each p-value is the probability of getting a
correlation as large as the observed value by random chance, when the true
correlation is zero. If P(i,j) is small, say less than 0.05, then the
correlation R(i,j) is significant."

I tried this in Excel using the function CORREL to compute the correlation
coefficients and got the same answers as Brian's MATLAB results. However, I
couldn't find an Excel function to directly compute the probabilities for
the correlation coefficients. Also the Excel CORREL function computed the
coefficients one at a time rather than all at once as MATLAB does, so it's a
bit tedious.

The lesson in this for me is that you need to test both for the expected
distribution (e.g. by looking at the frequency of the data with a chi-square
goodness of fit test as I did) and the randomness of the data by looking for
correlations (as Brian and others did). Unfortunately, in this case the
two ways of looking at the data disagree on the answer. So how do you know
which method has priority? Or, do you simply say, it's unclear?
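For readers who want to try a frequency check of this kind, here is a minimal Python sketch of a chi-square goodness-of-fit statistic. It is not the analysis described above, whose details are not given in this thread. The choice of seven equal-width bins over 1 to 35, the uniform expectation, and the use of the first list quoted below (Brian's "data") are all assumptions made for illustration.

```python
# Sketch only: the bin width, the 1..35 range and the uniform model are assumptions.
data = [
    (2, 6, 7, 25, 34), (3, 9, 12, 15, 34), (6, 16, 21, 28, 32),
    (6, 10, 13, 21, 23), (4, 18, 26, 27, 34), (3, 6, 17, 27, 32),
    (3, 11, 21, 22, 35), (1, 2, 8, 17, 27), (7, 12, 14, 24, 31),
    (3, 7, 14, 18, 27), (7, 13, 22, 25, 31), (7, 12, 23, 31, 32),
    (4, 17, 18, 22, 35), (8, 15, 17, 20, 25), (12, 16, 18, 29, 34),
    (2, 7, 11, 16, 21), (8, 23, 24, 32, 35), (17, 19, 23, 29, 31),
    (9, 16, 27, 28, 32), (6, 15, 19, 26, 32), (6, 13, 15, 23, 31),
]


def chi_square_uniform(rows, low=1, high=35, width=5):
    """Bin every value and compare the bin counts with a uniform expectation."""
    n_bins = (high - low + 1) // width
    observed = [0] * n_bins
    for row in rows:
        for value in row:
            observed[(value - low) // width] += 1
    expected = sum(observed) / n_bins  # same expected count in every bin
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return observed, stat


# Tabulated 5% critical value of chi-square with 7 - 1 = 6 degrees of freedom.
CRITICAL_5PCT_DF6 = 12.592

observed, stat = chi_square_uniform(data)
print("bin counts:", observed)
print("chi-square statistic:", round(stat, 3))
print("rejects uniformity at 5%:", stat > CRITICAL_5PCT_DF6)
```

A different bin width changes both the statistic and the degrees of freedom, so the critical value has to be looked up again if the bins are changed.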

Don
Dr. Donald Polvani
Anne Arundel Community College
Adjunct Faculty, Physics (Retired)

-----Original Message-----
From: Phys-l [mailto:phys-l-bounces@phys-l.org] On Behalf Of brian whatcott
Sent: Tuesday, February 18, 2014 10:20 PM
To: Phys-L@Phys-L.org
Subject: Re: [Phys-L] From a Math Prof (physics BS major) at my institution
( math challenge)


In a particular matrix math package called Matlab we can quickly compute
correlations (among many other operations) : specifically - R = corrcoef(X)
returns a matrix R of correlation coefficients calculated from an input
matrix X whose rows are observations and whose columns are variables.
[R,P]=corrcoef(...) also returns P, a matrix of p-values for testing the
hypothesis of no correlation. Each p-value is the probability of getting a
correlation as large as the observed value by random chance, when the true
correlation is zero. If P(i,j) is small, say less than 0.05, then the
correlation R(i,j) is significant.

The p-value is computed by transforming the correlation to create a t
statistic having n-2 degrees of freedom, where n is the number of rows of X.
The confidence bounds are based on an asymptotic normal distribution of
0.5*log((1+R)/(1-R)), with an approximate variance equal to 1/(n-3). These
bounds are accurate for large samples when X has a multivariate normal
distribution. The 'pairwise' option can produce an R matrix that is not
positive definite.
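The t transformation and the Fisher bounds described above can be sketched in Python using only the standard library. This is an illustrative reconstruction from the description, not MATLAB's own code; the function names are hypothetical, and the t tail probability is obtained by numerical integration (Simpson's rule) rather than a library routine.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_t_p(t, df, steps=2000):
    """P(|T| >= |t|), via Simpson's rule on the density over [0, |t|]."""
    a = abs(t)
    if a == 0.0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    inside = 2.0 * total * h / 3.0  # P(-|t| < T < |t|), by symmetry
    return max(0.0, 1.0 - inside)


def corr_p_value(r, n):
    """Two-sided p-value for a sample correlation r from n rows (|r| < 1, n > 2)."""
    t = r * math.sqrt((n - 2) / (1.0 - r * r))  # t statistic with n-2 d.o.f.
    return two_sided_t_p(t, n - 2)


def fisher_bounds(r, n, level=0.95):
    """Approximate confidence bounds from 0.5*log((1+r)/(1-r)), variance 1/(n-3)."""
    z = math.atanh(r)  # identical to 0.5*log((1+r)/(1-r))
    half = NormalDist().inv_cdf(0.5 + level / 2) / math.sqrt(n - 3)
    return math.tanh(z - half), math.tanh(z + half)


# r(1,2) = -0.3566 with n = 30 rows; compare with the 0.0531 that MATLAB
# prints for this pair in the example that follows.
print(round(corr_p_value(-0.3566, 30), 4))
print(fisher_bounds(-0.3566, 30))
```

The diagonal entries (r = 1) are excluded here because the t statistic is undefined for them; MATLAB simply reports 1 on the diagonal of P.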

As an example, we generate random data in which column 4 is correlated with
the other columns:
x = randn(30,4); % Uncorrelated data
x(:,4) = sum(x,2); % Introduce correlation.
[r,p] = corrcoef(x) % Compute sample correlation and p-values.
[i,j] = find(p<0.05); % Find significant correlations.
[i,j] % Display their (row,col) indices.

r =
1.0000 -0.3566 0.1929 0.3457
-0.3566 1.0000 -0.1429 0.4461
0.1929 -0.1429 1.0000 0.5183
0.3457 0.4461 0.5183 1.0000

p =
1.0000 0.0531 0.3072 0.0613
0.0531 1.0000 0.4511 0.0135
0.3072 0.4511 1.0000 0.0033
0.0613 0.0135 0.0033 1.0000

ans =
4 2
4 3
2 4
3 4
***************************

Now in the particular data sets of interest, we have data =

2 6 7 25 34
3 9 12 15 34
6 16 21 28 32
6 10 13 21 23
4 18 26 27 34
3 6 17 27 32
3 11 21 22 35
1 2 8 17 27
7 12 14 24 31
3 7 14 18 27
7 13 22 25 31
7 12 23 31 32
4 17 18 22 35
8 15 17 20 25
12 16 18 29 34
2 7 11 16 21
8 23 24 32 35
17 19 23 29 31
9 16 27 28 32
6 15 19 26 32
6 13 15 23 31

and data2 =

11 17 19 28 31
3 11 29 32 35
14 21 24 28 33
9 14 22 23 31
3 21 26 30 31
5 15 20 27 29
2 23 24 25 26
7 13 20 24 25
3 23 26 27 28
6 20 21 26 29
1 10 14 19 35
12 18 27 32 35
2 6 24 27 28
3 8 11 21 30
9 14 20 25 31
4 13 19 21 28
10 11 12 21 31
2 7 11 20 24
6 17 25 29 30
13 23 24 26 34
9 17 21 25 26



We then compute correlations....

[r,p]=corrcoef(data)

r =

1.0000 0.6782 0.5378 0.5900 0.1331
0.6782 1.0000 0.7823 0.6477 0.4311
0.5378 0.7823 1.0000 0.7075 0.4356
0.5900 0.6477 0.7075 1.0000 0.5493
0.1331 0.4311 0.4356 0.5493 1.0000


p =

1.0000 0.0007 0.0119 0.0049 0.5652
0.0007 1.0000 0.0000 0.0015 0.0510
0.0119 0.0000 1.0000 0.0003 0.0484
0.0049 0.0015 0.0003 1.0000 0.0099
0.5652 0.0510 0.0484 0.0099 1.0000


[i,j] = find(p<0.005);
[i,j]

ans =

2 1
4 1
1 2
3 2
4 2
2 3
4 3
1 4
2 4
3 4


We repeat the process for the second data set data2, in this way:

[r2,p2]=corrcoef(data2)
r2 =

1.0000 0.3869 0.1578 0.2600 0.3635
0.3869 1.0000 0.5874 0.4726 0.1181
0.1578 0.5874 1.0000 0.8598 0.2710
0.2600 0.4726 0.8598 1.0000 0.3576
0.3635 0.1181 0.2710 0.3576 1.0000


p2 =

1.0000 0.0832 0.4944 0.2550 0.1053
0.0832 1.0000 0.0051 0.0305 0.6102
0.4944 0.0051 1.0000 0.0000 0.2348
0.2550 0.0305 0.0000 1.0000 0.1115
0.1053 0.6102 0.2348 0.1115 1.0000

Next we check for significant correlations at p < 0.005:

[i,j] = find(p2<0.005);
[i,j]

ans =

4 3
3 4


You will notice that Matlab flags more significant correlations in the data
array (five distinct column pairs at p < 0.005) than in the data2 array (one
pair), so you might conclude that the process that generated data2 was much
closer to random than the process that generated the data array.
This agrees with Joel's identification of the student-generated list (given
here as data).

Brian Whatcott Altus OK


On 2/18/2014 9:46 AM, Rauber, Joel wrote:
The second list was the random list. As noted, one cannot prove which one
was the random list, you can only make a probabilistic guess.

I looked at two factors:
1. The number of times consecutive numbers appear, which suggests that the
2nd list is random.
2. The number of times numbers in the range [30-35] appeared compared to the
other decade ranges, which also lends evidence that the second list was the
random one.
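The two counts described above can be reproduced with a short Python sketch. The function names are hypothetical, and the code assumes each row is sorted in ascending order (as both lists in this thread are), so consecutive numbers always sit next to each other.

```python
# First list ("data") and second list ("data2") from this thread, one tuple per row.
data = [
    (2, 6, 7, 25, 34), (3, 9, 12, 15, 34), (6, 16, 21, 28, 32),
    (6, 10, 13, 21, 23), (4, 18, 26, 27, 34), (3, 6, 17, 27, 32),
    (3, 11, 21, 22, 35), (1, 2, 8, 17, 27), (7, 12, 14, 24, 31),
    (3, 7, 14, 18, 27), (7, 13, 22, 25, 31), (7, 12, 23, 31, 32),
    (4, 17, 18, 22, 35), (8, 15, 17, 20, 25), (12, 16, 18, 29, 34),
    (2, 7, 11, 16, 21), (8, 23, 24, 32, 35), (17, 19, 23, 29, 31),
    (9, 16, 27, 28, 32), (6, 15, 19, 26, 32), (6, 13, 15, 23, 31),
]
data2 = [
    (11, 17, 19, 28, 31), (3, 11, 29, 32, 35), (14, 21, 24, 28, 33),
    (9, 14, 22, 23, 31), (3, 21, 26, 30, 31), (5, 15, 20, 27, 29),
    (2, 23, 24, 25, 26), (7, 13, 20, 24, 25), (3, 23, 26, 27, 28),
    (6, 20, 21, 26, 29), (1, 10, 14, 19, 35), (12, 18, 27, 32, 35),
    (2, 6, 24, 27, 28), (3, 8, 11, 21, 30), (9, 14, 20, 25, 31),
    (4, 13, 19, 21, 28), (10, 11, 12, 21, 31), (2, 7, 11, 20, 24),
    (6, 17, 25, 29, 30), (13, 23, 24, 26, 34), (9, 17, 21, 25, 26),
]


def count_consecutive_pairs(rows):
    """Count neighbouring entries in each (sorted) row that differ by exactly 1."""
    return sum(1 for row in rows for a, b in zip(row, row[1:]) if b - a == 1)


def count_in_range(rows, low, high):
    """Count values anywhere in the list that fall in [low, high]."""
    return sum(1 for row in rows for value in row if low <= value <= high)


for name, rows in (("data", data), ("data2", data2)):
    print(name,
          "consecutive pairs:", count_consecutive_pairs(rows),
          "values in 30-35:", count_in_range(rows, 30, 35))
```

Other ranges can be tallied by changing the `low` and `high` arguments, for example `count_in_range(data, 20, 29)`.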
/snip/

_______________________________________________
Forum for Physics Educators
Phys-l@phys-l.org
http://www.phys-l.org/mailman/listinfo/phys-l