
Re: science for all?



Hi John-
This is frustrating, because I can't see that you are dealing with
the point that I am making. Would you mind stating your understanding of
my posting?
Regards,
Jack

On Wed, 2 Jan 2002, John Clement wrote:

On Mon, 31 Dec 2001, John Clement wrote:

Most of the educational literature effect sizes are less than 1.0, and a curriculum which achieves anything over 0.5 is usually considered to be very effective.

We're not concerned with "usually". We're doing straight mathematics. If the curve is taken to represent an estimate of a probability distribution, and the distribution is normal, then an effect size of 1.0 may be interpreted as a 30% chance that the mean did not change.

Absolutely not. The standard deviation being used is not the variance of an individual student's scores; it is the variation over the population. Each student is in a different state. The 30% chance would be correct if one compared the variation of a single student's scores with the change in score of that student. If one could give the same student the same test multiple times, one would come up with a much smaller variance than one sees for the whole ensemble of students. There is no way of measuring the variance for the score of one individual student, and it will be different from one student to the next.

That is exactly the point. You not only do not know the individual variances (meaning the variance of the distribution of this individual's scores), you don't know the distribution. In your spectroscopy line example, you know quite a bit about the width distribution of the line's illumination just from observing it; that example is therefore not analogous.


The variance in an individual student's scores is much lower than the SD of the scores of all students. Let us take two assumptions and do a little simple analysis assuming N students.

1. Assume each student has a score which has an error of one question, or +-1. When you average all the scores, the error in the average is 1/sqrt(N).
2. On the other hand, let us assume that all students have an error in the score of +-E questions. The error in the average would be E/sqrt(N).

The SD of the student scores is always larger than 1 and also larger than E. If it were equal to E, then each student would have the same score, and the only reason for variation would be random chance. This latter assumption is definitely not true. It would only be true when all the students are answering all questions randomly, or when all have identical understanding or lack of it.

The variance in the average generally quoted in most papers is SD/sqrt(N), which is actually larger than either of the estimates above. I would contend that the generally quoted variance in the mean is actually extremely conservative. The spectroscopy example, however, would have a variance in the mean of SD/sqrt(N).
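
To make the arithmetic concrete, here is a small sketch in Python with invented numbers (N, the population SD, and E are placeholders, not figures from any of the papers discussed). It shows that SD/sqrt(N) is larger than E/sqrt(N), and that a shift of the mean by one ensemble SD is many standard errors:

import numpy as np

rng = np.random.default_rng(0)

N = 128        # hypothetical class size
pop_sd = 5.0   # spread of true scores across students, in questions
E = 1.0        # error in a single student's measured score, in questions

# Each student has a "true" score; the measured score adds a small error.
true_scores = rng.normal(loc=15.0, scale=pop_sd, size=N)
measured = true_scores + rng.normal(scale=E, size=N)

sd = measured.std(ddof=1)        # ensemble SD, dominated by pop_sd, not E
sem_quoted = sd / np.sqrt(N)     # the SD/sqrt(N) figure usually quoted
sem_E = E / np.sqrt(N)           # error in the mean from measurement noise alone

print(f"ensemble SD = {sd:.2f}, SD/sqrt(N) = {sem_quoted:.2f}, E/sqrt(N) = {sem_E:.2f}")
# A shift of the mean by one ensemble SD (effect size 1.0) amounts to roughly
# sqrt(N) ~ 11 standard errors, so it is far from a 30% chance of no change.
print(f"1 SD shift in units of SD/sqrt(N): {pop_sd / sem_quoted:.1f}")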


Student scores are not clean data and are not easily compared using just one figure of merit. Student scores are determined by a large number of variables, so some of the simple strategies employed in comparing physics data are not appropriate. The situation in physics education research is more analogous to the early days of the physical sciences.

Not at all. The mathematics of statistical inference has also come a long way since those days. There are standard tests for estimating whether samples from unknown distributions come from the same distribution ("the null hypothesis" in your case). The trouble with the effect size measure is that there is a huge overlap of the two distributions when the effect size is 1, but you know nothing about the populations in the overlap region. As you have noted, when you look at individual student scores, pre- and post-test, some decrease and some gain, but the effect size test gives you no clue as to where any increase is coming from.
My "30%" comment was just intended to call attention to this overlap.
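
A hedged sketch (invented normal samples, not real test data) of the overlap being pointed to here, together with one of the standard distribution-free tests of the null hypothesis:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two hypothetical score distributions separated by one SD (effect size 1.0).
pre = rng.normal(loc=10.0, scale=4.0, size=128)
post = rng.normal(loc=14.0, scale=4.0, size=128)

# Heavy overlap: for normals one SD apart, a random post score beats a
# random pre score only about 76% of the time.
print(f"P(post > pre) ~ {(post[:, None] > pre[None, :]).mean():.2f}")

# Distribution-free test of the null hypothesis that both samples come
# from the same distribution; the p-value is tiny despite the overlap.
u_stat, p_value = stats.mannwhitneyu(post, pre, alternative="greater")
print(f"Mann-Whitney U p-value = {p_value:.2e}")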

But 30% is nonetheless not accurate. Now I think I see what the problem is.
The data is certainly statistically analyzed to see if the result is significant, but when doing this type of research that is not really enough. All you really need to ensure good accuracy is a large enough sample. Certainly it is true that the effect size does not tell you where the increase came from, but then no statistical test can tell you where the increase came from. It is only by comparing different experiments that one can see what variables have an effect on gain. The effect size is an attempt to quantify the amount of relative gain so that comparisons can be made. One must try to make judgements as to which treatments should be pursued. Often this means that pursuing one avenue may preclude another one. One idea behind effect size is probably that an educational improvement which is small compared to the variation between students is not likely to be significant.
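
For concreteness, a minimal sketch of how such an effect size is computed (Cohen's d with a pooled SD; the scores below are invented, and some papers normalize by the pre-test SD instead):

import numpy as np

def effect_size(pre, post):
    """Change in the mean divided by the pooled SD of the scores."""
    pre, post = np.asarray(pre, float), np.asarray(post, float)
    pooled_sd = np.sqrt((pre.var(ddof=1) + post.var(ddof=1)) / 2.0)
    return (post.mean() - pre.mean()) / pooled_sd

# Invented pre/post scores for ten students (out of, say, 30 questions).
pre = [12, 9, 15, 18, 7, 11, 14, 10, 16, 13]
post = [17, 12, 19, 22, 10, 15, 18, 13, 21, 16]
print(f"effect size ~ {effect_size(pre, post):.2f}")   # about 1.0 here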

The biggest criticism of education research lies not with the statistics. The big problem is with controlling other variables, selecting the things that you wish to test, and the nature of the tests. Oh yes, some papers in straight education do not give good statistical information, but that is not generally true of PER papers.

One thing that is very evident when one does pre- and post-testing is that there are some very large differences between tests. Tests such as the Lawson test very seldom show negative gain. Similarly, the FMCE and FCI have similar characteristics, but not as strongly as the Lawson test. However, standard content tests usually do not behave in this fashion; they indeed can show significant increases or decreases for an individual student.



The initial curve (Lawson test) in JCST (figure 2) essentially looks like a normal distribution. The final one is also similar, but moved over by about 1 SD. I am judging this by the curve. The result is that the number of students who would be classified as concrete is dramatically reduced. I have found for the Lawson test that when one looks at individual student scores they do not all move up, but rather each student moves a different amount, with some making dramatic gains, and others none at all. The curve in JCST unfortunately moves so far to the right that the right-hand tail is cut off by saturation on the test.

A test may be likened to a measurement in the lab. When the
needle pegs, the measurement is invalid.

But this is not a needle. One has the curve, and only part of the right-hand tail is missing. One can still deduce what the mean will be by analyzing the curve.

Only if you have great faith that the curve was not influenced by
the fact that the test was one that gave a cut-off tail.


Granted, that "faith" may enter into this for an unknown test. However the
Lawson test is very different from normal content tests in that it seldom
shows negative gain on an individual student, even after a long period of
time. In that respect it is more like an IQ test than a content test. One
can safely say that the mean showed a dramatic increase. Since the maximum
decrease one ever observes on an individual student is 2 points, and this
happens with reasonably low frequency, one can find the mean fairly
reliably. One other fact about the Lawson test is that students with high
scores have a very low frequency of missing the questions that cause
negative gain. Since the cutoff is at the high end, one can confidently
state that it has no influence on the low end and little influence at the
peak when one looks at the curve. Knowledge and experience with the test is
very important in truly understanding the results.
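
As an illustration of deducing the mean when the right-hand tail is cut off by saturation, here is a hedged sketch: scores at an assumed ceiling are treated as right-censored and the mean is recovered by maximum likelihood. The ceiling, means, and SDs are invented, not taken from the JCST figure:

import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(2)

max_score = 24   # hypothetical test ceiling
true_scores = rng.normal(loc=20.0, scale=4.0, size=200)
observed = np.minimum(true_scores, max_score)   # scores pile up at the ceiling

at_ceiling = observed >= max_score
x = observed[~at_ceiling]
n_cens = int(at_ceiling.sum())

def neg_log_like(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    # Uncensored scores contribute the normal density; ceiling scores
    # contribute the probability of lying at or above the ceiling.
    ll = stats.norm.logpdf(x, mu, sigma).sum()
    ll += n_cens * stats.norm.logsf(max_score, mu, sigma)
    return -ll

fit = optimize.minimize(neg_log_like, x0=[x.mean(), x.std()], method="Nelder-Mead")
mu_hat, sigma_hat = fit.x
print(f"naive mean (ignoring the ceiling) = {observed.mean():.2f}")
print(f"censored-fit mean                 = {mu_hat:.2f}")   # closer to the true 20.0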




As I recall, the original discussion centered on the fact that student test scores tended to fall dramatically after 2 weeks away from class. BTW, I looked up your response to the TPT article that shows that student scores on the FCI do not seem to decrease with time after a reformed physics course. You correctly said that the study only showed that, for the students tested, the scores did not fall. While this is true as far as it goes, it is unlikely that scores only stayed the same for the tested students.

I don't think that's quite what I said. As I recall, the thrust of my remarks was that the paper seemed to be dealing with a very biased sample.


After looking up your old posts, I see that you certainly questioned whether or not the sample was representative, but you did not present any evidence for bias. However, your questioning of the statistical accuracy for 128 students now reveals that you might have been making the same mistake about the statistics. I think you were assuming that because student scores spread, a change in the mean which is comparable to the SD of the scores is not significant. Often just the means are quoted, because the sample size is large enough that the mean will be very accurate. You also expressed skepticism about the FCI. I consider this last one to be the most reasonable concern. Enough good physicists have looked at the questions and agreed that they should be answerable by physics students that I, for one, am led to think that it is testing important ideas. Up to this point in time no other good evaluations of mechanics have been published other than the FCI or FMCE. Perhaps the critics of these tests could come up with some good alternatives?
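
A back-of-the-envelope check of why the mean of 128 scores is accurate (the SD here is a placeholder of my own choosing, not a figure from the TPT paper):

import math

N = 128
sd = 20.0    # hypothetical SD of FCI percentage scores
print(f"standard error of the mean ~ {sd / math.sqrt(N):.1f} percentage points")  # ~1.8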

John M. Clement
Houston, TX


--
"But as much as I love and respect you, I will beat you and I will kill
you, because that is what I must do. Tonight it is only you and me, fish.
It is your strength against my intelligence. It is a veritable potpourri
of metaphor, every nuance of which is fraught with meaning."
Greg Nagan from "The Old Man and the Sea" in
<The 5-MINUTE ILIAD and Other Classics>