On Mon, 31 Dec 2001, John Clement wrote:
Most of the educational literature effect sizes are less than 1.0, and a
curriculum which achieves anything over 0.5 is usually considered to be very
effective.

We're not concerned with "usually". We're doing straight
mathematics. If the curve is taken to represent an estimate of a
probability distribution, and the distribution is normal, then an effect
size of 1.0 may be interpreted as a 30% chance that the mean did not
change.
Absolutely not. The standard deviation being used is not that of the
individual student's scores; it is the variation over the population. Each
student is in a different state. The 30% chance would be correct if one
compared the variation of a single student's scores with the change in score
of that student. If one could give the same student the same test multiple
times, one would come up with a much smaller variance than one sees for the
whole ensemble of students. There is no way of measuring the variance for
the score of 1 individual student, and it will be different from one student
to the next.
That is exactly the point. Not only do you not know the
individual variances (meaning the variance of the distribution of this
individual's scores), you don't know the distribution. In your
spectroscopy line example, you know quite a bit about the
width distribution of the line's illumination just from observing it;
that example is therefore not analogous.
The variance in an individual student's scores is much lower than the SD of
the scores of all students. Let us make two assumptions and do a little
simple analysis, assuming N students.
1. Assume each student has a score which has an error of one question, or
+-1. When you average all the scores, the error in the average is 1/sqrt(N).
2. On the other hand, let us assume that all students have an error in the
score of +-E questions. The error in the average would be E/sqrt(N).
The SD of the student scores is always larger than 1 and also larger than E.
If it were equal to E then each student would have the same score, and the
only reason for variation would be random chance. This latter assumption is
definitely not true. It will only be true when all the students are
answering all questions randomly, or when all have identical understanding
or lack of it.
The variance in the average quoted in most papers is generally
SD/sqrt(N), which is actually larger than either of the estimates above. I
would contend that the generally quoted variance in the mean is actually
extremely conservative. The spectroscopy example however would have a
variance in the mean of SD/sqrt(N).
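The two error estimates above, and the conservative SD/sqrt(N) quote, can be
checked with a small simulation. This is only a sketch with invented numbers
(N, E, and the ensemble SD are assumptions for illustration, not data from
any paper):

```python
import math
import random

random.seed(0)

# Invented numbers, purely for illustration: N students whose true scores
# are spread out (ensemble SD of about 4 questions, much larger than the
# per-student measurement error E of about +-1 question).
N = 100
E = 1.0
ENSEMBLE_SD = 4.0
true_scores = [random.gauss(15.0, ENSEMBLE_SD) for _ in range(N)]

def observed_mean():
    """One administration of the test: each student's score jitters by ~E."""
    return sum(s + random.gauss(0.0, E) for s in true_scores) / N

# Re-administer the hypothetical test many times and watch the mean scatter.
trials = [observed_mean() for _ in range(2000)]
m = sum(trials) / len(trials)
spread = math.sqrt(sum((t - m) ** 2 for t in trials) / len(trials))

print(f"scatter of the measured mean : {spread:.3f}")  # ~ E/sqrt(N) = 0.1
print(f"conservative SD/sqrt(N) quote: {ENSEMBLE_SD / math.sqrt(N):.3f}")
```

The scatter of the measured mean comes out near E/sqrt(N) = 0.1, while the
SD/sqrt(N) figure quoted in papers would be 0.4 here, four times larger, which
is the sense in which the quoted uncertainty is conservative.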
Student scores are not clean data, and are not easily compared using just 1
figure of merit. Student scores are determined by a large number of
variables, so some of the simple strategies employed in comparing physics
data are not appropriate. The situation in physics education research is
more analogous to the early days of the physical sciences.
Not at all. The mathematics of statistical inference has also
come a long way since those days. There are standard tests for estimating
whether samples from unknown distributions come from the same
distribution ("the null hypothesis" in your case). The trouble with the
effect size measure is that there is a huge overlap of the two
distributions when the effect size is 1, but you know nothing about the
populations in the overlap region. As you have noted, when you look at
individual student scores, pre- and post-test, some decrease and some
gain, but the effect size test gives you no clue as to where any increase
is coming from.
My "30%" comment was just intended to call attention to this
overlap.
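The overlap at an effect size of 1 can be put in numbers with standard
normal-CDF algebra. This sketch computes two related quantities; it does not
re-derive the quoted 30% figure, whose exact definition the posts leave open:

```python
import math

def phi(x: float) -> float:
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

d = 1.0  # effect size: means one (common) SD apart

# Two unit-variance normals whose means differ by d cross halfway between
# the means, so the area shared by both curves is 2 * phi(-d/2).
overlap = 2.0 * phi(-d / 2.0)

# Chance that a single post-test score still falls below the pre-test mean.
below_pre_mean = phi(-d)

print(f"overlap of the two curves     : {overlap:.3f}")        # 0.617
print(f"P(post score < pre-test mean) : {below_pre_mean:.3f}") # 0.159
```

So even with an effect size of 1, roughly 62% of the area of the two
distributions is shared, which is the overlap being pointed to above.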
But 30% is nonetheless not accurate. Now I think I see what the problem is.
The data is certainly statistically analyzed to see if the result is
significant, but when doing this type of research that is not really enough.
All you really need to ensure good accuracy is a large enough sample.
Certainly it is true that the effect size does not tell you where the
increase came from, but then no statistical test can tell you where the
increase came from. It is only by comparing different experiments that one
can see what variables have an effect on gain. The effect size is an
attempt to quantify the amount of relative gain so that comparisons can be
made. One must try to make judgements as to which treatments must be
pursued. Often this means that pursuing one avenue may preclude another
one. One idea behind effect size is probably that an educational
improvement which is small compared to the variation between students is not
likely to be significant.
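For readers unfamiliar with the measure, here is a minimal sketch of one
common convention for effect size (mean gain divided by the pre-test SD; the
posts do not pin down which SD is meant, and some papers pool the pre and
post SDs instead). The scores below are invented for illustration:

```python
import math

def effect_size(pre, post):
    """Effect size as (post mean - pre mean) / sample SD of the pre scores.
    One common convention; some papers pool the pre and post SDs instead."""
    n = len(pre)
    mean_pre = sum(pre) / n
    mean_post = sum(post) / len(post)
    sd_pre = math.sqrt(sum((x - mean_pre) ** 2 for x in pre) / (n - 1))
    return (mean_post - mean_pre) / sd_pre

# Invented pre/post scores, purely for illustration:
pre = [8, 10, 12, 9, 11, 13, 10, 7, 12, 8]
post = [12, 13, 15, 11, 16, 17, 13, 10, 18, 12]
print(f"effect size: {effect_size(pre, post):.2f}")  # 1.85
```

With these invented numbers the pre-test SD is 2 questions and the mean gain
is 3.7 questions, so the gain is measured as 1.85 student-to-student SDs, the
kind of relative gain the measure is meant to quantify.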
The biggest criticism of education research lies not with the statistics.
The big problem is with controlling other variables, and selecting the
things that you wish to test, and the nature of the tests. Oh yes, some
papers in straight education do not give good statistical information, but
that is not generally true of PER papers.
One thing that is very evident when one does pre- and post-testing is that
there are some very large differences between tests. Tests such as the
Lawson test very seldom show negative gain. Similarly the FMCE and FCI also
have similar characteristics, but not as strongly as the Lawson test.
However standard content tests usually do not behave in this fashion, they
indeed can show significant increases or decreases for an individual
student.
The initial curve (Lawson test) in JCST (figure 2) essentially looks like a
normal distribution. The final one is also similar, but moved over by about
1 SD. I am judging this by the curve. The result is that the number of
students who would be classified as concrete is dramatically reduced. I
have found for the Lawson test that when one looks at individual student
scores they do not all move up, but rather each student moves a different
amount, with some making dramatic gains, and others none at all. The curve
in JCST unfortunately moves so far to the right that the right-hand tail is
cut off by saturation on the test.
A test may be likened to a measurement in the lab. When the
needle pegs, the measurement is invalid.
But this is not a needle. One has the curve, and only part of
the right-hand tail is missing. One can still deduce what the mean will be
by analyzing the curve.
Only if you have great faith that the curve was not influenced by
the fact that the test was one that gave a cut-off tail.
Granted, that "faith" may enter into this for an unknown test. However the
Lawson test is very different from normal content tests in that it seldom
shows negative gain on an individual student, even after a long period of
time. In that respect it is more like an IQ test than a content test. One
can safely say that the mean showed a dramatic increase. Since the maximum
decrease one ever observes on an individual student is 2 points, and this
happens with reasonably low frequency, one can find the mean fairly
reliably. One other fact about the Lawson test is that students with high
scores have a very low frequency of missing the questions that cause
negative gain. Since the cutoff is at the high end, one can confidently
state that it has no influence on the low end and little influence at the
peak when one looks at the curve. Knowledge and experience with the test is
very important in truly understanding the results.
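The "pegged needle" concern can be made concrete: when a roughly normal
distribution runs into the test's maximum score, the raw mean of the clipped
scores under-reports the underlying mean, which is why the rest of the
curve's shape has to carry the inference. A hedged sketch with invented
numbers (the ceiling and the distribution are assumptions, not Lawson data):

```python
import random

random.seed(1)

# Hypothetical sketch: post-test scores roughly normal, but the test
# saturates ("the needle pegs") at an assumed ceiling of 24 points,
# clipping the right-hand tail. None of these numbers come from the
# actual Lawson-test data.
MAX_SCORE = 24.0
true = [random.gauss(21.0, 4.0) for _ in range(10_000)]
observed = [min(s, MAX_SCORE) for s in true]

true_mean = sum(true) / len(true)
clipped_mean = sum(observed) / len(observed)

print(f"mean of the underlying scores: {true_mean:.2f}")
print(f"mean of the clipped scores   : {clipped_mean:.2f}")  # biased low
```

The clipped mean comes out about half a point low here, so whether the
saturated curve can be trusted depends, as argued above, on how much of the
distribution the cutoff actually touches.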
As I recall, the original discussion centered on the fact that student test
scores tended to fall dramatically after 2 weeks away from class. BTW I
looked up your response to the TPT article that shows that student scores on
the FCI do not seem to decrease with time after a reformed physics course.
You correctly said that the study only showed that for the students tested,
the scores did not fall. While this is true as far as it goes, it is
unlikely that the scores stayed the same only for the tested students.
I don't think that's quite what I said. As I recall, the thrust
of my remarks was that the paper seemed to be dealing with a very biased
sample.
After looking up your old posts you certainly questioned whether or not the
sample was representative, but you did not present any evidence for bias.
However your questioning of the statistical accuracy on 128 students now
reveals that you might have been making the same mistake about the
statistics. I think you were assuming that, because student scores are
spread out, a change in the mean which is comparable to the SD of the scores
is not significant. Often just the means are quoted, because the sample
size is large enough that the mean will be very accurate. You also
expressed skepticism about the FCI. I consider this last one to be the most
reasonable concern. Enough good physicists have looked at the questions and
agreed that they should be answerable by physics students that I for one am
led to think that it is testing important ideas. Up to this point in time
no other good evaluations of mechanics have been published other than the
FCI or FMCE. Perhaps the critics of these tests could come up with some
good alternatives?
John M. Clement
Houston, TX