
Re: science for all?



Most of the educational literature effect sizes are less than 1.0, and a
curriculum which achieves anything over 0.5 is usually considered to be
very effective.

We're not concerned with "usually". We're doing straight mathematics. If
the curve is taken to represent an estimate of a probability distribution,
and the distribution is normal, then an effect size of 1.0 may be
interpreted as a 30% chance that the mean did not change.

Absolutely not. The standard deviation being used does not measure the
variation of an individual student's scores; it measures the variation over
the population. Each student is in a different state. The 30% chance would
be correct if one compared the variation of a single student's scores with
the change in score of that student. If one could give the same student the
same test multiple times, one would come up with a much smaller variance
than one sees for the whole ensemble of students. There is no way of
measuring the variance for the score of one individual student, and it will
be different from one student to the next. An analogous situation would be
observing the change of the mean of a spectroscopy line. One can observe
changes which are fractions of the spread of the line and use this
information reliably. The width of the line does not by itself determine
the reliability of the mean value of the line; that is also determined by
the number of samples and the precision of the apparatus used to measure
the line.
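
A minimal sketch in Python, with invented numbers, of the distinction being
drawn here: the effect size divides the change in the class mean by the
spread across students, which is generally much larger than the spread one
would see if a single student could somehow be retested many times.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical ensemble: 200 students with widely spread "true" abilities,
    # plus a small assumed test-retest noise for any individual student.
    n_students = 200
    true_ability = rng.normal(50, 15, n_students)   # student-to-student spread
    retest_noise = 4.0                              # within-student spread (assumed)

    pre = true_ability + rng.normal(0, retest_noise, n_students)
    post = true_ability + 10 + rng.normal(0, retest_noise, n_students)  # assumed gain

    ensemble_sd = pre.std(ddof=1)   # the SD that appears in the effect size
    effect_size = (post.mean() - pre.mean()) / ensemble_sd

    # One student retested many times shows a much smaller spread.
    one_student = true_ability[0] + rng.normal(0, retest_noise, 50)

    print(f"ensemble SD        = {ensemble_sd:.1f}")
    print(f"within-student SD  = {one_student.std(ddof=1):.1f}")
    print(f"effect size        = {effect_size:.2f}")

Because the effect size is normalized by the ensemble SD, it says nothing by
itself about the probability that the mean did not change; that also depends
on the number of students.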


Normally this would be used to compare the effects of 2 different
curricula, and the effect size would be calculated for the difference
between the curricula. Obviously the effect size is not a valid comparison
tool when the students come in with a statistically zero score, or a score
that could be produced by random guessing.

How does a "statistically zero score" differ from a "zero score"?

Statistically zero would be a score which is due entirely to random
guessing. For example, on a 4-choice multiple-choice test this would be
around a 25% score. Of course statistically zero can be converted to a
score which represents knowledge by deducting about 1/3 point for each
wrong answer on a 4-answer MC test, or 1/4 point on a 5-answer MC test.
This sort of analysis is routinely done by the testing services.
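
A small sketch of that standard correction for guessing (the scores are
illustrative, not from any actual test):

    def corrected_score(n_right, n_wrong, n_choices):
        """Deduct 1/(k - 1) point per wrong answer on a k-choice MC test,
        so that pure random guessing averages out to zero."""
        return n_right - n_wrong / (n_choices - 1)

    # Pure guessing on a 30-question, 4-choice test averages 7.5 right and
    # 22.5 wrong -- a raw score of 25%, i.e. "statistically zero".
    print(corrected_score(7.5, 22.5, 4))   # 0.0 after correction
    print(corrected_score(24, 6, 4))       # 22.0, a knowledge-representing score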


Most teachers and education researchers should be familiar with effect
size, so it will convey meaning. Whether or not this is the best way to
compare curricula can certainly be questioned, but it is currently the
method often used.

As one famous logician put it, the answer to "You don't clean a watch with
butter" is "But it was the best of butter". Isn't that what you're really
saying?

Student scores are not clean data, and they are not easily compared using
just one figure of merit. Student scores are determined by a large number
of variables, so some of the simple strategies employed in comparing
physics data are not appropriate. The situation in physics education
research is more analogous to the early days of the physical sciences. As
long as experiments are properly controlled, one can compare data and draw
conclusions based on these comparisons. For example, when you have matched
populations of students, the initial states will be similar and the SD and
pretest scores will be the same within statistical accuracy. At that point
comparing gains is a valid thing to do. Experience has shown that the
difference in the mean between the experimental and control group is
generally less than the SD of the scores, and that this comparison is a
reasonable way to present the results. Of course papers will also present
the curves, as well as the calculated means...
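
A sketch, with invented scores, of the matched-population comparison just
described: check that the pretest means and SDs agree within statistics,
then compare the gains and express the difference as an effect size.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Invented pre/post scores (percent) for matched experimental and
    # control sections.
    pre_exp = rng.normal(45, 15, 120)
    post_exp = pre_exp + rng.normal(20, 10, 120)   # assumed larger gain
    pre_ctl = rng.normal(45, 15, 110)
    post_ctl = pre_ctl + rng.normal(8, 10, 110)    # assumed smaller gain

    # 1. Matched populations: pretest means and SDs agree within statistics.
    print("pretest means:", pre_exp.mean(), pre_ctl.mean())
    print("pretest SDs:  ", pre_exp.std(ddof=1), pre_ctl.std(ddof=1))

    # 2. Compare gains and express the difference as an effect size.
    gain_exp = post_exp - pre_exp
    gain_ctl = post_ctl - pre_ctl
    pooled_sd = np.sqrt((gain_exp.var(ddof=1) + gain_ctl.var(ddof=1)) / 2)
    effect_size = (gain_exp.mean() - gain_ctl.mean()) / pooled_sd
    t, p = stats.ttest_ind(gain_exp, gain_ctl, equal_var=False)

    print(f"effect size for the difference in gains = {effect_size:.2f}")
    print(f"Welch t = {t:.1f}, p = {p:.2g}")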


The initial curve (Lawson test) in JCST (figure 2) essentially looks like a
normal distribution. The final one is also similar, but moved over by about
1 SD. I am judging this by the curve. The result is that the number of
students who would be classified as concrete is dramatically reduced. I
have found for the Lawson test that when one looks at individual student
scores they do not all move up, but rather each student moves a different
amount, with some making dramatic gains, and others none at all. The curve
in JCST unfortunately moves so far to the right that the right-hand tail is
cut off by saturation on the test.

A test may be likened to a measurement in the lab. When the
needle pegs, the measurement is invalid.

But this is not a needle. One still has the curve, and only part of the
right-hand tail is missing. One can still deduce what the mean will be by
analyzing the curve.


One must also look at the error on the mean to see if the gain comparisons
are significant. If one has a fairly large number of students the error on
the mean will not be significant.

The error on the mean is defined by the curve, not by the number of
students. You seem to be mixing two very different concepts in the last
statement.

No, the error on the mean is the standard deviation of the mean (the
standard error). This is significantly smaller than the SD of the
distribution of student scores, and is the SD of the curve divided by
sqrt(N-1), where N is the number of students.
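
A quick numerical check of that formula (the score list is made up); note
that the more common convention divides by sqrt(N), but for a large class
the difference is negligible:

    import numpy as np

    scores = np.array([35, 42, 50, 55, 58, 61, 63, 67, 70, 74, 78, 85],
                      dtype=float)
    N = len(scores)

    sd = scores.std(ddof=1)     # spread of the distribution of student scores
    sem = sd / np.sqrt(N - 1)   # error on the mean, per the formula above

    print(f"mean = {scores.mean():.1f}, SD = {sd:.1f}, "
          f"error on the mean = {sem:.1f}")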



I am a bit puzzled about how you would ask for the probability that they
came from the same unknown distribution. Student test scores generally rise
rather than fall after instruction.

This is called "testing the null hypothesis". You believe that the test
scores rose, but let's ask for the probability that they did not.
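
A minimal sketch of such a null-hypothesis test on paired pre/post scores
(the data are invented): the null hypothesis is that the mean gain is zero,
and the one-sided p-value is the probability of seeing a gain at least this
large if the scores did not really rise.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)

    # Invented paired pre/post scores for the same 40 students.
    pre = rng.normal(48, 14, 40)
    post = pre + rng.normal(12, 9, 40)

    gain = post - pre
    t, p_two_sided = stats.ttest_rel(post, pre)
    p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2

    print(f"mean gain = {gain.mean():.1f}")
    print(f"t = {t:.2f}, one-sided p (scores did not rise) = {p_one_sided:.2g}")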

My impression is that you did not understand my previous posting,
which was about mathematical statistics. Let's just stay with that topic
if you wish to continue this thread; we can omit references to the sins
of other practitioners.

One of the problems with both physics education research and regular
education research is that experimenters must deal with data for which
there may not be a precise mathematical understanding. One must then come
up with measurements which make sense and can be used to conduct the
research. These measurements are often arrived at by analogy or are
determined by experience. As research is conducted, better methods will
emerge. I would submit that the real problem is that people in one field of
study often do not see the problems in other fields, and as a result tend
not to have respect for these other fields. The only way to find out what
is actually going on is to do extensive reading in the other fields and
then form some conclusions about the problems.

As I recall, the original discussion centered on the fact that student test
scores tended to fall dramatically after 2 weeks away from class. BTW, I
looked up your response to the TPT article that shows that student scores
on the FCI do not seem to decrease with time after a reformed physics
course. You correctly said that the study only showed that, for the
students tested, the scores did not fall. While this is true as far as it
goes, it is unlikely that the scores stayed the same only for the tested
students. The students were recruited randomly by being offered a bribe,
and they knew there was no penalty for taking the test. I know that when I
was a student I would probably have spent 30 min willingly taking a
30-question MC test for a bribe of $10, no matter what the subject of the
test was. As such, this is reasonable evidence for little drop in scores. I
would point out that very accurate sampling is done in this sort of way, by
sampling only part of the entire population.

Thanks for Brian's post, which came in as I was writing this. Please note
that "effect size" is not my terminology; it is used in much of the
standard education research literature.

John M. Clement
Houston, TX