
Re: Is the FCI valid? (LONG!)



I should like to discuss, in order, the points made by Jack Uretsky
in his two Phys-L posts: (a) of 1/7/2000 "Re: data on typical FCI
scores," and (b) of 1/8/2000 "Re: Is the FCI valid?"

A. Jack writes: "The FCI, as I understand it, was never validated by
any acceptable procedure ......."

As pointed out by Ben Crowell, the meaning of the above phrase
depends on Jack's meaning of "validity." In a later post (in
response to Crowell) Jack writes that he is using the term "validity"
more or less in Ben's sense of "correlational validity" (sometimes
called "predictive validity')(1,2): "I use the word here, more or
less, in your second sense - how does an FCI grade correlate with
'successful teaching.'"

WITH THIS DEFINITION OF "VALIDITY" Jack's statement is, I think,
CORRECT. But so what? What procedure could possibly be used to
establish a correlation with "teaching effectiveness"? What
criterion of "successful teaching" would be employed: student
evaluations; peer review; portfolios; teaching awards; alumni
surveys; reputation; average salaries, wealth, or happiness of
ex-students; final exam grades; grades on the MCAT or AP
exams? ... All of these, especially student evaluations (3), are
fraught with difficulty.

For such reasons, the FCI (4) and similar tests such as the Mechanics
Diagnostic (MD) (5) and TUG-K (2) are designed to display high "face"
and "content" validity, not "predictive" validity. For the FCI and
MD tests, the content validity is high if the test measures the
degree to which the students' "qualitative understanding of
mechanics" (5) is Newtonian.

Jack's phrase "The FCI, as I understand it, was never validated by
any acceptable procedure ......." is INCORRECT IF "validity" means
"content validity." To quote from ref. 5:

"The face and content validity of the mechanics test was established
in four different ways: First, early versions of the test were
examined by a number of physics professors and graduate students, and
their suggestions were incorporated into the final version. Second,
the test was administered to 11 graduate students, and it was
determined that they all agreed on the correct answer to each
question. Third, interviews of 22 introductory physics students who
had taken the test showed that they understood the questions and the
allowed alternative answers. Fourth, the answers of 31 students who
received A grades in University Physics......(calculus based) were
carefully scrutinized for evidence of common misunderstandings which
might be attributed to the formulation of the question. None was
found."

As regards the FCI, Hestenes et al.(4b) write: "The FCI was designed
to improve on the Mechanics Diagnostic test(5). The results
originally obtained with the Diagnostic have since been replicated
many times by others, so we have great confidence in the reliability
of the test and the conclusions drawn from the data. Further
confirmation comes from the Inventory results in Table III. Indeed
the percentage scores on both tests seem to be quite comparable
measures of Newtonian conceptual understanding."

Since the "reliability" of the FCI has not been questioned on Phys-L,
I shall not discuss it, except to point out that Kuder-Richardson
KR-20 coefficients (1,2) for the FCI and MD, as measured at Arizona
State (5) and Indiana University (6b and references therein), have
been consistently above 0.80, indicating relatively high reliability.
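
For readers unfamiliar with KR-20: it is an internal-consistency
estimate of reliability for tests scored right/wrong, KR-20 =
[k/(k-1)][1 - (sum of item variances p*q)/(variance of total
scores)]. Below is a minimal sketch of that formula in Python; the
function and variable names are my own illustration, not code from
any of the cited papers:

   import numpy as np

   def kr20(responses):
       # responses: 2-D 0/1 array, rows = students, columns = items.
       # Assumes dichotomous scoring and nonzero total-score variance.
       responses = np.asarray(responses, dtype=float)
       k = responses.shape[1]                   # number of test items
       p = responses.mean(axis=0)               # fraction correct per item
       sum_pq = (p * (1.0 - p)).sum()           # sum of item variances p*q
       total_var = responses.sum(axis=1).var()  # variance of total scores
       return (k / (k - 1.0)) * (1.0 - sum_pq / total_var)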

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
B. Jack writes: "The FCI, as I understand it, was never validated by
any acceptable procedure so the notion that a 70% FCI score is 70% of
what's acceptable is unsupportable."

As shown above, if "validity" means the extent to which an "FCI grade
correlates with 'successful teaching,'" then the first part of the
sentence is correct but irrelevant. Regarding the second part of the
sentence, neither I nor anyone else that I know of ever advanced
"the notion that a 70% FCI score is 70% of what's acceptable." The
<g> = 0.70 barrier does NOT mean a barrier of 70% on the FCI
posttest. It means instead that the average actual gain (<%post> -
<%pre>) is (for the courses of my survey) at most 70% of the maximum
possible gain (100% - <%pre>). And even with this correct usage, I
am NOT maintaining that a 70% normalized gain is "70% of what's
acceptable." I'm only giving my opinion that "even the most
effective Interactive Engagement courses of my survey fall far below
their potential."
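
To make the definition concrete, here is a minimal normalized-gain
sketch in Python (the class averages below are my own illustration,
not data from ref. 6):

   def normalized_gain(pre_pct, post_pct):
       # <g> = (<%post> - <%pre>) / (100% - <%pre>)
       return (post_pct - pre_pct) / (100.0 - pre_pct)

   # A class averaging 40% on the pretest and 82% on the posttest:
   # <g> = (82 - 40) / (100 - 40) = 0.70, i.e., it sits at the
   # 0.70 "barrier" even though its posttest average is 82%, not 70%.
   print(normalized_gain(40.0, 82.0))  # -> 0.7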

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

C. Jack writes: "I am addressing the so-called "70% barrier". Dick
Hake treats this barrier as though it signifies that students are
only achieving 70% (roughly) of what they would achieve as a result
of an ideal course ..... Hake's data provides the only correlation
that we have available. We may not, therefore, draw any conclusion
about teaching from the fact that scores top out at 70%."

I do not think that my data provide any "correlation," e.g., any
correlation of FCI scores with "successful teaching." Rather, some
would simply DEFINE at least one facet of "successful teaching" in
terms of high <g>. As I indicated in my 1/5/2000 post "Re: data on
typical FCI scores," what my survey (6) shows is that:

(a) Traditional (T) courses (passive-student lectures, recipe labs,
and algorithmic-problem exams) fail to convey much conceptual
understanding to the average student, yielding an average <g> for 14
courses (N = 2048) of 0.23 ± 0.04 SD, where SD stands for standard
deviation.

(b) Interactive-engagement (IE) courses [use of methods designed to
involve students in heads-on (always) and hands-on (usually)
activities that yield immediate feedback through discussion with
peers and/or instructors] can be much more effective than T courses
in enhancing conceptual understanding. For the 48 IE courses of the
survey (N = 4458), the average <g> = 0.48 ± 0.14 SD. This is almost
two SD's above that of the T courses (see the arithmetic sketched
just after this list), reminiscent of differences seen in comparing
instruction delivered to students in large groups with one-on-one
instruction.

(c) Current IE methods need to be improved, since none of the IE
courses achieves <g> greater than 0.69.
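
The "almost two SD's" remark in "b" is just the following arithmetic,
sketched in Python with the survey averages quoted above:

   t_gain, t_sd = 0.23, 0.04    # T courses: <g> = 0.23 ± 0.04 SD
   ie_gain, ie_sd = 0.48, 0.14  # IE courses: <g> = 0.48 ± 0.14 SD

   # Difference in units of the IE courses' standard deviation:
   print((ie_gain - t_gain) / ie_sd)  # -> ~1.8, "almost two SD's"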

Despite Jack's objections, I (1) stand by my previous 1/6/2000 Phys-L
post "Re: data on typical FCI scores," which argues the case for "c,"
and (2) think that Jack's statement "We may not, therefore, draw any
conclusion about teaching from the fact that
.......(NORMALIZED!!)..... scores top out at 70%" is INCORRECT. To
the contrary, one can, in fact, draw the conclusion that Interactive
Engagement courses need improvement.

BTW,
If you wish to respond to this very long posting, PLEASE, out of
courtesy to other list subscribers, avoid the finger-jerk reaction of
hitting the reply button!!(7)


Richard Hake, Emeritus Professor of Physics, Indiana University
24245 Hatteras Street, Woodland Hills, CA 91367
<rrhake@earthlink.net>
<http://www.physics.indiana.edu/~hake>
<http://www.physics.indiana.edu/~sdi>
<http://www.physics.indiana.edu/~redcube>


"Genius is the ability to make all possible mistakes in the shortest
possible time." John Wheeler (possibly derived from Niels Bohr)


REFERENCES
1. For a good discussion of the psychometric definition of the term
"validity," see, e.g., R.E. Slavin, "Research Methods in Education"
(Allyn and Bacon, 2nd ed., 1992), esp. Chap. 6, "Measures: Reliability
and Validity," pp. 75-80. Slavin distinguishes five types of validity:

a. "Face": "the items in the test should look as though they measure
what they are supposed to measure."

b. "Content": "the degree to which the content of a test measures
some objective criterion, such as the content of a course or
textbook, the skills required to do a certain job, or knowledge
deemed to be important for some purpose."

c. "Predictive": "the degree to which scores on a scale or test
predict later behavior (or other scores)...... it can be measured by
means of a correlation coefficient between individual's scores on the
scale and their later behavior."

d. "Concurrent": "the correlation between scores on a scale and
scores on another scale or measure of established validity given at
about the same time."

e. "Construct": "the degree to which scores on a scale have a
pattern of correlations with other scores or attributes that would be
predicted by a well-established theory."
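
As an aside on "c" and "d": both rest on a correlation coefficient
between paired scores. A minimal Pearson-r sketch in Python (my own
illustration, not from Slavin), assuming equal-length score lists
with nonzero variance:

   def pearson_r(x, y):
       # Pearson correlation between paired score lists x and y.
       n = len(x)
       mx, my = sum(x) / n, sum(y) / n
       cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
       vx = sum((a - mx) ** 2 for a in x)
       vy = sum((b - my) ** 2 for b in y)
       return cov / (vx * vy) ** 0.5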

2. For a good introduction to the terms "validity" and "reliability,"
and the arduous work required to produce "TUG-K," a test of high
validity and reliability, see R.J. Beichner, "Testing student
interpretation of kinematics graphs," Am. J. Phys. 62(8), 750-762
(1994).

3. See, e.g., (a) my Phys-L post "Re: Are student evaluations useful?"
of 11/24/97. (b) For a 1998 discussion-list thread on the contentious
topic of student evaluations as measures of student learning: (1)
bring up the AERA-D archives at
<http://lists.asu.edu/archives/aera-d.html>, (2) click on "Search the
archives," and then (3) type "teacher evals:" into the "subject" slot
(not the "search for" slot) to obtain 18 hits (including a more
readable form of "3a"). To see student-evaluation champion Lawrence
Roche's rebuttals of "3a," add to "(1)"-"(3)" above a step "(4)":
type "Roche" into the author slot. For yet more AERA-D posts by
Roche, simply type "Roche" in the author slot (with all other slots
blank) to obtain 44 hits as of 1/8/2000. In his latest post of
10/15/99, Roche steadfastly maintains, despite evidence to the
contrary from physics-education research, that "student ratings seem
to be the MOST valid tool available ...... (to measure teaching
effectiveness)." (His CAPS)

4. (a) I. Halloun, R.R. Hake, E.P. Mosca, and D. Hestenes, "Force
Concept Inventory" (revised, 1995), password protected at
<http://modeling.la.asu.edu/modeling.html>; (b) D. Hestenes, M.
Wells, and G. Swackhamer, "Force Concept Inventory,"
Phys. Teach. 30, 141-158 (1992). The FCI is very similar to the
earlier Mechanics Diagnostic test (ref. 5), and pre/post results
using the former are very similar to those using the latter.

5. I. Halloun and D. Hestenes, "The initial knowledge state of
college physics students," Am. J. Phys. 53, 1043-1055 (1985) -
contains the Mechanics Diagnostic test; "Common sense concepts about
motion," ibid., 1056-1065; on the web at
<http://physics.indiana.edu/~sdi/>.

6. R.R. Hake, (a) "Interactive-engagement vs traditional methods: A
six-thousand-student survey of mechanics test data for introductory
physics courses," Am. J. Phys. 66, 64-74 (1998); (b)
"Interactive-engagement methods in introductory mechanics courses,"
submitted on 6/19/98 to the "Physics Education Research Supplement to
AJP" (PERS); both on the web at <http://physics.indiana.edu/~sdi/>.

7. Why distribute yet again the replied-to post and clutter
everyone's hard drive? Some sage Netiquette advice is given on the
Phys-L homepage at <http://purcell.phy.nau.edu/phys-l/#etiquette>:

"Quote Sparingly: avoid excessively large replies created by quoting
complete original messages (a real problem for DIGEST-mode readers).
Instead, select and keep only appropriate quoted text, indicating
what is quoted and what is not from the original message (many email
programs will do this automatically) and manually pruning out
irrelevant sections. Indicate deletions. Leave enough original
material so you are not enigmatic."