Testing and Grading
Stanford C. Ericksen
Stanford C. Ericksen was appointed the founding director of
CRLT in 1962 and for the next 20 years wrote and edited the Memo to the Faculty
series published by CRLT. He is now retired and living in Florida where he produces
a similar publication entitled Update On Teaching for the University of Florida.
The present report has been adapted from one of these recent publications.
Fair play is the first and final requirement in matters of testing
and grading. Students will accept pressures for hard work but object strenuously,
and rightly so, to signs of unfairness in a teacher's assessment of their efforts.
Being an expert in an area of subject matter and having the speaking skills
required for teaching are quite different dimensions of professional competence
than are the abilities to construct discriminating examinations and to assign
valid grades. Improvement on the part of instructors in the areas of testing
and grading is nearly always in order.
An important distinction must be made between evaluation and grading. Evaluation
is information provided to the student about particular aspects of what was
said or done during the effort to learn, to solve a problem, or to organize
and integrate facts and concepts. As they move into unknown intellectual territory,
students must have guideposts to confirm that they are moving in the right direction.
The qualitative comments about particular aspects of a term paper are far more
constructive aids for the specifics of learning and remembering than is the
grade on the cover page. Evaluation, therefore, is indispensable to students
for gaining understanding and for fixing what is learned in memory. A grade, on
the other hand, is a gross index which typically comes too late for the student
to take corrective measures about the specifics of learning.
A few guidelines about testing and grading can help instructors to: (1) strengthen
the process of instruction, (2) clarify the diagnostic value of testing, (3)
make a fair assessment of what each student knows, and (4) report this achievement
through grades.
Testing as a Tool for Instruction
Students tend to concentrate their study effort in preparation for an exam,
and they structure this effort in anticipation of the nature of the questions
they will be asked. If students anticipate the need to know unassimilated facts,
they will concentrate on memorizing information; if they expect to be asked
to integrate, extend, and evaluate information, they will try to prepare themselves
along those lines. The management of testing is an opportunity for the instructor
to underline the essential elements making up the course.
As a matter of fact, a program for the orientation and training of beginning
college teachers could well be geared to the interdependence among: the objectives
of a course, the sequence of topics (and their classroom presentation), and
the manner in which this can be assessed by means of tests, papers, and special
projects. I recall a science professor whose overriding goal was "to teach
students to think like a scientist thinks" but whose tests were almost
solely measures of how well students memorized. He changed his exams to emphasize
integration of material, and everyone felt better about the course.
The Diagnostic Use of Tests
Placement testing is commonly used at the department and college level, but
within our own courses we can also make effective use of similar testing for
making a grade-free diagnostic appraisal of what information is already known
by the students or is not known but should be. Diagnostic testing is an excellent
instructional tool because when a student says, in effect, "I don't see
why the question was scored that way," an inquiry is started toward unscrambling
the false connections. In this close-up look, the teacher may note a pattern
of mistakes showing a misunderstanding of a particular rule, procedure, or principle.
It may also appear that a student has the right answer but for the wrong reasons.
A diagnostic test is a sort of intellectual X-ray showing the strengths and
weaknesses in a student's inventory of information, understanding, and skill.
The evaluative emphasis is on the responses to individual test items, on information
prerequisite for understanding the larger concepts and procedures in this particular
course of study.
When students realize the significance to themselves of grade-free probing,
they are more likely to open up and reveal low points in their preparation profile,
anxieties, misconceptions and deficiencies in knowing how to do certain tasks.
A sprinkling of short, diagnostic quizzes early in the term suggests to students
that the teacher cares about how they are doing and is taking corrective steps
to help them along - an excellent climate for starting the semester.
Assessing Achievement
Although test scores in any setting are affected by students' aptitude, study
skills, motivation, background preparation, and the influence of the teacher,
our classroom examinations should be designed primarily to measure subject-matter
achievement. To this end, the teacher and student seek the same wavelength within
an assigned domain of knowledge. A frustrated student expressed a contrary state
of affairs quite clearly, "I don't like to play the professor's game: I've
got a secret, see if you can guess what it is."
Effective classroom instruction is central to student learning, but students
are short-changed if the examinations are trivial, irrelevant, confusing, or
tangential to the substance of the course. College teaching is not complete
without an accurate and fair assessment of students' achievement during the
term and at its conclusion.
Objective Tests
Objective (machine-scorable) tests are almost mandatory in large classes, but
constructing such instruments is a demanding task. Although it is tempting for
teachers to make use of commercially available examinations, to pull old tests
from the file, or to overuse test items taken from a teacher's manual, students
are best served when their instructors develop exams tailored to their specific
course and based on sound principles.
Two basic concepts need to guide the development of classroom examinations:
1. Validity refers to whether an instrument measures what it is supposed
to measure. A valid test, therefore, samples what students should have learned
from your course offering. It measures here-and-now achievement rather than,
for example, how well a student reads or how much information the student has
gained outside the course. Test items about minutiae and footnote information
are temptingly easy to put together but lack the validity of questions that
elicit a student's understanding of key concepts, important factual data, and
relevant procedures. A valid test is an unambiguous reflection of what is worth
knowing and remembering.
2. Reliability refers to the consistency of an instrument's results.
A good short quiz is better than a poorly constructed long test but, assuming
equal quality of items, a 50-item test is more reliable (stable, consistent)
than a 10-item quiz. The random errors due to ambiguous wording, idiosyncratic
interpretations, distractions, and other flaws are more likely to cancel out
in the longer test, resulting in a more dependable total score. Thus, the easiest
way to reduce the unreliability in the measuring instrument is simply to increase
the length of the test.
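The effect of test length on reliability can be estimated with the Spearman-Brown formula, a standard psychometric result that the text above does not name but that matches its reasoning. The sketch below assumes the added items are of the same quality as the originals; the starting reliability of 0.50 for the 10-item quiz is a hypothetical figure chosen only for illustration.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability after lengthening a test by `length_factor`,
    assuming the new items are comparable in quality to the old ones."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical 10-item quiz with reliability 0.50, extended to 50 items.
quiz_reliability = 0.50
predicted = spearman_brown(quiz_reliability, length_factor=50 / 10)
print(round(predicted, 2))  # 0.83
```

Under these assumptions, a fivefold increase in length raises the estimated reliability from 0.50 to about 0.83. The gain depends entirely on the "equal quality of items" condition; padding a test with weak items will not produce it.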
Objective tests come in many forms, but the multiple-choice format carries most
of the burden. When carefully worded, multiple-choice items can probe a student's
understanding of factual information, skills and procedures, concrete and abstract
concepts, and the implications from different scales of values. (True-false
items are altogether too constrained to be effective discriminators for most
college courses.)
To strengthen the quality of the set of items used, a complete item analysis
should be made of each new test. This test-of-the-test is mainly to determine
and adjust the difficulty level of each item. It is normal to find that many
of our carefully conceived questions turn out to be too easy or too difficult
or just seem to ride along as excess baggage. Such items use valuable testing
time but add little to the discriminating power of the test. They don't help
to separate the top group of students from the bottom group of achievers.
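One common way to carry out such an item analysis is sketched below. It computes, for each item, a difficulty index (the proportion of students answering correctly) and a discrimination index (the proportion correct in the top-scoring group minus the proportion correct in the bottom-scoring group). The function name, the one-third group size, and the sample data are illustrative assumptions, not a procedure prescribed by the author.

```python
def item_analysis(responses, group_fraction=1 / 3):
    """responses: one list per student of 0/1 item scores (1 = correct).
    Returns a (difficulty, discrimination) pair for each item."""
    n_students = len(responses)
    n_items = len(responses[0])
    # Rank students by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(n_students * group_fraction))
    upper, lower = ranked[:k], ranked[-k:]
    results = []
    for i in range(n_items):
        difficulty = sum(r[i] for r in responses) / n_students
        p_upper = sum(r[i] for r in upper) / k
        p_lower = sum(r[i] for r in lower) / k
        results.append((difficulty, p_upper - p_lower))
    return results


# Hypothetical results: six students, five items.
scores = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
for number, (difficulty, discrimination) in enumerate(item_analysis(scores), start=1):
    print(f"item {number}: difficulty={difficulty:.2f} discrimination={discrimination:.2f}")
```

In this made-up data, the first item is answered correctly by everyone (difficulty 1.00, discrimination 0.00), so it is the kind of "excess baggage" described above, while the third item is answered by half the class and separates the top group from the bottom group cleanly.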
Because ambiguity of meaning is a persistent problem, the wording of test items
is critical. Careful editing of the draft exam includes close attention to such
pitfalls as cluing the right answer, overlapping correct alternatives, or asking
for a positive answer to a negative question. Good test items are parsimonious
in meaning and simple in wording. It is surprising how quickly excess words
can lead to double meaning or obscure the correct answer. It is appropriate,
however, to expand the stem - the lead-in statement of the multiple-choice question
- by using a relevant quotation or making reference to a particular body of
factual data.
Score the test in a straightforward manner, e.g., in terms of the number of
right answers. Trying to adjust (punish) for guessing may, in effect, simply
open further sources of variability. Combining raw scores from different performance
measures, i.e., tests, term papers, class participation, special projects, etc.,
can easily distort your original intention. The statistical solution is to convert
the different measures to a common scale through the use of some type of standard-score
scale.
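The distortion from adding raw scores, and the standard-score remedy, can be illustrated with z-scores, one common type of standard score. In the sketch below, the measure names, the equal weights, and the scores for three students are all hypothetical.

```python
from statistics import mean, pstdev


def z_scores(raw):
    """Convert raw scores to standard scores (mean 0, standard deviation 1)."""
    m, s = mean(raw), pstdev(raw)
    if s == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return [(x - m) / s for x in raw]


def composite(measures, weights):
    """Weighted sum of standardized measures, one value per student."""
    standardized = {name: z_scores(raw) for name, raw in measures.items()}
    n_students = len(next(iter(measures.values())))
    return [
        sum(weights[name] * standardized[name][i] for name in measures)
        for i in range(n_students)
    ]


# Hypothetical class of three: an exam scored out of 100, a paper out of 10.
measures = {"exam": [60, 70, 80], "paper": [9, 8, 7]}
weights = {"exam": 0.5, "paper": 0.5}

raw_totals = [e + p for e, p in zip(measures["exam"], measures["paper"])]
print(raw_totals)                    # [69, 78, 87]
print(composite(measures, weights))  # all three values are effectively zero
```

The raw totals rank the students almost entirely by the exam, because the exam scores are spread over a much wider range than the paper scores. Once both measures are on a common scale and weighted equally, the three students come out even, which is what equal weighting was meant to express.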
Subjective Evaluation
The distinctive value of essay exams or term papers is the freedom they offer
for students to probe and develop the personal meaning of ideas and to express
these thoughts in their own words. To organize an integrated chain of thought,
to elaborate on findings, and to communicate ideas to others are stronger tests
of achievement than is the recognition or recall of isolated units of information.
1. Essay Exams. In an essay examination, the student is staring at a
blank page and generating, from within, a complicated sequence of responses
aimed at organizing a meaningful unit of knowledge. This ability to recall is
a more demanding test of memory than simply to recognize something. An essay
examination elicits the ability to retrieve information with little help
from presently given cues. The perceptive teacher (reader) can evaluate the
strong and weak points in a written argument even when the student's perception
of a question differs from the teacher's. Evaluative permissiveness can, of
course, go only so far.
A steady and unwavering evaluative state of mind is difficult to sustain when
reading page after page through a set of exams. Three procedural controls help
to reduce the evaluating drift: (1) turn under the front (name) page to forestall
confounding effects from those students we particularly like or dislike; (2)
read one question at a time through the entire set of exam booklets; (3) shuffle
the order of the booklets periodically to balance the inevitable effects of
reader fatigue or an emerging tilt toward one pattern of answers.
2. Term Papers. In some respects, the term paper is the essence of what
a student has gained from the course. It sets forth what the individual student
has learned and how the student has pulled together all the information for
comprehension and understanding. This, in turn, serves to keep the knowledge
available in long-term memory.
A written handout is a useful guide regarding the due date, length, use of references,
comments about style, and any other restrictions or suggestions about the assignment.
It may, for example, be helpful to remind students about the difference between
describing versus analyzing events and ideas. The heavy task of reading these
papers is counterbalanced, somewhat, by the satisfaction of reading the better
papers - some of which can be truly exciting.
Grading a stack of exams and papers is a time-consuming and pressured task because,
throughout, the matter of fair play is squarely on the back of the reader. By
way of evaluation, the teacher should indicate in some detail the rationale
for assigning the gross grade, making specific reference to identified parts
of the exam or paper. The instructional value of essay exams and term papers
is practically wiped out if the student receives nothing back other than the
grade.
Grading
Faculty standards for A-grade performance define the meaning of excellence within
the university. We must guard the criteria of achievement, since everyone pays
the price of academic inflation when these standards are lowered. Students work
hard for grades because "making the grade" is personally rewarding
and is an important basis for special awards, admission to advanced training,
and employment prospects. With such payoff potential it is unfair for a teacher
to be casual or careless in assigning this index of achievement. Judgments about
professional competence must take into account the quality of a teacher's procedures
for testing and grading.
There are two basic options available to instructors for grading student achievement:
1. Norm-referenced grading, more commonly referred to as grading-on-the-curve,
sets the scale of achievement by the average level of class performance. Students
basically compete against one another in this approach.
2. Criterion-referenced grading has the teacher measuring the students
against some absolute standard with respect to what they are expected to learn.
The competition here is between the student and mastery of a finite body of
knowledge.
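The difference between the two options can be made concrete with a small sketch. The percentage cutoffs, the z-score bands, and the class scores below are hypothetical choices for illustration only; actual cutoffs and curves vary by instructor and department.

```python
from statistics import mean, pstdev


def criterion_grade(score, cutoffs=((90, "A"), (80, "B"), (70, "C"), (60, "D"))):
    """Grade against a fixed standard: the score alone decides the letter."""
    for minimum, letter in cutoffs:
        if score >= minimum:
            return letter
    return "F"


def norm_grade(score, class_scores,
               bands=((1.0, "A"), (0.0, "B"), (-1.0, "C"), (-2.0, "D"))):
    """Grade on the curve: the letter depends on distance from the class mean."""
    z = (score - mean(class_scores)) / pstdev(class_scores)
    for minimum_z, letter in bands:
        if z >= minimum_z:
            return letter
    return "F"


# Hypothetical scores from a strong class.
class_scores = [95, 92, 90, 88, 85]
print([criterion_grade(s) for s in class_scores])           # ['A', 'A', 'A', 'B', 'B']
print([norm_grade(s, class_scores) for s in class_scores])  # ['A', 'B', 'B', 'C', 'D']
```

With these assumed settings, the same five scores earn three A's and two B's against the fixed standard, but are spread from A to D on the curve, because each student is judged relative to classmates rather than against mastery of the material.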
In practice, these two approaches overlap and merge since a teacher's judgment
about levels of achievement is influenced by the levels of student performance
with which one is accustomed at a given school. Also, the departmental culture
enters into the picture, because a teacher's procedures and standards for testing
and grading are expected to fall in line with the traditions or policies of
the home department.
The danger in grading-on-the-curve is its diminishment of the teacher's responsibility
for evaluating the students' level of understanding against his or her preset
criteria of subject-matter achievement. The final examination, for example,
is a revealing statement sampling the information and skills the teacher believes
should be carried from the course.
With criterion-referenced grading, there is the danger that the instructor may
set the expected level of achievement unrealistically high or low, with the
result that students perceive the exam as inappropriate and unfair.
Grades serve the academic purpose of showing intellectual achievement in a limited
domain defined by books, teachers, laboratories, and the like. They are not
designed to predict success in the off-campus setting where special weight may
be given to information, aptitudes, and personal characteristics extending beyond
the boundaries of teachers and their courses. Only indirectly, or on occasion,
do grades reflect a student's tolerance for stress, independent decision-making,
congeniality in human relations, ability to cope with unexpected problems, and
the like. Teachers can best sustain the credibility of the grading system by
making their assessments reflect as fairly as possible how well each student
has achieved the stated objectives of the course.
CRLT • University of Michigan • 1071 Palmer Commons • 100 Washtenaw Ave. • Ann Arbor, MI 48109-2218
Phone: (734) 764-0505 • Fax: (734) 647-3600 • Email: crlt@umich.edu