Unit 9

Item Analysis of Classroom Tests: Aims and Simplified Procedures

Aim:

How well did my test __distinguish__ among students according to __how well they met my learning goals__?

Recall that each item on your test is intended to sample performance on a particular learning outcome. The test as a whole is meant to estimate performance across the full domain of learning outcomes you have targeted.

Unless your learning goals are minimal or low (as they might be, for instance, on a readiness test), you can expect students to differ in how well they have met those goals. (Students are not peas in a pod!) Your aim is not to differentiate students just for the fun of it, but to be able to measure the differences in mastery that occur.

One way to assess how well your test is functioning for this purpose is to look at how well the individual items do so. The basic idea is that a good item is one that good students get correct more often than do poor students. You might end up with a big spread in scores, but what if the good students are no more likely than poor students to get a high score? If we assume that you have actually given them proper instruction, then your test has not really assessed what they have learned. That is, it is "not working."

An item analysis gets at the question of whether your test is working by asking the same question of each __individual__ item: how well does it discriminate? If many of your items discriminated little or not at all, you may want to replace them with better ones. If you find items that worked in the wrong direction (where good students did worse) and therefore lowered test reliability, then you will definitely want to get rid of them.

In short, item analysis gives you a way to exercise additional quality control over your tests. Well-specified learning objectives and well-constructed items give you a head start in that process, but item analyses can give you feedback on how successful you actually were.

Item analyses can also help you diagnose __why__ some items did not work especially well, and thus suggest ways to improve them (for example, if you find distracters that attracted no one, try developing better ones).

Reminder

Item analyses are intended to assess and improve the __reliability__ of your tests. If test reliability is low, test validity will necessarily also be low. This is the ultimate reason you do item analyses—to improve the validity of a test by improving its reliability. Higher reliability will not necessarily raise validity (you can be more consistent in hitting the wrong target), but it is a prerequisite. That is, high reliability is necessary but not sufficient for high validity (do you remember this point on Exam 1?).

However, when you examine the properties of each item, you will often discover how they may or may not actually have assessed the learning outcome you intended—__which is a validity issue__. When you change items to correct these problems, it means the item analysis has helped you to improve the likely validity of the test the next time you give it.

The procedure (apply it to the sample results I gave you)

- Identify the upper 10 scorers and lowest 10 scorers on the test. Set aside the remainder.
- Construct an empty chart for recording their scores, following the sample I gave you in class. This chart lists the students down the left, by name. It arrays each item number across the top. For a 20-item test, you will have 20 columns for recording the answers for each student. Underneath each item number, write in the correct answer (A, B, etc.).
- Enter the student data into the chart you have just constructed.
- Take the top 10 scorers, and write each student’s name down the left, one row for each student. If there is a tie for 10th place, pick one student randomly from those who are tied.
- Skip a couple rows, then write the names of the 10 lowest-scoring students, one row for each.
- Going student by student, enter each student’s answers into the cells of the chart. __However, enter only the wrong answers (A, B, etc.).__ Any empty cell will therefore signal a correct answer.
- Go back to the upper 10 students. Count how many of them got Item 1 correct (this would be all the empty cells). Write that number at the bottom of the column for those 10. Do the same for the other 19 questions. We will call these sums R_{U}, where U stands for "upper."
- Repeat the process for the 10 lowest students. Write those sums under their 20 columns. We will call these R_{L}, where L stands for "lower."
- Now you are ready to calculate the two important indices of item functioning. These are actually only estimates of what you would get if you had a computer program to calculate the indices for everyone who took the test (some schools do). But they are pretty good.
- **Difficulty index**. This is just the proportion of people who passed the item. Calculate it __for each item__ by adding the number correct in the top group (R_{U}) to the number correct in the bottom group (R_{L}) and then dividing this sum by the total number of students in the top and bottom groups (20).
- **Discrimination index**. This index is designed to highlight to what extent students in the upper group were more likely than students in the lower group to get the item correct. That is, it is designed to get at the differences between the two groups. Calculate the index by subtracting R_{L} from R_{U}, and then dividing by __half__ the number of students involved (10).
- You are now ready to enter these discrimination indexes into a second chart.
- Construct the second chart, based on the model I gave you in class. (This is the smaller chart that contains no student names.)
- Note that there are two rows of column headings in the sample. The first row of headings contains the __maximum possible__ discrimination indexes for each item difficulty level (more on that in a moment). The second row contains possible difficulty indexes. Let’s begin with that second row of headings (labeled "p"). As your sample shows, the entries range on the far left from "1.0" (for 100%) to ".4-0" (40%-0%) for a final catch-all column. Just copy the numbers from the sample onto your chart.
- Now copy the numbers from the first row of headings in the sample (labeled "Md").
- Now is the time to pick up your first chart again, where you will find the __discrimination__ indexes you need to enter into your second chart.
- You will be entering its __last__ row of numbers into the body of the second chart.
- List each of these discrimination indexes in one and only one of the 20 columns. But which one? List each in the column corresponding to its __difficulty__ level. For instance, if item 4’s difficulty level is .85 and its discrimination index is .10, put the .10 in the difficulty column labeled ".85." This number is entered, of course, into the row for the fourth item.
- Study this second chart.
- How many of the items are of medium difficulty? These are the best, because they provide the most opportunity to discriminate (to see this, look at their maximum discrimination indexes in the first row of headings). Items that most everybody gets right or gets wrong simply can’t discriminate much.
- The important test for an item’s discriminability is to compare it to the maximum possible. How well did each item discriminate __relative to__ the maximum possible for an item of its particular difficulty level? Here is a rough rule of thumb.
- Discrimination index is near the maximum __possible__ = very discriminating item
- Discrimination index is about half the maximum possible = moderately discriminating item
- Discrimination index is about a quarter the maximum possible = weak item
- Discrimination index is near zero = non-discriminating item
- Discrimination index is __negative__ = bad item (delete it if worse than -.10)
- Go back to the first chart and study it.
- Look at whether all the distracters attracted someone. If some did not attract any, then the distracter may not be very useful. Normally you might want to examine it and consider how it might be improved or replaced.
- Look also for distracters that tended to pull your best students and thereby __lower__ discriminability. Consider whether the discrimination you are asking them to make is educationally significant (or even clear). You can’t do this kind of examination for the sample data I have given you, but keep it in mind for real-life item analyses.
- There is much more you can do to mine these data for ideas about your items, but this is the core of an item analysis.
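The rule of thumb in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The "Md" row from the class sample is not reproduced here, so the sketch assumes the maximum possible discrimination for two equal-sized groups is twice the smaller of p and 1 - p (the value you get when as many of the correct answers as possible come from the upper group). The numeric cutoffs for "near," "about half," and "about a quarter" are likewise assumptions chosen for the example, and the function names are hypothetical.

```python
def max_discrimination(p):
    """Assumed maximum discrimination index at difficulty p (equal groups)."""
    return 2 * min(p, 1 - p)


def rate_item(p, d):
    """Label an item by comparing its discrimination index d to the maximum."""
    if d < 0:
        return "bad"  # worked in the wrong direction
    md = max_discrimination(p)
    if md == 0:
        # Everyone got the item right, or everyone got it wrong.
        return "non-discriminating"
    ratio = d / md
    # The cutoffs below are illustrative assumptions, not values from the handout.
    if ratio >= 0.75:
        return "very discriminating"
    if ratio >= 0.375:
        return "moderately discriminating"
    if ratio >= 0.125:
        return "weak"
    return "non-discriminating"


# The handout's example: item 4 has difficulty .85 and discrimination .10.
print(rate_item(0.85, 0.10))
```

Under these assumptions the maximum for a difficulty of .85 is about .30, so a discrimination index of .10 is roughly a third of what was possible and the item is labeled weak.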

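In formula form, the two indexes for each item are:

$$\text{Difficulty index} = \frac{R_{U} + R_{L}}{20}$$

Record these 20 numbers in a row near the bottom of the first chart.

$$\text{Discrimination index} = \frac{R_{U} - R_{L}}{10}$$

Record these 20 numbers in the last row of the first chart.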
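If you would rather check your hand calculations with a short script, the two calculations above might be sketched as follows. The data layout is an assumption made for the example: each student is a dictionary mapping an item number to the wrong answer chosen, and a missing item number means a correct answer, which mirrors the "empty cell = correct" convention of the first chart. The function names and the tiny data set are hypothetical.

```python
def count_correct(group, n_items):
    """For each item, count the students in the group who answered correctly."""
    return [sum(1 for wrong in group if item not in wrong)
            for item in range(1, n_items + 1)]


def item_indexes(upper, lower, n_items):
    """Return (difficulty, discrimination) lists with one value per item."""
    if len(upper) != len(lower):
        raise ValueError("upper and lower groups must be the same size")
    n = len(upper)                        # 10 in the handout
    r_u = count_correct(upper, n_items)   # R_U for each item
    r_l = count_correct(lower, n_items)   # R_L for each item
    difficulty = [(u + l) / (2 * n) for u, l in zip(r_u, r_l)]
    discrimination = [(u - l) / n for u, l in zip(r_u, r_l)]
    return difficulty, discrimination


# Made-up example: a 3-item test with 2 students per group.
upper = [{}, {3: "B"}]                                 # only wrong answers recorded
lower = [{1: "C", 3: "A"}, {1: "D", 2: "A", 3: "B"}]
p, d = item_indexes(upper, lower, n_items=3)
print(p)  # difficulty index for items 1-3
print(d)  # discrimination index for items 1-3
```

With the 10 upper and 10 lower scorers from the procedure, the divisors become 20 and 10, matching the formulas.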

If you are lucky

If you use Scantron sheets for grading exams, ask your school whether it can calculate item statistics when it processes the sheets. If it can, those statistics probably include what you need: (a) the difficulty index for each item, (b) the correlation of each item with students' total scores on the test, and (c) the number of students who responded to each distracter. The item-total correlation is comparable to (and more accurate than) your discrimination index.

If your school has this software, then you won't have to calculate any item statistics, which makes your item analyses faster and easier. It is important that you have calculated the indexes once on your own, however, so that you know what they mean.
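To get a feel for what an item-total correlation is, here is a minimal sketch of one common version: an ordinary Pearson correlation between each student's score on the item (1 for right, 0 for wrong) and that student's total test score. This is an assumption about what such software reports. A given program may compute it differently, for example by leaving the item out of the total. The function name and the data are made up for illustration.

```python
from math import sqrt


def item_total_correlation(item_scores, total_scores):
    """Pearson correlation between 0/1 item scores and total test scores."""
    n = len(item_scores)
    mean_i = sum(item_scores) / n
    mean_t = sum(total_scores) / n
    cov = sum((i - mean_i) * (t - mean_t)
              for i, t in zip(item_scores, total_scores))
    var_i = sum((i - mean_i) ** 2 for i in item_scores)
    var_t = sum((t - mean_t) ** 2 for t in total_scores)
    if var_i == 0 or var_t == 0:
        return None  # undefined when every student has the same score
    return cov / sqrt(var_i * var_t)


# Made-up data for five students: item right/wrong and total test score.
item = [1, 1, 1, 0, 0]
totals = [18, 16, 15, 11, 9]
r = item_total_correlation(item, totals)
print(round(r, 2))
```

Unlike the discrimination index, this uses every student who took the test, not only the top and bottom groups.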