Overheads for Unit 10—Chapter 19 (Interpreting Standardized Test Scores)



OH 1

The Challenge



Technical Challenge


Educational and psychological measures not like pounds or inches

  • No zero point
  • Units of measurement not equal


Methods have been developed to cope with this limitation

  • By providing meaningful frames of reference for interpreting scores
  • By providing ways that give equal units of measurement
  • By providing ways to compare and add very different kinds of scores



Professional Standard


“Should be able to interpret commonly reported scores: [such as] percentile ranks, percentile band scores, standard scores, and grade equivalents.” (Standard 3 for Teacher Competence in Educational Assessment)



OH 2

Methods of Interpreting Test Scores


Raw score


  • Number of points earned when the test is scored following the scoring directions

  • Has no inherent meaning (neither does % correct)


Criterion-referenced and standards-based interpretations


  1. Definition
    1. student’s score relates to clear description of specific tasks a student can perform
    2. those tasks, in turn, related to specified standards of mastery
    3. no need to consider other students’ scores
  2. Most useful when test designed for this purpose
    1. set of clearly stated learning objectives
    2. enough items to infer degree of mastery or non-mastery of that domain
    3. items selected to actually measure that domain
  3. Guidelines for when you can (cautiously) interpret norm-referenced tests in criterion-referenced terms
    1. Are the achievement domains (e.g., objectives) homogeneous, delimited, and clearly specified?
      • If not, avoid specific descriptive statements
    2. Are there enough items (say, 10) for each type of interpretation?
      • If not, combine items into larger clusters or make only tentative judgments
    3. Were easy items omitted to increase discrimination?
      • If so, then scores won’t describe what low achievers can do
    4. Were only selection-type items used?
      • If so, then scores are influenced by guessing
    5. Do the test items provide a directly relevant measure of the objectives?
      • If not, base interpretations on what they do measure


Norm-Referenced Interpretation


  1. Definition
    1. student’s score relative to other students (in a norm group)
    2. norm group is carefully defined
    3. no need to look at level of mastery
  2. Derived scores
    1. definition: raw scores converted into numbers that have meaning within a particular comparison group
    2. derived scores needed because simple rankings have limited value
    3. most common types: grade equivalents, percentiles, standard scores
    4. simple to calculate and conversion tables often provided
    5. many types are standard scores (e.g., T-scores, NCE, standard age scores), based on same logic using the normal curve
    6. other types of developmental scales besides GE (e.g., age-equivalents)


Expectancy Tables (chapter 4)


  1. Definition: two-way chart that shows how often students at each score level on one measure (say, SAT math) perform at each level on another valued performance (say, freshman grades in college)
  2. Don’t need any norms
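
A minimal Python sketch of how such a two-way chart could be tallied. The score bands, grade bands, and student records are made-up illustrations, not data from the chapter, and `expectancy_table` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def expectancy_table(records):
    """Return {predictor_level: {outcome_level: percent of students}}.

    records is an iterable of (predictor_level, outcome_level) pairs.
    """
    counts = defaultdict(Counter)
    for predictor_level, outcome_level in records:
        counts[predictor_level][outcome_level] += 1

    table = {}
    for predictor_level, outcome_counts in counts.items():
        total = sum(outcome_counts.values())
        table[predictor_level] = {
            outcome: 100 * n / total for outcome, n in outcome_counts.items()
        }
    return table

# Hypothetical records: (SAT math band, freshman grade band)
records = [
    ("600+", "A-B"), ("600+", "A-B"), ("600+", "A-B"), ("600+", "C or below"),
    ("below 600", "A-B"), ("below 600", "C or below"),
    ("below 600", "C or below"), ("below 600", "C or below"),
]
table = expectancy_table(records)
print(table["600+"]["A-B"])  # 75.0
```

Each row is read directly as "of the students who scored in this band, what percent ended up at each outcome level," which is why no norm group is required.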



OH 3

Grade Equivalent Scores



  • Definition: the grade level at which the typical student obtains that raw score
  • Sample interpretation: “student had the same raw score that was average for students in grade 5.6 in the average school”
  • Typical score is determined for each month in a grade: 5.0-5.9
  • Tables provided, so just look up what grade level corresponds to a student’s raw score
  • Widely used, especially in elementary school


Widely Misinterpreted!


  1. Don’t confuse GE norms with standards that all students should attain
  2. Don’t interpret a GE as an estimate of the grade a student should be placed in
  3. Don’t expect all students to gain 1.0 GE each year (the average). Not a realistic goal
  4. Don’t assume that the units are equal at different parts of the scale (the same difference can mean “just above” or “vastly above” average)
  5. Don’t assume that scores on different tests are comparable
    • Some publishers test fuller ranges of students than others
    • Patterns of growth (variance in scores) may differ across subjects
  6. Don’t interpret extreme scores as dependable estimates of student’s performance (usually extrapolated)




  • Most useful in reporting growth in basic skills in elementary school
  • Least useful for comparing performance on different tests
  • Inequality in grade units will muddle interpretation if you don’t keep it clearly in mind



OH 4

Percentile Rank




  • Definition: the percentage of students in the norm group scoring below a particular raw score (relative position in the group)
  • Widely used and easily understood
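
The definition above can be sketched in a few lines of Python. The norm group here is a made-up list of ten raw scores, and `percentile_rank` is an illustrative name; in practice the publisher's conversion table does this lookup.

```python
def percentile_rank(raw_score, norm_scores):
    """Percentage of the norm group scoring below raw_score."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return 100 * below / len(norm_scores)

# Hypothetical norm group of 10 raw scores
norm_group = [12, 15, 15, 18, 20, 22, 22, 25, 27, 30]
print(percentile_rank(22, norm_group))  # 50.0
```

Some definitions also count half of the students tied at the score; this sketch follows the "scoring below" wording used here.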


Requirements for use


  • A conversion table (from raw scores to percentiles) based on a norm group
  • A norm group (conversion table) that is appropriate for the students taking the test: grade or age, time of year
  • A norm group (conversion table) that is also specific to the exact test being given: test, subtest, form or (difficulty) level of the test
  • Many tests or student groups means many conversion tables
  • Different purposes (comparisons of same child with different groups) require different norms




  1. Must always refer to a student’s percentile rank as relative to a particular norm group
  2. Usually require multiple sets of norms, especially in high school and beyond
  3. Units not equal, especially at the extremes
    • Pattern of inequality is predictable, however
    • Same percentile difference (say, 5 points) reflects a much bigger difference in performance at the extremes than near the average (recall the shape of the normal curve)
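
To see the inequality of percentile units in numbers, the sketch below measures a 5-point percentile gap in standard deviation units, once near the average and once near the top. It assumes scores follow a normal curve and uses Python's standard-library `statistics.NormalDist`; `sd_gap` is an illustrative helper name.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

def sd_gap(pct_low, pct_high):
    """Distance in SD units between two percentile ranks on a normal curve."""
    return std_normal.inv_cdf(pct_high / 100) - std_normal.inv_cdf(pct_low / 100)

middle = sd_gap(50, 55)   # 5 percentile points near the average
extreme = sd_gap(94, 99)  # 5 percentile points near the top
print(round(middle, 2), round(extreme, 2))  # 0.13 0.77
```

Under this assumption the same 5 percentile points span roughly six times as much of the score scale at the top as they do in the middle.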



OH 5

Standard Scores





  • Standard score—how far above or below average a student scored
  • Distance is calculated in standard deviation (SD) units (a standard deviation is a measure of spread or variability)
  • The mean and standard deviation are for a particular norm group




Based on the “normal curve,” which means that:


  1. Scores are distributed symmetrically around the mean (average)
  2. Each SD represents a fixed (but different) percentage of cases
  3. Almost everyone is included between –3.0 and 3.0 SDs of the mean
  4. The SD allows conversion of very different kinds of raw scores to a common scale that (a) has equal units and (b) can be readily interpreted in terms of the normal curve
  5. When we can assume that scores follow a normal curve (classroom tests usually don’t but standardized tests do), we can translate standard scores into percentiles—very useful!
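
A short sketch of the last point, assuming the scores really do follow a normal curve. It uses the standard-library `statistics.NormalDist`; the function name `z_to_percentile` is illustrative.

```python
from statistics import NormalDist

def z_to_percentile(z):
    """Percentage of a normal distribution falling below z."""
    return 100 * NormalDist().cdf(z)

for z in (-1.0, 0.0, 1.0, 2.0):
    print(z, round(z_to_percentile(z)))
# -1.0 16
# 0.0 50
# 1.0 84
# 2.0 98
```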



OH 6

Types of Standard Scores



All Standard Scores


  • Share a common logic
  • Can be translated into each other (see figure 19.2, p. 494)




z-Score


  • Simplest
  • The one on which all others based
  • Formula: z = (X-M)/SD, where X is the person’s score, M is the group’s average, and SD is the group’s spread (standard deviation in scores)
  • z is negative for scores that are below average, so z’s are usually converted into some other system that has all positive numbers
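
The formula can be checked with a small worked example. The norm-group mean and SD below are made-up values chosen for easy arithmetic.

```python
def z_score(x, mean, sd):
    """z = (X - M) / SD: distance from the group mean in SD units."""
    return (x - mean) / sd

# Hypothetical norm group: M = 40, SD = 8
print(z_score(52, 40, 8))  # 1.5
print(z_score(36, 40, 8))  # -0.5
```

The second result shows the negative value that a below-average score produces.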


T-Score


  • Normally distributed standard scores
  • M=50, SD=10
  • Can be obtained from z scores: T  = 50 + 10(z)


Normalized Standard Scores


  • Starts with scores that you want to make conform to the normal curve
  • Get percentile ranks for each score
  • Transform percentiles into z scores using a conversion table (I handed one out in class)
  • Then transform into any other standard score you want (e.g., T-score, IQ equivalents)
  • Hope that your assumption was right, namely, that the scores really do naturally follow a normal curve. If they don’t, your interpretations (say, of equal units) may be somewhat mistaken
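
The steps above might be sketched as follows, with the standard-library `NormalDist.inv_cdf` standing in for the printed conversion table. One assumption is added: percentile ranks are taken at the midpoint of ties, so that no proportion is exactly 0 or 1, where the normal curve has no finite z. The raw scores are made up.

```python
from statistics import NormalDist

def normalized_t_scores(raw_scores):
    """Map each distinct raw score to a normalized T-score."""
    n = len(raw_scores)
    result = {}
    for score in sorted(set(raw_scores)):
        below = sum(1 for s in raw_scores if s < score)
        ties = sum(1 for s in raw_scores if s == score)
        # Step 1: percentile rank as a proportion (midpoint of ties; assumption)
        proportion = (below + 0.5 * ties) / n
        # Step 2: proportion -> z via the normal curve
        z = NormalDist().inv_cdf(proportion)
        # Step 3: z -> T-score (M = 50, SD = 10)
        result[score] = 50 + 10 * z
    return result

# Hypothetical, badly skewed raw scores
t = normalized_t_scores([1, 2, 3, 4, 100])
print(round(t[3], 1))    # 50.0, the median score
print(round(t[100], 1))  # 62.8, despite the huge raw-score gap
```

The outlier of 100 lands only about 1.3 SD above the mean because normalizing keeps rank order and discards raw distances, which is exactly the assumption the last bullet warns about.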




Stanines


  • Very simple type of normalized standard score
  • Ranges from 1-9 (the “standard nines”)
  • Each stanine from 2-8 covers ½ SD
  • Stanine 5 = percentiles 40-59 (the middle 20 percent)
  • A difference of 2 stanines usually signals a real difference
  • Strengths
    1. easily explained to students and parents
    2. normalized, so can compare different tests
    3. can add stanines to get a composite score
    4. easily recorded (only one column)
  • Limitations
    1. like all standard scores, cannot record growth
    2. crude, but prevents overinterpretation
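
A sketch of assigning stanines from percentile ranks. The cutoffs are the conventional ones (4, 7, 12, 17, 20, 17, 12, 7, and 4 percent of cases per stanine); only the stanine 5 range is stated on this overhead, so treat the rest as an assumption to check against the test manual.

```python
from bisect import bisect_right

# Cumulative percentile cutoffs separating the nine stanines (conventional values)
STANINE_CUTOFFS = [4, 11, 23, 40, 60, 77, 89, 96]

def stanine_from_percentile(percentile_rank):
    """Return the stanine (1-9) for a percentile rank between 0 and 100."""
    return bisect_right(STANINE_CUTOFFS, percentile_rank) + 1

print(stanine_from_percentile(40))  # 5
print(stanine_from_percentile(59))  # 5
print(stanine_from_percentile(60))  # 6
print(stanine_from_percentile(97))  # 9
```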


Normal-Curve Equivalents (NCE)


  • Normally distributed standard scores
  • M=50
  • SD=21.06
  • Results in scores that go from 1-99
  • Like percentiles, except that they have equal units (this means that they make fewer distinctions in the middle of the curve and more at the extremes)
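
The SD of 21.06 is what makes NCEs line up with percentile ranks at 1, 50, and 99. A minimal sketch, assuming normally distributed scores and using the standard-library `NormalDist`; the function name is illustrative.

```python
from statistics import NormalDist

def nce_from_percentile(percentile_rank):
    """NCE = 50 + 21.06z, where z is the normal-curve z for the percentile."""
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return 50 + 21.06 * z

for pr in (1, 25, 50, 75, 99):
    print(pr, round(nce_from_percentile(pr), 1))
```

The two scales agree at 1, 50, and 99 but not in between: a percentile rank of 25 corresponds to an NCE of about 35.8.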


Standard Age Scores (SAS)


  • Normally distributed standard scores
  • Put into an IQ metric, where
  • M=100
  • SD=15 (Wechsler IQ Test) or SD=16 (Stanford-Binet IQ Test)



OH 7

Converting among Standard Scores


Easy Convertibility


  • All are different ways of saying the same thing
  • All represent equal units at different ranges of scores
  • All can be averaged (among themselves)
  • Can easily convert one into the other
  • Figure 19.2 on p 494 shows how they line up with each other
  • But interpretable only when scores are actually normally distributed (standardized tests usually are)
  • Downside—not as easily understood by students and parents as are percentiles
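
Because every standard score is a z-score restated with a different mean and SD, conversion is a two-step calculation: back to z, then out to the target scale. A sketch using the means and SDs from these overheads; it takes SD=15 for the standard age score (use 16 for a Stanford-Binet metric), and `SCALES` and `convert` are illustrative names.

```python
# (mean, SD) for each standard-score system
SCALES = {
    "z": (0, 1),
    "T": (50, 10),
    "NCE": (50, 21.06),
    "SAS": (100, 15),  # assumption: Wechsler-style SD of 15
}

def convert(score, from_scale, to_scale):
    """Convert a standard score from one system to another via z."""
    from_mean, from_sd = SCALES[from_scale]
    to_mean, to_sd = SCALES[to_scale]
    z = (score - from_mean) / from_sd
    return to_mean + to_sd * z

print(convert(60, "T", "SAS"))   # 115.0
print(convert(130, "SAS", "T"))  # 70.0
```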



OH 8

Using Standard Scores to Examine Profiles




  • You can compare a student’s scores on different tests and subtests when you convert all the scores to the same type of standard score
  • But all the tests must use the same norm group
  • Plotting profiles can show their relative strengths and weaknesses
  • Should be plotted as confidence bands to illustrate fringe of error
  • Interpret scores as different only when their bands do not overlap
  • Sometimes plotted separately by male and female (say, on vocational interest tests), but this is a controversial practice
  • Tests sometimes come with tabular or narrative reports of profiles (see p. 496)
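
The overlap rule can be sketched as below. It assumes each band is the score plus or minus 1 standard error of measurement (SEM); the subtest names, T-scores, and SEM value are made up for illustration.

```python
def band(score, sem):
    """Confidence band of plus or minus 1 SEM around a score."""
    return (score - sem, score + sem)

def bands_overlap(band_a, band_b):
    """True when two (low, high) bands share any score."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]

# Hypothetical T-scores on three subtests, each with SEM = 3
reading = band(58, 3)     # (55, 61)
arithmetic = band(54, 3)  # (51, 57)
vocabulary = band(47, 3)  # (44, 50)

print(bands_overlap(reading, arithmetic))  # True: do not treat as different
print(bands_overlap(reading, vocabulary))  # False: likely a real difference
```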



OH 9

Using Standard Scores to Examine Mastery of Skill Types



  • Some standardized tests try to provide some criterion-referenced information by providing scores on specific sets of skills (see Figure 19.4 on p. 498)
  • Be very cautious with these—use them as clues only, because each skill area typically has very few items



OH 10

Judging the Adequacy of Norms for Standard Scores



Remember Your Aim!


  • To interpret performance relative to a well-defined reference group



Criteria for Judging Norms


  1. Relevant
    • Is this particular norm group appropriate for (a) the decision you want to make and (b) the set of students involved?
  2. Representative
    • Was the norm group created with a random sample or stratified random sample? Does it match census figures (by race, sex, age, location, etc.) for the general population being considered?
  3. Up-to-date
    • Don’t rely on the copyright date of the test manual. Read the manual to see how old the norms are
    • Beware of Lake Wobegon effect!
  4. Comparable
    • If you want to compare scores on tests with different norm groups, check the test manuals to see how comparable the groups are
  5. Adequately described. Look for:
    • Method of sampling
    • Number and distribution of cases in the norming sample
    • Age, race, sex, geography, etc. of norm sample
    • Extent to which standardized conditions were maintained in testing
    • Prefer tests whose norms are described in more detail



OH 11

Cautions in Interpreting Standardized Test Scores



Scores should be interpreted:


  1. With clear knowledge about what the test measures. Don’t rely on titles; examine the content (breadth, etc.)
  2. In light of other factors (aptitudes, educational experiences, cultural background, health, motivation, etc.) that may have affected test performance
  3. According to the type of decision being made (high or low for what?)
  4. As a band of scores rather than a specific value. To get the band, subtract 1 SEM (standard error of measurement) from the score and add 1 SEM to it; this guards against overinterpretation
  5. In light of all your evidence. Look for corroborating or conflicting evidence
  6. Never rely on a single score to make a big decision