This is #19 in my series on the Top 20 Principles From Psychology for Teaching and Learning and the second of three posts examining how to assess students’ progress: “Students’ skills, knowledge, and abilities are best measured with assessment processes grounded in psychological science with well-defined standards for quality and fairness.”

The more I read on this subject, the more it becomes clear how widely misunderstood testing and assessment are. But does this actually matter? Do teachers need to know about such issues as reliability, precision and validity? Isn’t this just a matter for exam boards and Ofqual? Well, it’s been designated as one of the Top 20 psychological principles that teachers ought to know about. And if that doesn’t sway you, Dylan Wiliam, in typically bullish form, argues that “it would be reasonable to say that a teacher without at least some understanding of these issues should not be working in public schools.” I reckon that suggests there are an awful lot of teachers at risk of being fired if Dylan is ever made Secretary of State for Education!

He suggests there are five areas about which classroom teachers ought to be knowledgeable, summarised from the 2014 Standards for Educational and Psychological Testing.

1. Validity is not a property of tests

The Top 20 report states that we should ask ourselves four questions when considering the validity of an assessment:

  • How much of what you want to measure is actually being measured?
  • How much of what you did not intend to measure is actually being measured?
  • What are the intended and unintended consequences of the assessment?
  • What evidence do you have to support your answers to the first three questions?

These are important considerations, but we also need to understand that “validity is best thought of as a property of inferences based on test outcomes rather than as a property of tests.” (Wiliam 2014) Trying to work out whether a test is always valid is a fool’s errand, as different groups of students will vary in the way in which they answer questions.

2. Validation is the joint responsibility of the test developer and the test user

Rather than getting bogged down in ensuring that a test will always be valid before it’s ever taken, we should instead see the responsibility for establishing validity as a joint venture. Wiliam suggests that “test users have a responsibility to determine whether the validity evidence produced by the developer does in fact support the chosen use of the test.” In other words, are we, as teachers, using tests for the purpose for which they were designed? If not, then it’s our own fault if we get meaningless results. This might suggest that using past examination papers for in-class preparation or revision will not, and should not be expected to, yield valid information.

3. The need for precision increases as the consequences of decisions and interpretations grow in importance

The reliability of a test is just as important as its validity. We need to trust that results allow us to make meaningful inferences about students’ knowledge, skills, and abilities. Essentially, the higher the stakes, the more important precise measurement becomes. This is particularly important when test results are presented as grades, bands or levels. Generally speaking, we seem all too keen on ‘spurious precision’. If a student gets 88% on a test we’re likely to believe they’re more able than someone who gets 87%. But this might not be true: such small differences tell us relatively little about a person’s ability. This is perhaps less problematic than ‘spurious accuracy’. We’re even more likely to infer that a student with a B is more able than a student with a C, but this difference in grades might represent a difference of only a single mark. Wiliam suggests that “teachers need to be aware of the inevitability of errors in any test score or grade awarded… Increasing reliability without narrowing the focus of a test almost always means increasing the length of the test, and so increasing reliability almost always involves taking time for testing away from teaching. Relatively low reliability may therefore be optimal if the consequences of decisions are not that important.”
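To make this concrete: a standard result from classical test theory is that the standard error of measurement (SEM) of a test is SD × √(1 − reliability). Here is a quick sketch in Python using illustrative figures – a test with a standard deviation of 10 marks and a reliability of 0.9 are assumptions for the sake of the example, not real exam data:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Illustrative values: a test with an SD of 10 marks and reliability of 0.9
error = sem(sd=10, reliability=0.9)

# 95% confidence band around an observed score of 88
lower, upper = 88 - 1.96 * error, 88 + 1.96 * error
print(f"SEM = {error:.2f} marks")
print(f"95% band around a score of 88: {lower:.1f} to {upper:.1f}")
```

On these figures, a score of 88 could plausibly reflect ‘true’ attainment anywhere between roughly 82 and 94 – which makes the difference between 87 and 88, or between two adjacent grade boundaries, look rather fragile.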

4. A test that is fair within the meaning of the standards reflects the same construct(s) for all test takers

Valid assessments need to be clear about what is and is not supposed to be measured. Fairness is an important component of validity, but what does it mean? Should it suggest that all students have an equally good chance of performing well? Well, yes and no. Yes, because it would obviously be unjust to discriminate against certain groups of students, but no because most tests are designed to reveal those differences, not obscure them. This is particularly problematic for SEN students. Should physical disabilities prevent a student from accessing a test? Of course not. But what about cognitive or emotional issues? To help us think more clearly about these issues it’s useful to distinguish between adaptations and accommodations. An accommodation is a relatively minor change in the way a test is presented, taken or administered in order to provide fair access but still allow for results comparable to a test that hasn’t been accommodated in some way. An adaptation, on the other hand, is intended to “transform the construct being measured, including the test content and/or testing conditions, to get a reasonable measure of a somewhat different but appropriate construct for designated test takers” (Standards, p. 59). This is clearly more controversial and opens up discussion about how far a test can be adapted whilst still being fair. As the Top 20 report says, “Tests showing real, relevant differences are fair; tests showing differences that are unrelated to the purpose of the test are not.”

5. When an individual student’s scores from different tests are compared, any educational decision based on the comparison should take into account the extent of overlap between the two constructs and the reliability or standard error of the difference score

If test results are used to monitor or track students, we need to be aware that they might not be as valid as we might hope. When schools report grades or levels to parents throughout the year, what do these grades or levels actually mean? We think we know what we’re talking about – after all, an A’s an A, right? – but very often we’re wrong. If the scores we’re comparing come from different tests, then we’re comparing apples with pears. We know we’re looking at some fruit, but what does this tell us about students’ relative achievement? Wiliam suggests that “it is not widely appreciated by teachers and administrators that the standard errors of the difference scores are often considerably larger than the student growth over the course of a year.” And clearly, the higher the stakes, the less reliable the information is likely to be.
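Wiliam’s point about difference scores can also be made arithmetically. When two test scores are independent, the standard error of the difference between them is the square root of the sum of their squared SEMs – so the noise compounds rather than cancels. A brief sketch, again with made-up figures chosen purely for illustration:

```python
import math

def se_difference(sem1: float, sem2: float) -> float:
    """Standard error of the difference between two independent test scores."""
    return math.sqrt(sem1 ** 2 + sem2 ** 2)

# Two tests, each with an assumed SEM of 3 marks
se_diff = se_difference(3.0, 3.0)
apparent_growth = 4  # marks gained between an autumn and a summer test

print(f"SE of the difference = {se_diff:.2f} marks")
print(f"Apparent growth of {apparent_growth} marks vs 95% noise band of ±{1.96 * se_diff:.1f}")
```

With each test carrying an SEM of just 3 marks, an apparent gain of 4 marks sits comfortably inside a 95% noise band of about ±8 marks – exactly the situation Wiliam describes, where measurement error swamps a year’s growth.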

We also need to be aware that tests measure motivation as well as the skills and knowledge ostensibly being tested. If a test has no meaning for students, they may not be motivated to make much effort. With tests which are low stakes for students but very high stakes for teachers and schools, we need to be aware that the inferences we can make will be dubious at best. A test is much more likely to provide accurate information on what students have learned when it is low stakes for teachers but high stakes for students.

So, what does all this mean for teachers? We need to be aware of the strengths and limitations of the assessments we design and use. The list of suggestions provided by the Top 20 report seems very balanced. It suggests:

  • Carefully aligning assessments with what is taught.
  • Using a sufficient number of questions overall, and a variety of question types on the same topic.
  • Using item analysis to identify questions that are too hard or too easy and so provide insufficient differentiation (e.g., questions that 100% of students answered correctly).
  • Being mindful that tests that are valid for one use or setting may not be valid for another.
  • Basing high-stakes decisions on multiple measures instead of a single test.
  • Monitoring outcomes to determine whether there are consistent discrepancies across performance or outcomes of students from different cultural groups. For example, are some subgroups of students routinely overrepresented in certain types of programming (e.g. SEN)?
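The item analysis the report mentions can be done with nothing more than a mark sheet: the facility of each question (the proportion of students answering it correctly) tells you whether it differentiates at all. A minimal sketch with invented response data – the thresholds of 0.9 and 0.2 are common rules of thumb, not fixed standards:

```python
# Rows: students; columns: items (1 = correct, 0 = incorrect). Invented data.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
]

n_students = len(responses)
for item in range(len(responses[0])):
    # Facility: proportion of students who answered this item correctly
    facility = sum(row[item] for row in responses) / n_students
    if facility >= 0.9 or facility <= 0.2:
        verdict = "little differentiation"
    else:
        verdict = "ok"
    print(f"item {item + 1}: facility {facility:.2f} ({verdict})")
```

Items that nearly everyone gets right (or wrong) add testing time without adding information – which connects back to Wiliam’s point that increasing reliability usually means a longer test, so every question needs to earn its place.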

References cited in the report