The Learning Spy - David Didau

Jul 12, 2016

#Dylan Wiliam #reliability #Daniel Koretz #validity

I wrote yesterday about the distinctions between assessment and feedback. This sparked some interesting comment which I want to explore here. I posted a diagram which Nick Rose and I designed for our forthcoming book. The original version of the figure looked like this:

We decided to do away with B - 'Unreliable but valid?' in the interests of clarity and simplicity. Sadly though, the world is rarely clear or simple.

Clearly D is the most desirable outcome - the assessment provides reliable measurements which result in valid inferences about what students know and can do. It's equally clear that A is utterly undesirable. The interesting questions are about B (Is it possible to make a valid inference if an assessment is unreliable?) and C (To what extent do we make invalid inferences from reliable assessments?).

Dylan Wiliam left a great comment on the previous post which goes some way to addressing both questions:

... an assessment cannot be valid and unreliable. Reliability is therefore a prerequisite for validity. But this creates a conceptual difficulty, because reliability is often in tension with validity, with attempts to increase reliability having the effect of reducing validity. Here’s the way I have found most useful in thinking about this.

Validity is a property of inferences based on assessment results. The two main threats to valid inferences are that the assessment assesses things it shouldn’t (e.g., a history exam assesses handwriting speed and neatness as well as historical thinking) and that the assessment doesn’t assess things it should (e.g., assessing English by focusing only on reading and writing, and ignoring speaking and listening). When an assessment assesses something it shouldn’t, this effect can be systematic or random. If bad handwriting lowers your score in a history examination, this is a systematic effect because it will affect all students with bad handwriting. If however, the score is lowered because of the particular choice of which topics students have revised, then this is a random effect. The particular choice of topics in any particular exam will help some students and not others. As another random effect, some students get their work marked by someone who gives the benefit of the doubt, and others get their work marked by someone who does not. We call the random component of assessments assessing things they shouldn’t an issue of reliability.

This is a very useful way for teachers to think about the mistakes we can make when assessing students.

However, it turns out that there's a 'scholarly debate' about all this. Some theorists take the view that an unreliable assessment can have validity. On this view, reliability is seen as invariance and validity, independently, as unbiasedness. Pamela Moss argues that, "There can be validity without reliability if reliability is defined as consistency among independent measures." If there is little agreement about what 'good' looks like, then assessment becomes increasingly subjective. For instance, trying to judge the merit of an art portfolio or a short film submitted as part of a media studies course is hard. Assessors can generally agree on whether some work is rubbish but struggle to agree on how good the submitted work might be in any objective sense. Just because there is disagreement about the merits of a piece of work does not invalidate anyone's opinions.

Li (2003) has suggested that there "has been a tradition that multiple factors are introduced into a test to improve validity but decrease internal-consistent reliability." Increasing validity decreases reliability and vice versa. Heather Fearn has blogged persuasively about this tension between reliability and validity when it comes to assessing history. This, I would argue, is precisely why comparative judgement allows us to make more reliable judgements when standards are subjective.

It also turns out that, as well as considering content validity - the extent to which an assessment samples from the subject domain being assessed - we should also think about face validity. This is an attempt to measure the extent to which we take the results of an assessment at face value. Psychometricians generally ignore all this as being unspeakably vague, but social scientists quite like the idea of 'common sense' interpretations: if our inferences are plausible and fit with our view of the world, then they are valid.

I've written before about Messick's model of validity and how this informs teachers' interpretation of assessments, but even this might not go far enough. I'm currently picking my way through Newton and Shaw's fascinating book, Validity in Educational and Psychological Assessment, to try to get my head round all this. Basically, they seem to be saying that assessments should strive hard to measure only what they purport to measure and have sufficient validity for us to be able to say with some degree of accuracy what students can and can't do. Further, although validity is the central concern of test designers, it's not the only important consideration. Acceptability is, in some ways, 'bigger' in that if students, teachers, parents and employers don't accept a test on ethical or legal grounds, then a psychometrician's opinion isn't likely to count for much.

False positives and negatives

Javier Benítez, a medical doctor endlessly interested in education debates, asked why, if educational assessments are akin to assessments in other fields, we don't talk more about false positives and false negatives. This is an important question.

A false positive would result in a student 'passing' an exam in a subject they knew little about, and a false negative would be where a student 'failed' despite being knowledgeable about a subject. How often does this sort of thing go on? Well, depending on the positioning of grade boundaries, potentially quite a lot. Grades can result in spurious accuracy: a student with a C grade must be more proficient than a student with a D grade. On average, this will be true, but when 1 mark separates the C/D boundary, can we really make that sort of claim? No. According to Ofqual, there's only a 50-50 chance that a student at a boundary will have been placed on the right side.
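To make the boundary problem concrete, here is a minimal sketch of how the chance of landing on the right side of a grade boundary changes as a student's 'true' score moves away from it. It assumes marking error is normally distributed around the true score, and the standard error of measurement of 5 marks is a made-up figure for illustration, not one taken from Ofqual or any exam board. The function name is my own.

```python
from statistics import NormalDist


def chance_of_correct_side(true_margin, sem):
    """Probability that the observed mark falls on the same side of a grade
    boundary as the student's true score.

    true_margin: distance in marks between the true score and the boundary.
    sem: standard error of measurement in marks (hypothetical value).
    Assumes normally distributed measurement error.
    """
    if sem <= 0:
        raise ValueError("sem must be positive")
    return NormalDist(mu=0, sigma=sem).cdf(abs(true_margin))


# Illustrative margins only; sem=5 is an assumption, not a published figure.
for margin in (0, 1, 5, 10):
    print(margin, round(chance_of_correct_side(margin, sem=5), 2))
```

Under these assumptions a student exactly on the boundary has a 0.5 chance of being placed correctly, which is the 50-50 figure above, and a student whose true score is 1 mark clear of the boundary is still only placed correctly about 58% of the time.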

And if we don't trust the reliability of an exam, any such inference about a student's ability must be invalidated. When we consider that many GCSE and A level subjects have an inter-rater reliability coefficient (roughly, how consistently two exam markers would award the same marks) of between 0.6 and 0.7, this is a real and pressing concern. Dylan Wiliam estimated that if test reliability is as good as 0.9, about 23% of students would be misclassified, while if it is 0.85, about 27% will be misclassified.

According to Daniel Koretz, when an assessment has a reliability of 0.9, if 90% of students pass an exam and the test were re-marked, 6% of students who had passed or failed would change their status. If only 70% passed, then 12% would change on re-marking. Seeing as national exams are nowhere near as reliable as 0.9, this would suggest there are an awful lot of false positives and negatives every year.
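For anyone who wants to play with these numbers, the following is a rough simulation sketch rather than Koretz's own method. It assumes each student has a normally distributed 'true' score, that each marking adds independent normal error, and that reliability is the share of observed-score variance due to true scores (so two markings correlate at the reliability value). The function name, cohort size and seed are my own choices.

```python
import random
from statistics import NormalDist


def share_changing_status(reliability, pass_rate, n_students=50_000, seed=0):
    """Estimate the share of students whose pass/fail result flips on re-marking.

    Assumes normal true scores plus independent normal marking error, with
    reliability = true-score variance / observed-score variance.
    """
    rng = random.Random(seed)
    # Observed scores have unit variance, so the cut score is a normal quantile.
    cut = NormalDist().inv_cdf(1 - pass_rate)
    true_sd = reliability ** 0.5
    error_sd = (1 - reliability) ** 0.5
    flipped = 0
    for _ in range(n_students):
        true_score = rng.gauss(0, true_sd)
        first = true_score + rng.gauss(0, error_sd)   # original marking
        second = true_score + rng.gauss(0, error_sd)  # re-marking
        if (first >= cut) != (second >= cut):
            flipped += 1
    return flipped / n_students


# The last row uses a reliability in the 0.6-0.7 range mentioned earlier.
for rel, rate in ((0.9, 0.9), (0.9, 0.7), (0.65, 0.7)):
    print(rel, rate, round(share_changing_status(rel, rate), 3))
```

Under these assumptions the first two rows should land near the 6% and 12% figures above, and lowering the reliability pushes the share of changed results higher still.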

The extent to which considerations of reliability and validity affect teachers and students is enormous. I hope this post goes some small way to prising a crowbar between the bars of your certainty and helping you to acknowledge that there's far more to assessment than you ever imagined when you signed up as a teacher. This might all seem tediously abstract but I think familiarity with these concepts will help make teachers better at teaching.
