I wrote yesterday about the distinctions between assessment and feedback. This sparked some interesting comment which I want to explore here. I posted a diagram which Nick Rose and I designed for our forthcoming book. The original version of the figure looked like this:
We decided to do away with B – ‘Unreliable but valid?’ in the interests of clarity and simplicity. Sadly though, the world is rarely clear or simple.
Clearly D is the most desirable outcome – the assessment provides reliable measurements which result in valid inferences about what students know and can do. It’s equally clear that A is utterly undesirable. The interesting questions concern B (Is it possible to make a valid inference from an unreliable assessment?) and C (To what extent do we make invalid inferences from reliable assessments?).
Dylan Wiliam left a great comment on the previous post which goes some way to addressing both questions:
… an assessment cannot be valid and unreliable. Reliability is therefore a prerequisite for validity. But this creates a conceptual difficulty, because reliability is often in tension with validity, with attempts to increase reliability having the effect of reducing validity. Here’s the way I have found most useful in thinking about this.
Validity is a property of inferences based on assessment results. The two main threats to valid inferences are that the assessment assesses things it shouldn’t (e.g., a history exam assesses handwriting speed and neatness as well as historical thinking) and that the assessment doesn’t assess things it should (e.g., assessing English by focusing only on reading and writing, and ignoring speaking and listening). When an assessment assesses something it shouldn’t, this effect can be systematic or random. If bad handwriting lowers your score in a history examination, this is a systematic effect because it will affect all students with bad handwriting. If however, the score is lowered because of the particular choice of which topics students have revised, then this is a random effect. The particular choice of topics in any particular exam will help some students and not others. As another random effect, some students get their work marked by someone who gives the benefit of the doubt, and others get their work marked by someone who does not. We call the random component of assessments assessing things they shouldn’t an issue of reliability.
This is a very useful way for teachers to think about the mistakes we can make when assessing students.
However, it turns out that there’s a ‘scholarly debate’ about all this. Some theorists take the view that an unreliable assessment can still support valid inferences, with reliability seen as invariance and validity, independently, as unbiasedness. Pamela Moss argues that, “There can be validity without reliability if reliability is defined as consistency among independent measures.” If there is little agreement about what ‘good’ looks like, then assessment becomes increasingly subjective. For instance, trying to judge the merit of an art portfolio or a short film submitted as part of a media studies course is hard. Assessors can generally agree on whether something is rubbish but struggle to agree on how good the work submitted might be in any objective sense. Disagreement about the merits of a piece of work does not in itself invalidate anyone’s opinion.
Li (2003) has suggested that there “has been a tradition that multiple factors are introduced into a test to improve validity but decrease internal-consistent reliability.” In other words, increasing validity decreases reliability and vice versa. Heather Fearn has blogged persuasively about this tension between reliability and validity when it comes to assessing history. This, I would argue, is precisely why comparative judgement allows us to make more reliable judgements when standards are subjective.
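To make this concrete, here is a minimal sketch of how a pile of pairwise ‘which is better?’ judgements can be turned into a scale. The scripts, the judgements and the choice of a simple Bradley-Terry model are all my own illustrative assumptions; real comparative judgement tools use more sophisticated fitting and report reliability estimates alongside the scale.

```python
# Minimal sketch: turning pairwise comparative judgements into a scale
# with a Bradley-Terry model. Scripts and judgements are invented.
from collections import defaultdict

# Each tuple records one judgement: (winner, loser).
judgements = [
    ("script_A", "script_B"), ("script_A", "script_C"),
    ("script_B", "script_C"), ("script_A", "script_D"),
    ("script_C", "script_D"), ("script_B", "script_D"),
    ("script_B", "script_A"), ("script_C", "script_B"),
    ("script_D", "script_C"),
]

scripts = sorted({s for pair in judgements for s in pair})
wins = defaultdict(int)          # total wins per script
pair_counts = defaultdict(int)   # how often each pair was compared
for winner, loser in judgements:
    wins[winner] += 1
    pair_counts[frozenset((winner, loser))] += 1

# Zermelo/MM iteration for Bradley-Terry strengths:
# p_i <- wins_i / sum_j( n_ij / (p_i + p_j) )
p = {s: 1.0 for s in scripts}
for _ in range(500):
    new_p = {}
    for i in scripts:
        denom = sum(pair_counts[frozenset((i, j))] / (p[i] + p[j])
                    for j in scripts if j != i)
        new_p[i] = wins[i] / denom
    total = sum(new_p.values())
    p = {s: v / total for s, v in new_p.items()}  # normalise each pass

for s in sorted(scripts, key=p.get, reverse=True):
    print(f"{s}: estimated quality {p[s]:.3f}")
```

The appeal is that judges only ever have to decide which of two pieces of work is better, a decision that tends to be made far more consistently than awarding an absolute mark against a rubric.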
It also turns out that, as well as considering content validity – the extent to which an assessment samples from the subject domain being assessed – we should also think about face validity: the extent to which we take the results of an assessment at face value. Psychometricians generally ignore all this as being unspeakably vague, but social scientists quite like the idea of ‘common sense’ interpretations: if our inferences are plausible and fit with our view of the world, then they are valid.
I’ve written before about Messick’s model of validity and how this informs teachers’ interpretation of assessments, but even this might not go far enough. I’m currently picking my way through Newton and Shaw’s fascinating book, Validity in Educational and Psychological Assessment, to try to get my head round all this. Basically, they seem to be saying that assessments should strive to measure only what they purport to measure and should be sufficiently valid for us to be able to say with some degree of accuracy what students can and can’t do. Further, although validity is the central concern of test designers, it’s not the only important consideration. Acceptability is, in some ways, ‘bigger’: if students, teachers, parents and employers don’t accept a test on ethical or legal grounds, then a psychometrician’s opinion isn’t likely to count for much.
False positives and negatives
Javier Benítez, a medical doctor endlessly interested in education debates, asked why, if educational assessments are akin to assessments in other fields, we don’t talk more about false positives and false negatives. This is an important question.
A false positive would result in a student ‘passing’ an exam in a subject they knew little about, and a false negative would be where a student ‘failed’ despite being knowledgeable about a subject. How often does this sort of thing go on? Well, depending on the positioning of grade boundaries, potentially quite a lot. Grades can suggest a spurious accuracy: surely a student with a C grade must be more proficient than a student with a D grade? On average, this will be true, but when one mark separates the C/D boundary, can we really make that sort of claim? No. According to Ofqual, there’s only a 50-50 chance that a student at a boundary will have been placed on the right side of it.
And if we don’t trust the reliability of an exam, any such inference about a student’s ability is invalidated. When we consider that many GCSE and A level subjects have an inter-rater reliability coefficient (the likelihood that two exam markers would award the same marks) of between 0.6 and 0.7, this is a real and pressing concern. Dylan Wiliam estimated that if test reliability is as good as 0.9, about 23% of students would be misclassified, while if it is 0.85, about 27% will be misclassified.
According to Daniel Koretz, when an assessment has a reliability of 0.9 and 90% of students pass the exam, remarking the test would change the pass/fail status of 6% of students. If only 70% passed, then 12% would change on remarking. Seeing as national exams are nowhere near as reliable as 0.9, this suggests there are an awful lot of false positives and negatives every year.
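To see roughly where figures like these come from, here is a small simulation. The normal score model, the eight equal-sized grade bands and the sample size are all assumptions made purely for illustration, so it will not reproduce Wiliam’s or Koretz’s exact numbers, but it does show the mechanism: the noisier the measurement, the more students end up on the wrong side of a boundary.

```python
# Rough sketch of how test unreliability turns into misclassification.
# The normal score model, the eight equal-population grade bands and the
# sample size are illustrative assumptions, not Wiliam's or Koretz's
# actual calculations.
import random
from statistics import NormalDist

random.seed(1)

def misclassification_rate(reliability, n_grades=8, n_students=100_000):
    # Classical test theory setup: observed = sqrt(r)*true + sqrt(1-r)*error,
    # so the correlation between two parallel forms equals the reliability.
    boundaries = [NormalDist().inv_cdf((g + 1) / n_grades)
                  for g in range(n_grades - 1)]

    def grade(score):
        return sum(score > b for b in boundaries)

    misclassified = 0
    for _ in range(n_students):
        true = random.gauss(0, 1)
        observed = (reliability ** 0.5 * true
                    + (1 - reliability) ** 0.5 * random.gauss(0, 1))
        if grade(observed) != grade(true):
            misclassified += 1
    return misclassified / n_students

for r in (0.9, 0.85):
    print(f"reliability {r}: roughly {misclassification_rate(r):.0%} get the 'wrong' grade")
```

The exact percentage depends heavily on how many grades there are and where the boundaries sit, which is rather the point.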
The extent to which considerations of reliability and validity affect teachers and students is enormous. I hope this post goes some small way to prising a crowbar between the bars of your certainty and helping you to acknowledge that there’s far more to assessment than you ever imagined when you signed up as a teacher. This might all seem tediously abstract but I think familiarity with these concepts will help make teachers better at teaching.
Thank you for putting this up.
Very interesting, thank you
Interesting, and reminds me of a measurement distinction in science: that between accuracy and precision. Accurate and precise measurements of the same thing are represented by “D”. Diagram “A” represents a set of measurements that are not precise, but could, as a collection, provide an accurate answer (if they are averaged). So precision is about how close together repeated measurements of the same thing are (e.g. acceleration due to gravity), while accuracy is about how close they come to the “true” or accepted value (if there is one) as determined by lots of other measurements with more sophisticated equipment and by clever/disciplined/experienced people. Looks like we have some parallels here.
Hi Peter, yes: precision and accuracy are also important components of assessment. I originally came across the ideas in Nate Silver’s The Signal & The Noise, in which he uses the target metaphor to illustrate precision and accuracy.
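For anyone who prefers numbers to targets, here is a tiny, invented example of measurements that are precise but not accurate, the measurement-science cousin of ‘reliable but invalid’:

```python
# Tiny numeric version of the target metaphor: invented repeated
# measurements of g that are precise (tightly clustered) but not
# accurate (offset from the accepted value).
import statistics

accepted_g = 9.81                               # m/s^2, accepted value
measurements = [9.45, 9.52, 9.48, 9.50, 9.47]   # invented: clustered but offset

mean = statistics.mean(measurements)
print(f"precision (spread of repeats): ±{statistics.stdev(measurements):.2f} m/s^2")
print(f"accuracy  (offset from accepted value): {mean - accepted_g:+.2f} m/s^2")
```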
The important point about Pamela Moss’s argument is that she does not use the term “reliability” in the way it is generally used in psychometrics. So, in other words, you can have validity without reliability, but only if you change the meaning of reliability.
And there’s more about precision and accuracy in a paper I wrote in 1992 in the (now defunct) British Journal of Curriculum and Assessment: “Some technical issues in assessment: a user’s guide.”
Yes, I realise that – it’s why I signalled the ‘scholarly debate’ in inverted commas 🙂
Some additional points:
1. Yes, Ofqual is right to say that if a student is very close to the cut-score then their chance of being correctly classified is 50%, but this is a logical, not an empirical, observation.
2. To say that “Increasing validity decreases reliability and vice versa” is rather confusing if reliability is part of validity (which I think is the only way of looking at this that makes sense). If reliability is a component of validity, then what we are doing is more helpfully thought of as trading off one aspect of validity for another. Specifically, increasing reliability reduces one aspect of what psychometricians call construct-irrelevant variance (when the assessment assesses things it shouldn’t), and this is typically done by narrowing the things the assessment assesses (in the jargon, we have introduced a degree of construct under-representation).
3. Within a psychometric framework, face validity is generally not regarded as part of validity at all. Most technical definitions of face validity focus on whether the assessment appears to be assessing the things it is meant to be assessing. A classic court case in the US focused on whether items that did not seem to be about firefighting could be used in a test for applicants to a fire department. The court held (stupidly in my view) that items that looked as if they should be relevant to firefighting could be included, but others could not, even if the latter kinds of items actually predicted more accurately who would successfully qualify as firefighters after a training programme. This is an aspect of face validity. Within Messick’s framework, face validity can be regarded as part of the consequential basis of test interpretation and use, although it’s a bit forced.
4. There is a strong, rich, and long history of looking at false positives and false negatives in educational research, especially when scores are used to classify people. There is a particularly elegant theory called “Signal detection theory” (SDT) which allows cut scores to be set with regard to the fact that false negatives may be more or less important than false positives (think failing a doctor who should have passed versus passing a doctor who should have failed). In the early days of the National Curriculum, a number of people explored the use of SDT until we realized that politicians had no interest in the reliability of national curriculum assessments. They were both valid and reliable because the government said they were.
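To give a flavour of the kind of calculation SDT makes possible, here is a minimal sketch. The score distributions, the base rate and the relative costs are all invented; the only point is that making false positives costlier than false negatives (or vice versa) moves the optimal cut score.

```python
# Sketch of a signal-detection-style cut-score calculation. The score
# distributions, the base rate and the costs are invented for illustration.
from statistics import NormalDist

masters = NormalDist(mu=65, sigma=10)       # assumed scores of truly competent candidates
non_masters = NormalDist(mu=45, sigma=10)   # assumed scores of candidates who should fail
base_rate = 0.7                             # assumed proportion who are truly competent

def best_cut(cost_false_positive, cost_false_negative):
    def expected_cost(cut):
        false_negative = masters.cdf(cut)            # competent candidate fails
        false_positive = 1 - non_masters.cdf(cut)    # incompetent candidate passes
        return (base_rate * false_negative * cost_false_negative
                + (1 - base_rate) * false_positive * cost_false_positive)
    return min(range(30, 81), key=expected_cost)

# Invented weighting: passing a doctor who should have failed is treated
# as five times as costly as failing one who should have passed.
print("cut score with symmetric costs:", best_cut(1, 1))
print("cut score when false positives cost 5x:", best_cut(5, 1))
```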
I take your point on (2) that reliability is part of validity, and maybe I’m using the wrong terminology, but there seems to have been a move over the past 5-10 years to increase inter-rater reliability by insisting on ever-stricter interpretations of rubrics, which, because a rubric can only ever indicate possible content from a domain, allow for less broad, nuanced inferences from markers. If this isn’t trading reliability against validity, what is it?
You are talking about assessment here not feedback.
The water is indeed muddied when the two terms are used interchangeably. I recently wrote a piece on feedback and assessment
https://leadouteducation.com.au/2016/07/23/the-best-case-scenario-of-assignment-feedback-and-how-lmss-compare-in-allowing-it-to-happen/
and realized how much of what we do tries to serve the two masters.
For assessment, especially of the high stakes variety, validity and reliability are essential. For feedback, not so much.
The comparative approach advocated by Daisy Christodoulou (Life Beyond Levels) and facilitated by https://nomoremarking.com/ seems to show great promise.
You are absolutely right that there has, probably over the last 20 years, been a steady trend for increasing one aspect of reliability by tighter and tighter marking schemes, including rubrics. But I don’t think it helps to think of this as an issue of reliability versus validity. Rather, it is trading off one aspect of validity for another. Specifically, by increasing reliability (i.e., by reducing the random component of construct-irrelevant variance) the assessment process now under-represents the construct of interest (i.e., valid inferences about certain aspects of performance are less warranted). These two key ideas—construct-irrelevant variance and construct under-representation—may seem like jargon of the worst kind, but once you get your head around them, they are tremendously powerful ways of clarifying issues in assessment.
In this context, it is also worth noting that “inter-rater reliability” is not the only, and often not even the most important, aspect of reliability. The difference in marks awarded by the same individual marker, hour to hour, is almost as great as the difference between markers. And, particularly in our examination system, the impact of the particular choice of questions for the exam is generally an even bigger source of variation. Some people don’t want to treat this latter aspect of assessment as reliability, but it really is, since you are basically asking how lucky the candidate was: in the choice of marker, in the choice of which questions came up, and in how good the student felt that day.
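As a rough illustration of this point, here is a sketch in which the random ‘luck’ component is split into a marker effect, a question-choice effect and an on-the-day effect, with invented variance components (a real generalisability study would estimate these from a properly crossed design):

```python
# Sketch of the point that the marker is only one source of luck. The
# variance components (marker, question choice, on-the-day) are invented.
import random
import statistics

random.seed(3)

true_scores, observed = [], []
for _ in range(50_000):
    ability = random.gauss(60, 12)        # the stable thing we want to infer
    marker_luck = random.gauss(0, 3)      # lenient vs severe marker
    question_luck = random.gauss(0, 5)    # whether the topics that came up suited you
    on_the_day = random.gauss(0, 2)       # how the student happened to feel
    true_scores.append(ability)
    observed.append(ability + marker_luck + question_luck + on_the_day)

reliability = statistics.variance(true_scores) / statistics.variance(observed)
print(f"implied reliability of the exam: about {reliability:.2f}")
```

The reliability coefficient only registers the total random component; it does not care which source of luck it came from.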
The one important document is the AERA/APA/NCME ‘Standards for Educational and Psychological Testing’ 2014. Important standards from that document are also available in the chapter by Lauress L. Wise and Barbara S. Plake ‘Test design and development following the Standards … ’ in the 2016 ‘Handbook of Test Development’ edited by Suzanne Lane, Mark R. Raymond and Thomas M. Haladyna. Routledge has made that chapter available in its preview of the book (pp 19-39) here
I do not know if the following is useful, it is rather technical, and it certainly is not the received view in the world of assessment. Anyhow.
On ‘False positives and negatives’: in the psychometric literature this kind of talk is the usual way of treating the ‘unreliability of decisions’. More often than not, it is not the correct analysis of pass-fail decisions. The curious fact is that Edgeworth already in 1888 gave a fine treatment of the fairness of selection decisions (civil service exams), especially around the cut-off point.
Francis Y. Edgeworth (1888). The statistics of examinations. Journal of the Royal Statistical Society, 51, 599-635. here
An authorized summary of this and a second article is published in the little book by P. J. Hartog (1918). Examinations and their relation to culture and efficiency. London: Constable. pdf
Talk of ‘false positives and negatives’ assumes a threshold utility function on the variable (mastery, IQ, whatever) tested for. That is an extremely crude model of the value mastery has (for the institution?), and not fair at all, in my opinion. However, it is not at all clear what a reasonable utility function on mastery could be in particular situations. It is possible to try out a few functions and do some robustness analyses. There is a catch: it is important to distinguish between stakeholders. The party being served by psychometricians is the institution (teacher, school, boss, firm). In educational assessment, however, the primary stakeholder is the (individual) pupil. The pupil should be in a position to adequately predict the result of the coming assessment, for example by getting an opportunity to sit a try-out or preliminary assessment. The utility structure of the assessment is radically different for ‘school’ and pupil (institutional versus individual decision-making, a distinction made by Cronbach & Gleser, see below). My feeling is that talk about the reliability and validity of assessment should recognize the difference. The Standards do so in a general way by emphasizing the uses to which test scores are to be put. Ultimately, adequate models have to be developed; an example is the work by Robert van Naerssen (his work on selection was mentioned in the 2nd edition of Cronbach & Gleser, 1965, Psychological tests and personnel decisions), extended by myself here. Even a simple model is complex, illustrating how talking in a loose way about reliability and validity will not bring us very far in specific circumstances, especially where politicians have to be convinced.
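To make the ‘try out a few utility functions’ suggestion concrete, here is a minimal sketch in which the same pass/fail rule is evaluated under a crude threshold utility (the false-positive/false-negative picture) and under a smoother linear utility of mastery. The mastery distribution, the measurement error and both utility functions are invented purely for illustration; the point is only that the ‘best’ cut score depends on which utility model you adopt.

```python
# Sketch: the "best" cut score under two different utility functions of
# mastery. All distributions and utilities below are invented.
import random

random.seed(4)
MASTERY_NEEDED = 60   # the level of mastery the decision is supposed to protect

# Draw one cohort of candidates and reuse it for every cut score,
# so the comparison is not swamped by simulation noise.
cohort = []
for _ in range(40_000):
    mastery = random.gauss(60, 15)           # assumed distribution of mastery
    observed = mastery + random.gauss(0, 8)  # assumed measurement error
    cohort.append((mastery, observed))

def threshold_utility(mastery, passed):
    # All that matters is which side of the line the candidate really is on.
    if passed:
        return 1.0 if mastery >= MASTERY_NEEDED else -1.0
    return -1.0 if mastery >= MASTERY_NEEDED else 0.0

def linear_utility(mastery, passed):
    # Passing someone just below the line is only slightly bad;
    # passing someone far below it is very bad (and vice versa).
    return (mastery - MASTERY_NEEDED) / 20 if passed else 0.0

def expected_utility(cut, utility):
    return sum(utility(m, o >= cut) for m, o in cohort) / len(cohort)

for name, utility in (("threshold", threshold_utility), ("linear", linear_utility)):
    best = max(range(45, 76), key=lambda cut: expected_utility(cut, utility))
    print(f"best cut score under the {name} utility: about {best}")
```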