Is it progress if a cannibal uses a fork?
Stanislaw J. Lec
For some time now I’ve been of the opinion that the way we normally think of progress is based on a myth. Part of the problem is that because we tend to believe we can see learning, we routinely miss the fact that what students can do here and now tells us relatively little about what they can do elsewhere and later. We assume that what we can see is learning when, in fact, it is only a snapshot of current performance.
In What If Everything We Knew About Education Was Wrong? I argue that
Progress is just a metaphor. It doesn’t really describe objective reality; it provides a comforting fiction to conceal the absurdity of our lives. We can’t help using metaphors to describe learning because we have no idea what it actually looks like. Even though our metaphors are imprecise approximations, the metaphors we use matter. They permeate our thinking. (p. 148-9)
Whenever we attempt to assess learning it’s very hard to see past this metaphor of progress, which results in some really dreadful hokum. Many, many schools are assessing very poor proxies for learning and concluding that, hey presto! progress is being made. They might be able to pull the wool over inspectors’ eyes, but I strongly suspect external results will fail to marry up with internal judgements about progress in more than a few cases. Whenever we assess students’ work we should always ask two questions:
- What is it we actually want to measure?
- How can we go about measuring it with reasonable levels of reliability and validity?
Intuitively, most teachers believe they’re pretty good at using mark schemes but the evidence is against them. There’s a conspiracy of silence about the fact that using assessment objectives (AOs) to judge the quality of pupils’ work is hopelessly flawed. We can definitely tell the difference between an A and E, and we can probably spot the difference between a B and D, but when it comes to the fine judgement necessary to distinguish between a C and D we run into very predictable difficulties. As I discussed here, the human brain is just not very good at distinguishing between these kinds of outcomes.
Examination boards have responded to the inherent weakness of mark schemes by dreaming up ways to increase the reliability of their rubrics, but all this comes at the cost of validity. Lots of people seem to feel that GCSEs or A levels don’t tell us what we actually want to know. The question is, what do we want to know? In most cases the answer is as simple, and as complex, as how good a student is at a particular subject.
Consider, for example, English. Most people will agree that we want to find out how good students are at comprehension (reading) and composition (writing). But what does that actually mean? What is it we really want students to be able to do? Most of what we assess are proxies. We design rubrics to tell us everything about a piece of work, but they end up telling us nothing useful. Exam rubrics would have us believe that we can assess the quality of a student’s work across a range of AOs, but in reality it’s only really possible to assess one thing at a time. An exam should, if it is to tell us anything useful, be designed like a scientific experiment, with as many of the variables controlled as possible.
Here are the criteria against which writing is assessed in the new English Language GCSE:
AO5
- Communicate clearly, effectively and imaginatively, selecting and adapting tone, style and register for different forms, purposes and audiences.
- Organise information and ideas, using structural and grammatical features to support coherence and cohesion of texts.
AO6
- Candidates must use a range of vocabulary and sentence structures for clarity, purpose and effect, with accurate spelling and punctuation.
So a single piece of writing might be assessed for clarity, effectiveness, imagination, tone, style, register, organisation of ideas, organisation of information, coherence and cohesion. How many of those things do you feel confident in being able even to define, let alone objectively assess? The very best we can hope to get is a vague approximation. In Reliability, validity and all that jazz, Dylan Wiliam estimated that if test reliability were as good as 0.9, about 23% of students would be misclassified. If reliability were 0.85, about 27% would be misclassified. For most GCSE subjects reliability is below 0.8.
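To get a feel for why even a reliability of 0.9 misclassifies so many candidates, here is a rough simulation sketch. This is my own illustration rather than Wiliam’s calculation, and the eight equal-population grade bands are a hypothetical assumption:

```python
import numpy as np

# Illustrative only: under classical test theory, an observed score is a true
# score plus error, and reliability is the share of observed-score variance
# attributable to true scores. We band candidates into hypothetical grades on
# both their true and observed scores and count how many land in the wrong band.
rng = np.random.default_rng(0)

def misclassification_rate(reliability, n_candidates=200_000, n_grades=8):
    true = rng.normal(0.0, 1.0, n_candidates)
    error_sd = np.sqrt((1 - reliability) / reliability)  # makes var(true)/var(observed) = reliability
    observed = true + rng.normal(0.0, error_sd, n_candidates)
    qs = np.linspace(0, 1, n_grades + 1)[1:-1]           # equal-population grade boundaries
    true_grade = np.digitize(true, np.quantile(true, qs))
    observed_grade = np.digitize(observed, np.quantile(observed, qs))
    return np.mean(true_grade != observed_grade)

for r in (0.9, 0.85, 0.8):
    print(f"reliability {r:.2f}: about {misclassification_rate(r):.0%} misclassified")
```

The exact percentages will differ from Wiliam’s figures, since they depend on the grade structure assumed, but the pattern is the same: the proportion of misclassified candidates climbs quickly as reliability falls.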
What on earth can we do when the most reliable estimations of progress we have are so, well, unreliable? Finally, I have something other than a counsel of despair to offer. On Friday I spent the morning with Dr Chris Wheadon of nomoremarking. He has been using comparative judgement to increase the reliability of human judgements of quality. While we’re poor at judging quality in isolation, we’re really good at making comparisons.
Here’s an example. How good, on a scale of 1–24, is this?
Hard, isn’t it? Try using the AOs above. Does that help? What about if I give you a rubric:
Does that help? Even if you did manage to award a mark which a majority of other people agreed with, how long would it take you?
Now try this:
Which is better?
Easy, isn’t it? Judging requires a different mindset to marking and, ideally, you’ll make your judgement in about 30 seconds. Now imagine that the judgements were made by a number of different experts and those judgements were aggregated. How reliable do you think the judgement would be? The formula Dr Wheadon recommends is “multiplying the number of scripts you have by 5. So if you have 10 scripts we would advise 50 judgements. Under this model each script is judged 10 times.”
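The post doesn’t spell out the statistics behind the aggregation, but one common way of turning many pairwise “which is better?” judgements into a single quality scale is a Bradley-Terry model. The sketch below is purely illustrative: the scripts and judgements are simulated, the model choice is my assumption rather than a description of nomoremarking’s engine, and the 50 judgements simply follow Wheadon’s five-per-script rule of thumb:

```python
import numpy as np

# Hedged sketch: fit a simple Bradley-Terry model to simulated pairwise
# judgements. Each judgement records only "which of these two scripts is better?";
# the fitted strengths put every script on a common scale.
rng = np.random.default_rng(1)

def bradley_terry(n_scripts, judgements, n_iters=200):
    """judgements: list of (winner, loser) index pairs."""
    wins = np.full(n_scripts, 0.5)                 # 0.5 pseudo-wins keep estimates finite
    for w, _ in judgements:
        wins[w] += 1
    strength = np.ones(n_scripts)
    for _ in range(n_iters):                       # standard minorise-maximise updates
        denom = np.zeros(n_scripts)
        for w, l in judgements:
            denom[w] += 1.0 / (strength[w] + strength[l])
            denom[l] += 1.0 / (strength[w] + strength[l])
        strength = wins / np.maximum(denom, 1e-12)
        strength /= strength.mean()                # fix the overall scale
    return strength

n_scripts = 10
n_judgements = 5 * n_scripts                       # Wheadon's rule: 50 judgements for 10 scripts,
                                                   # so each script is seen about 10 times
judgements = []
for _ in range(n_judgements):
    a, b = rng.choice(n_scripts, size=2, replace=False)
    # Pretend higher-numbered scripts really are better, with some judge noise
    winner, loser = (a, b) if rng.random() < 1 / (1 + np.exp(b - a)) else (b, a)
    judgements.append((int(winner), int(loser)))

print(np.argsort(bradley_terry(n_scripts, judgements)))  # scripts ordered weakest to strongest
```

At roughly 30 seconds per decision, those 50 judgements amount to about 25 minutes of judging time in total, which is the practical appeal: nobody ever marks against a rubric, they just keep answering “which is better?”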
So what does all this have to do with assessing progress? What Chris and his team have been trialling in a number of schools is running a comparative judgement at the beginning of Year 7 and then running another, on a similar question, at the end of Year 7. In most schools, children are clearly making progress: their answers are, according to the aggregated judgement of a number of experts, better in the second test than the first. In some schools results soar, with children achieving vastly superior scores in the space of a year. Alarmingly, in other schools, results tank. All this should, of course, be treated with caution, but surely this is valuable information which every teacher and every school would want to know? Perhaps predictably, the schools in which results have dropped have quietly withdrawn from the trial, but how else can we ever reliably know if teaching is having a positive effect?
It seems to me that this kind of comparative judgement provides rich information about students’ apparent progress which, as long as the data wasn’t used punitively, could be hugely useful in informing teachers’ professional development and in giving schools a clear sense of what is happening within year groups. I suggested to Chris that “no more marking” failed to convey the power of his system. Maybe it would appeal more to schools if marketed as “proof of progress”?
In Part 2, I will explore some of the barriers to rethinking assessment.
Really like the look of this. There’s a key phrase towards the end, though: ‘as long as the data wasn’t used punitively’. We really have to try and refocus our assessment on to learning, rather than management and surveillance. A tough battle in the current climate.
It sure is. Ultimately though if schools are determined to make high-stakes judgements I’d prefer them to be based on data which was more reliable and valid.
Anything that creates more non-contact jobs for educators is always warmly welcomed.
So is the comparative judgement made against a rubric/mark scheme/assessment criteria? There must be some judgement of quality being made?
The only judgement is: which is best, x or y? It relies solely on the aggregation of expert intuition.
I’m genuinely not trying to be awkward, but I’m not sure I could easily say whether one of those was ‘better’ than the other. Which kind of ‘better’ are we talking about? The first one shows some sensitive use of language, and I like how precise the dialogue is. The punctuation is fairly accurate and the child has not just written reams and reams. The second one has more energy and pace, but is let down by punctuation. The vocabulary was perhaps less interesting, but that might have been the writer’s intention. I love the listing in threes at the end, and the phrase ‘She was over.’
I think stories are one example where it is very difficult indeed to make comparative judgements between similar pieces of work, because ‘best’ is so subjective.
Sue: you rely on your intuitive judgement. You might be ‘wrong’; it doesn’t matter: the method relies on the statistical power of aggregating judgements. The whole point is that there IS no definitive way of proving one thing is better than another. A rubric might be a great way to diagnose why a thing was good or bad, but it’s an inherently dishonest way of judging quality. Sometimes we find it easy, sometimes it’s hard. This is one occasion when it’s OK to just go with your gut and trust the process.
I was interested in the fact that you said it was “easy” to tell, since I didn’t find it easy at all. I felt it was more like comparing chalk with cheese. Could I ask which one of the two you felt was best, as I’m genuinely not sure which one I would pick? It depends a lot, surely, on what you are assessing for, especially with stories?
Also, if we’re aggregating judgements, then doesn’t that run the risk of aggregating bias? For instance, it’s been my experience that lots of teachers (myself included) will judge a piece with messy handwriting as ‘worse’ than one with neat handwriting. I have to be very conscious of my tendency to do that if I am meant to be judging a piece on its literary merit and not on its neatness.
I preferred the longer, messier looking one.
I think aggregating judgements is more likely to avoid bias unless you think we all suffer with identical biases. I, for instance, have largely immunised myself against the messy bias 🙂
Are you sure you didn’t prefer it because it was longer, though? Your bias could easily be to do with length rather than neatness.
I am sure, yes. I read both in their entirety and decided I liked it better. My bias is a preference for better writing 🙂
I’m trialling the comparative judgment approach with my Y6 Computing class at the moment having heard Chris Wheadon speak at an event recently.
Each child filled out a “What I know about Cryptography” section in a KWL grid at the start of our unit. At the end of the unit, I got them to write a response to the same question. We can then look at judging them comparatively in random pairs using Chris’s website.
Next week I plan to get the children to do some of the judging themselves and will look at the results. Hopefully we will find the second responses are clearly superior to the first ones!
There is so much I find of great interest here. As a former mathematics teacher involved in a 100% GCSE coursework scheme (1986-1995), we used broad assessment criteria based upon students solving problems and writing up their findings/outcomes. Of course our marking was as subjective as marking writing for GCSE English. As you comment, it was ‘easy’ to distinguish between an A grade and a C and a C and an E, but one ‘acid’ test between C and B grade work was, for example, how a student had been able to use and apply concepts such as Pythagoras’ theorem or trigonometry or transformations; a C grade student could derive the formula but would not easily be able to apply it in a non-obvious context. Sadly, as far as mathematics goes, there are too many folk who believe it can be assessed objectively according to how many marks are gained on a test. I think testing is facile, unfit for purpose and lacking in validity or reliability. Thus criterion referencing (CR), at least, has the possibility of greater reliability, and in these days of greater focus on problem solving CR might be the best option.
Mike, I have to say, I find your conclusions somewhat odd. If you think tests are ‘facile’ why do you then recommend criterion referencing? Our ability to reference outcomes according to criteria is precisely what I’ve argued we can’t do: it might offer greater validity but only at the cost of reliability.
A standardised test is the only method which offers much in the way of reliability and by using comparative judgement we can get this reliability without the validity trade off.
I’ve written more about the problems with teacher assessment here: https://www.learningspy.co.uk/assessment/tests-dont-kill-people/
Rather than go through two comparative processes, wouldn’t it be easier to compare the same child’s work from the start and the end of the year? Surely this is a quicker way to confirm whether progress has been made?
I’m not sure what you mean. It wouldn’t make any sense to compare the same piece of work, would it? And if you mean compare two separate pieces of work just relying on teacher assessment, then you’re back in the territory of unconscious bias. The only way to confirm progress with the right combination of accuracy and reliability is to remove individual judgement. The only way to establish progress with any validity is to include human judgement. To get a true picture, the only real option for anyone interested in anything other than confirming bias is to aggregate experts’ comparative judgements.
If quickness is the only concern then the fact this is done remotely by someone else should remove that as a factor, surely?
I think I don’t fully understand the mechanics of the whole process. Looking forward to your blog on the nuts and bolts of exactly how it works. I’m yet to be convinced but I’m yet to be convinced I even understand the process.