I’ve been writing enthusiastically about Comparative Judgement to assess children’s performance for some months now. Some people, though, are understandably suspicious of the idea. That’s pretty normal. As a species we tend to be suspicious of anything unfamiliar and like stuff we’ve seen before. When something new comes along there will always be those who get over-excited and curmudgeons who suck their teeth and shake their heads. Scepticism is healthy.
Here are a few of the criticisms I’ve seen of comparative judgement:
- It’s not accurate.
- Ranking children is cruel and unfair.
- It produces data which says whether a child has passed or failed.
- It attempts to rank effort as well as attainment and this isn’t possible without knowing the child involved.
- It’s just about data.
- It’s worse than levels. Bring back levels! Everything was better in the past.
- It leads to focussing on cohorts instead of individuals.
- Parents don’t want their children compared to other children.
- Learning shouldn’t be measured as an abstract thing.
- We should take our time marking students’ work because they’ve taken time to produce it.
There may be others.
Let’s deal with each in turn.
- I can absolutely understand why we might feel sceptical that a fast, intuitive judgement can tell us as much as slow, analytical marking. Surely spending 10 minutes poring over a piece of writing, cross-referencing against a rubric, has to be better than making a cursory judgement in a few seconds? On one level this may be true. Reading something in detail will obviously provide a lot more information than skim reading it. There are, however, two points to consider. First, is the extra time spent marking worth the extra information gained? This of course depends. What are you planning to do as a result of reading the work? What else could you do with the time? Second, contrary to our intuitions, the reliability of aggregated judgements is much greater than that achieved by expert markers in national exams. The reliability of GCSE and A level marking for essay-based examinations is between 0.6 and 0.7, which indicates a 30-40% probability that a different marker would award a different mark. This is why so many papers have their marks challenged every year. But if we aggregate a sufficient number of judgements (around five per script, i.e. 5 × n judgements for n scripts), we end up with a reliability above 0.9. Although any individual judgement may be wildly inaccurate, in aggregate the judgements will produce much more accurate marks than an expert examiner. (There’s a toy simulation of this effect after the list.)
- It may well be both cruel and unfair to rank children; I’m genuinely ambivalent about that. However, a comparative judgement doesn’t attempt to rank children, just their work. Teacher assessments, on the other hand, are much more likely to judge the child rather than the work as investigations into the ‘Halo effect’ have consistently shown. We are all inadvertently prone to biases which end up privileging students based on their socio-economic background, race and gender. If anything, comparative judgement is less cruel and less unfair than marking.
- We might feel squeamish about the idea of an assessment that produces data about whether children have passed or failed, but that is the purpose of assessment. Think about it: what’s the point of setting an assessment which fails to provide you with information about a child’s current performance?
- I would be completely against using comparative judgement to rank students’ effort as well as attainment. It really isn’t possible to say anything meaningful about a child’s effort without knowing them. Thankfully, I’ve never heard of anyone using CJ in this way.
- The criticism that CJ is just about data collection is bizarre. The purpose of the judgements is to focus on the actual work produced by students rather than on trying to use a rubric to assign a mark. Numbers are entirely optional and some of my most successful experiences of using CJ have been when we didn’t make any effort to collect or record data.
- I too miss NC Levels. There was a lot about them to like. A group of subject experts spent months thinking deeply about children’s progression and produced an astonishingly detailed and useful set of documents. I’m saddened by the mad rush of many schools to create their own inferior versions. The point is that using CJ has nothing to do with whether you also choose to use some kind of level system. We just have to understand what levels can and can’t do: they’re great for helping us understand why one piece of work is better than another, but terrible at helping us assign a mark. My advice: use Levels after CJ to make sense of the rank order you end up with.
- Some people are concerned that producing a rank order means that teachers will end up generalising about an amorphous cohort rather than thinking about children as individuals. This anxiety is understandable, but thankfully misplaced. As mentioned above, CJ focuses teachers’ attention on the work students produce. The judging is only the first part of the process: once work has been ranked, detailed conversations about that work are provoked. If anything, this helps us better understand why an individual may be falling down and helps us pinpoint how we can help them.
- As a parent, I’m squeamish about the idea of my children being compared to others. My youngest daughter is in Year 6 and about to get her SATs results, and we’re all waiting with bated breath. Inevitably, she’ll find out how she compares to her classmates. But what’s the alternative? Not giving parents grades at all? I may not want my daughter to feel upset about how well she’s done compared to others, but I don’t think I’m alone in being pretty keen to get some kind of objective measure of how she’s done. The real point is that comparing children has nothing at all to do with comparative judgement, as we saw in point 2 above. That said, what CJ does offer is the ability to show progress much more reliably than any other assessment method. Most parents are, I think, very interested in knowing whether their children are making progress.
- We should absolutely try to avoid talking about learning in the abstract. This is hard because learning is abstract. You can’t see it, touch it, or taste it. Because of this we come up with metaphors to try to make it more tangible. This is how we end up having conversations revolving around sub-levels of progress, or predicted grades, as if they actually meant something concrete. All assessments provide us with a proxy; the point is whether or not it’s a good proxy. I would argue that CJ allows us to make better inferences about learning as an abstract thing because it’s so focussed on the concrete. The absence of rubrics means we are one step nearer the thing itself. Additionally, not having a rubric also means we are likely to get a more valid sample of students’ ability within a domain. Because a rubric depends on attempting to describe indicative content, it warps both teaching and assessment; teachers use mark schemes to define the curriculum and examiners search for indicative content, ignoring aspects of great work that didn’t make it into the rubric.
- In an ideal world we would put the same effort into reading students’ work as they put into creating it. Sadly, this thinking has led to the steady rise in teachers’ workload and mounting feelings of guilt and anxiety. No teacher, no matter how good they are, will ever be able to sustain this kind of marking for long. But maybe we’ve been asking the wrong question. Maybe instead we should ask, if students have put all this effort into their work, is it fair that we assess it unfairly and unreliably? The other point is that the 30-second intuitive judgement is only desirable during the judging process. In order to provide meaningful feedback of course you actually have to spend time reading the work too.
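To get a feel for why aggregating judgements works, here is a minimal simulation in Python. It is purely illustrative: the win-counting below is not the model any CJ platform actually uses (real engines fit something like a Bradley-Terry or Rasch model), and every parameter in it (the number of scripts, the judge noise) is invented.

```python
import random

random.seed(0)

N_SCRIPTS = 50        # hypothetical number of pieces of work
JUDGE_NOISE = 1.0     # how error-prone a single judgement is (invented value)
true_quality = [random.gauss(0, 1) for _ in range(N_SCRIPTS)]

def judge(a, b):
    """One fallible judgement: pick the 'better' of two scripts, with noise."""
    noisy_a = true_quality[a] + random.gauss(0, JUDGE_NOISE)
    noisy_b = true_quality[b] + random.gauss(0, JUDGE_NOISE)
    return a if noisy_a > noisy_b else b

def spearman(xs, ys):
    """Spearman rank correlation (ignores ties, which is fine for a sketch)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

for per_script in (1, 5, 20):            # judgements per script
    wins = [0] * N_SCRIPTS               # aggregate by counting wins
    for _ in range(per_script * N_SCRIPTS // 2):
        a, b = random.sample(range(N_SCRIPTS), 2)
        wins[judge(a, b)] += 1
    print(per_script, "judgements/script -> correlation with true quality:",
          round(spearman(wins, true_quality), 2))
```

With one judgement per script the recovered order correlates only loosely with the underlying quality; by twenty judgements per script it should be very close, which is the intuition behind the reliability figures quoted above.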
Another criticism I’ve spotted is that CJ is new and a fad. This too is wrong: the idea has been around for decades.
One final point. Assessment is one of the least well-understood and most important aspects of education. Every teacher ought to have a working knowledge of the concepts of reliability and validity. Dylan Wiliam, in typically bullish form, argues that, “it would be reasonable to say that a teacher without at least some understanding of these issues should not be working in public schools.”
I hope that’s clarified some of the misunderstandings out there. If there are any others, please add them to the comments and I’ll address them there.
I can think of another possible issue: teacher ego.
If as a HOD I line up GCSE practice assessments from different classes and rank them using comparative judgement I may be implicitly suggesting that Teacher X is doing a better job than Teacher Y.
Any advice on how to take the ego out of it? (I think CJ is a great idea by the way.)
Well, if classes are mixed ability and all Teacher X’s students are ranked higher than Teacher Y’s, then why *wouldn’t* you want to know that? This doesn’t necessarily mean Teacher Y is performing poorly, but it does indicate that some sort of intervention is almost certainly required.
I agree David, if the data suggests a potential weakness in staffing then that should trigger closer examination. It doesn’t mean that it should be the only factor examined though. The key is knowing your staff well.
Of course it depends what you mean by “intervention” but if the scores of students taught by teacher X are judged higher by comparative judgment than those taught by teacher Y, then it could be that teacher X is a more effective teacher than teacher Y, but there are other interpretations. For a start, teacher X and teacher Y may be trying to achieve different things. The rank order emerging from comparative judgment scoring depends on a relatively coherent community of interpreters. As a concrete example, Hugh Morrison, from Queen’s University Belfast, found that there were systematic differences between grammar school and secondary modern school teachers in what they valued in students’ work. It could be that teacher Y is trying to develop skills that the majority of those doing the comparative judgement do not value. This is the fundamental weakness of comparative judgement. The statistical techniques require a relatively high degree of coherence in the judges about what they value.
Hi Dylan
1) You’re right that higher scores cannot be taken at face value, but it’s not unreasonable that they prompt a conversation.
2) “The statistical techniques require a relatively high degree of coherence in the judges about what they value.” They do indeed – this is why it’s important to track the ‘infit’ score to make sure no judge is out of line with the others (there’s a toy illustration of the idea after these points). If a judge were out of line it would, again, prompt a conversation.
3) Are you assuming that teachers and judges are necessarily the same people? There’s no reason why they should be. If the judging is shared among many judges, whatever an individual teacher values becomes irrelevant as a consensus emerges about what is ‘best’. Judging biases should aggregate out.
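To make the infit idea concrete, here is a deliberately simplified sketch. The judgement data, the consensus scores and the flagging threshold are all invented, and real CJ engines compute a proper Rasch-style infit statistic rather than this stripped-down mean-square, but the principle is the same: flag judges whose decisions consistently contradict the consensus scale.

```python
import math
from collections import defaultdict

# (judge, winner, loser) triples -- invented example data
judgements = [
    ("judge_a", "s1", "s2"), ("judge_a", "s2", "s3"),
    ("judge_b", "s1", "s3"), ("judge_b", "s3", "s1"),  # judge_b contradicts the scale
]

# Consensus script scores, e.g. from a Bradley-Terry fit (values invented)
scores = {"s1": 1.2, "s2": 0.4, "s3": -0.8}

def p_win(winner, loser):
    """Probability the consensus scale assigns to the observed outcome."""
    return 1 / (1 + math.exp(scores[loser] - scores[winner]))

misfit = defaultdict(list)
for judge, winner, loser in judgements:
    p = p_win(winner, loser)
    # squared standardised residual of an observed win: (1 - p)^2 / (p(1 - p))
    misfit[judge].append((1 - p) ** 2 / (p * (1 - p)))

for judge, rs in misfit.items():
    infit = sum(rs) / len(rs)          # mean-square misfit for this judge
    flag = "<- out of line with the other judges?" if infit > 1.5 else "OK"
    print(judge, round(infit, 2), flag)
```

Here judge_b, who judged the same pair both ways, comes out with a high misfit score and would prompt exactly the conversation described above.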
Really helpful article, thank you. The most common misconception I come across is that all you get from CJ is a rank order! You get far more than that. You get a measurement scale that allows you to place pupils from a wide range of ability on a fixed scale that remains constant over time. We like to talk about measuring across time and space. Across time, because you can track progress against a fixed scale, and across space, because you can share your scale with other schools, and they can measure themselves against your scale.
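For anyone wondering what a ‘measurement scale’ means in practice, here is a minimal sketch of how pairwise wins become scale values, using the classic minorisation-maximisation update for a Bradley-Terry model. The data is invented, and real CJ tools typically use a Rasch formulation, so treat it as illustrative only; linking ‘across time’ works by reusing anchor scripts with known scale values in each new round of judging.

```python
from collections import Counter

# (winner, loser) pairs -- invented data; in practice some of these scripts
# would be 'anchors' reused across years so that scales can be linked over time
pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "A"), ("C", "B")]

items = sorted({s for pair in pairs for s in pair})
wins = Counter(winner for winner, _ in pairs)
strength = {i: 1.0 for i in items}       # Bradley-Terry strengths

for _ in range(200):                     # MM iterations until convergence
    new = {}
    for i in items:
        denom = 0.0
        for winner, loser in pairs:
            if i in (winner, loser):
                other = loser if winner == i else winner
                denom += 1 / (strength[i] + strength[other])
        new[i] = wins[i] / denom
    # normalise: the origin of the scale is arbitrary, so fix the total
    total = sum(new.values())
    strength = {i: v * len(items) / total for i, v in new.items()}

for i in sorted(items, key=lambda s: -strength[s]):
    print(i, round(strength[i], 3))      # a scale value, not just a rank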
Thanks Chris – a useful addition.
Hi David, I’m assuming that this is a blog in response to my blog since you left me a link in my blog replies. In answer to your points:
1. It’s ranking that’s the problem for me, not issues of ‘accuracy’. Having said that, you run the risk of aggregating bias (e.g. against scruffy handwriting) as I’ve explained here: https://suecowley.wordpress.com/2016/07/07/what-does-good-writing-look-like/
2. I’d probably say that I find it unpleasant and unnecessary, rather than cruel and unfair. But let’s say you had a child with significant literacy SEND. How would you feel about ranking them last every time? Do we rank teachers, and how would you feel about that? Are you in favour of PRP?
3. I didn’t say this. This is what SATs ranking does and that’s what I was referring to in the blog.
4. If you read my blog carefully, you’ll notice that I used the term ‘comparative assessment’ when talking about ranking effort and only used the term ‘comparative judgement’ at the end of that paragraph. I have come across schools that rank effort, and this is all part of the idea that ranking is useful, which I would dispute. I don’t think that CJ ranks effort, although I do think it has the potential to because of how we tend to react to children’s writing.
5. It’s obviously about data, or otherwise it wouldn’t be done in the way it’s done. That doesn’t mean it is ‘just’ about data.
6. As a parent I think it’s worse than levels. I made it clear that I was talking as a parent in my blog. My parent friends are very confused about the SATs results/rankings.
7. Either you’re focusing on ‘the writing’ as an entity or you’re focusing on ‘the child’. I don’t think you can have it both ways.
8. I don’t, thanks. You are welcome to do that if you want but I find it unhelpful.
9. If learning ceases to be about children, and we view it in too abstract a manner, we run the risk of losing sight of what really matters. It’s all very well to say you’re ranking ‘the work’ but teachers don’t generally separate the two in their minds.
10. I didn’t say this but I do think that marking is very valuable, and a company called ‘no more marking’ obviously disputes this. I find that odd.
Thanks, Sue.
Hi Sue – yes, I left a link to this on your blog as I felt it addressed some of what I considered to be misconceptions on your part. Please note, though, that I have very carefully avoided stating that you have said any of these things explicitly.
Thanks for taking the trouble to respond. I will try to address each of your points separately:
1. All assessment ranks children, and all assessments of written responses fall victim to handwriting bias, as I explained in the post. To object to ranking is to object to assessment. Is this your position?
2. If you think it’s unnecessary to rank children’s work, how will you know how they are doing? Donald Laming explains at great length in Human Judgement that, contrary to what we may believe, human beings are exceptionally poor at judging the quality of a thing on its own. We generally know whether we like something but we struggle to accurately evaluate just how good or bad something is. It’s much easier for us to compare two things and weigh up the similarities and differences. This means we are often unaware of what a ‘correct’ judgement might be and are easily influenced by extraneous suggestions. This is compounded by the fact that we aren’t usually aware we’ve been influenced. You appear to think that 1) the rank order produced must be shared with children and 2) that if this were done it would be somehow brutal and demotivating. Sharing a rank order with children – especially young or vulnerable children – is probably a bit stupid. Sharing marks is to some extent a statutory requirement, but I would certainly not advise teachers to routinely share marks with children. You also introduce the idea of ranking teachers. Firstly, this would probably be fairer than subjective evaluations but is almost certainly impractical: how would you be able to examine two teachers’ performances simultaneously? I’m not sure what any of this has to do with PRP, but no: I’m not in favour of PRP.
3. I never said you did say it.
4. I did notice you used the term ‘comparative assessment’, but as this post is not a direct response to your blog that seems irrelevant. Not only am I against the idea of ranking effort (mainly because it’s impossible), I also think awarding an effort score is a highly subjective and pretty dubious practice.
5. Using the word ‘obviously’ in this context is lazy. It’s also incorrect. I’ve conducted many comparative judgements with teachers in which no data was collected. We just used the rank order to talk about the quality of the work.
6. As a parent I think Levels are terribly confusing. My Yr 7 daughter brought home her school report yesterday and her school are continuing to use levels to assess students. She has been told that she is working at a 6a in maths, 6b in science, 6c in English and 4a in Spanish. This all sounds great, but then she’s also been told she’s a 3a in Geography and a 4c in History. This is incomprehensible to me (and very demotivating for her) and I’ve asked the school to give me an appointment to speak to them about it. Now, if instead I was told that her work had been compared to the national standards for Year 7s (something you can do with CJ) and was currently of the standard expected of a student who would go on to be awarded a 7 at GCSEs then I’d know something meaningful. To be clear, I’m not in any way defending the changes in SATs assessment.
7. You say, “Either you’re focusing on ‘the writing’ as an entity or you’re focusing on ‘the child’. I don’t think you can have it both ways.” I think you may have either misread or misunderstood what I wrote. What I actually said in the post is, “CJ focuses teachers’ attention on the work students produce.” At no point do I claim it’s focussing on the child. In fact I’ve made it very clear that CJ cannot focus on the child.
8. You have no interest in knowing whether your children are making progress academically? Well, fair enough, but I think that’s unusual. As I said, most parents do want this information.
9. I quite agree. The great thing about CJ is that it prevents teachers from conflating ‘the work’ with the child. It also prevents us from viewing learning in “too abstract a manner”.
10. I know you didn’t say this. Again, I never claimed you did. I too see some value in marking and so does Chris Wheadon at No More Marking. It would indeed be odd if we saw no value in marking, but seeing as we do, there’s nothing for you to find odd.
Finally, it’s worth reading Chris’s comment above. He says that one of the misconceptions he encounters is “all you get from CJ is a rank order! You get far more than that. You get a measurement scale that allows you to place pupils from a wide range of ability on a fixed scale that remains constant over time. We like to talk about measuring across time and space. Across time, because you can track progress against a fixed scale, and across space, because you can share your scale with other schools, and they can measure themselves against your scale.”
I hope that helps, David
Having experimented with CJ last year and produced some results, I think those comments are very much in line with what I found. It is worth pointing out, for point 2, that this was also contentious amongst colleagues who trialled this with me, a comment being that CJ did not consider the relative starting point of the child, so it seemed to disadvantage SEND students, but I did not find that to be wholly true. Freed of the NC level ladder, it was possible to see merit in a student’s work that might be missed by only looking for the appropriate criteria of levels and sub-levels.
A further point on the ranking is that I also trialled this as a student self- and peer-assessment tool. In setting this up for students to log in and post their submissions anonymously, there was a notable spike in interest in improving their work, both from those who found their rankings at the top and wished to stay there and from those who found themselves at the bottom and wished to improve. As pupils themselves were making judgements of others against collective success criteria, the ranking system here was useful – it was motivational and completely outside of the decision-making of the teacher.
I also want to respond to the point about the validity and reliability of marking. I have seen a reluctance to give up the power and authority marking confers within the student-teacher relationship. Apparently, this could never replace traditional marking. Why not? CJ proved to be more reliable, and in that sense supported stronger professional judgements. The nomenclature of assessment has created a system which is arguably less about the learning of children and more about the measurable attainment of departments, which may often have narrow curricula to achieve this. I think there is a great strength to CJ in that it allows for a more nuanced consideration of the content of an answer, not just the skills. In that sense it rewards and respects the professional judgement of staff, can give a more reliable outcome across a group, and can also set up a meaningful conversation with a pupil about next steps to improve. When the outcome of CJ is that the pupil now has a host of samples to consider in comparison with her/his own, there is more chance they will replicate the concrete improvement needed from their own observations.
May I add another potential criticism? Comparative judgement not only does not, but cannot, provide feedback, nor a rationale for grades/rankings. The outcome is the result of numerous comparisons between pieces of work, and the only information teachers and pupils get from it is that a particular piece of work was ranked at a particular relative position. It does not explain why. Moreover, no individual judge in the system can explain why. No individual human agent can justify the grade or rank.
There is no mechanism by which, even across the work as a whole, one can explain what leads to better or worse positions in the ranking. Even with large scale traditional examinations, chief examiners’ reports explain what led to better or worse scores allowing teachers to work on ideas and misconceptions in subsequent years. With a CJ system, there is nothing within the assessment system which can provide that insight. That is, CJ does not allow anyone to learn from the assessment.
You’re right – this is indeed another misconception. Of course judging in itself cannot provide feedback any more than any other means of assessing the quality of work provides feedback. This is a category error. All an examiner’s report provides is an explanation of inferential bias rather than anything inherently meaningful.
However, the process of judging is more amenable to providing feedback than assessment using a mark scheme, as the judge is one step closer to the work itself rather than messing about with a proxy. Then, once work has been judged, a teacher can give as much or as little feedback as they please.
The misconception is, I’m afraid, yours in this case. When an individual assesses a piece of work against a mark scheme, they are generating the skeleton of some feedback (though whether they choose to flesh it out and share it is up to them and the system in which they work). When 100 judges compare various pieces of work against their vague societally shared notion of, say, “creative” (or whatever criterion they are judging against), they simply make binary decisions. The individual judge is in no position to explain why one piece of work scored better than another and thus provide feedback.
By talking of ‘the judge’, you conflate individual judges (not a single one of whom may have ranked the work in the same order as the final ranking) with the hive mind of the 100 judges. And by separating the judges’ judging from the teacher giving feedback, one both increases the workload involved in the process (something CJ aims to reduce) and might lead to the production of feedback at odds with the judging (that is, teachers may give feedback which could result in lower scores or rankings).
Sorry to contradict you, but the misconception is most certainly yours. When an individual assesses a piece of work against a mark scheme they are reducing the vastly complex realm of expert performance to a few vague bullet points of indicative content. If you feel this constitutes or provides useful feedback then you’re mistaken. If it is feedback it is of the most impoverished kind. Mark schemes are by their nature reductive and cannot ever adequately explain what constitutes quality work.
You’re correct to say a binary decision does not and cannot explain the decision, and that is the point. Attempting to reduce performance to something that can be clearly articulated is a mistake. You’re wrong to imagine I’m conflating an individual with an aggregate of judgements. I’m not at all. (Better perhaps to refer to judgements rather than judges?) When judgements show that one piece of work is consistently considered better than another, then we are in a position to give valuable and useful feedback about why this might be the case. Why you think this might add to workload is beyond me. When I’ve overseen this process with teachers and students alike, they are able both to make judgements and to better understand what constitutes quality work in a fraction of the time it might take to ‘mark’ an essay, explain why the essay was awarded this mark and then offer advice on how to achieve a higher mark.
But don’t take my word for it: have a go at comparing CJ with traditional marking and see how liberating it is for staff and students.
I have used and abandoned CJ because both staff and students have no way of knowing why a particular pair of pieces of work have been judged in their relative order. The only feedback that CJ gives is a mark (or rank ordering) and no individual human being can even theoretically justify that mark. When student A asks ‘why did I get a 48, while B got a 52’ the only answer from CJ is ‘the computer says so’. If I might borrow your ad hominem tone for a moment: if you think that constitutes ethical and appropriate assessment, then you’re mistaken.
So in order to provide feedback, the staff then have to read all the pieces of work and try to produce post-hoc justifications for why piece A was ranked lower than piece B. That is, they are putting in the same amount of effort as they would previously have put into marking the work, but they have no way of knowing that their post-hoc justification for the relative grade will result in feedback to the student which will improve performance. They might end up telling student A that they might have got a low score because they didn’t draw their argument together to give a definite conclusion, but we have no idea if the CJ hive mind does indeed value definite over hedged conclusions.
So one ends up with two processes: one to provide the judgement (and the most degenerate level of feedback in the form of an unjustifiable mark) and one to provide more useful feedback. That’s why we found it more work. Worse, the CJ process is totally encapsulated from the feedback one: those providing feedback cannot know what factors were important or unimportant in the development of the judgements.
I agree that mark schemes can be reductive and lead to shallow and procedural thinking being rewarded, but that’s a problem which is better solved by better mechanisms for assessing and providing feedback on work, not by replacing it with a black box judgement procedure informationally divorced from any but the most low level and unjustifiable feedback.
OK, well I’m sorry you haven’t found CJ useful. I continue to find it an excellent way of doing something impossible to do with conventional marking.
You’re right to say it isn’t a tool designed to give feedback. Neither does it make cups of tea or do the vacuuming.
Very interesting to read the comments, both for and against CJ. Our school is about to trial CJ across the school (KS1 & KS2), though, I must admit, there are still many unanswered questions floating around my mind.
The most important one: how do you know if a child is making progress?
Now, is CJ designed to answer this question? I’m sure (and hope) the answer is yes! But if you, David, or anyone else wouldn’t mind clarifying my understanding, I would be most grateful…
As I understand it, we can set ‘anchors’, or ‘standards’, for expectations of writing. When children are judged, CJ will give a ranking in relation to the class. Will it also provide info against these anchor points?
So if child X is ranked last in class in the Autumn term and then again in the Spring, yet their writing has improved along with everyone else’s, then how would progress be measured? I’m sure I’ve missed something, so thanks in advance for helping.
What if Teacher X is teaching a lower-ability class? It stands to reason the quality of responses will be different?
Why does that stand to reason? I can’t see why the students you teach would warp your ability to judge the work of other students.