Another one from Teach Secondary, this one from their assessment special. This time it’s an overview of Comparative Judgement.

Human beings are exceptionally poor at judging the quality of a thing on its own. We generally know whether we like something, but we struggle to evaluate accurately just how good or bad a thing is. It’s much easier for us to compare two things and weigh up the similarities and differences. Because we rarely know what a ‘correct’ judgement might be, we are easily influenced by extraneous suggestions. This is compounded by the fact that we aren’t usually even aware we’ve been influenced.

Some assessments are easy to evaluate; the answer is simply right or wrong. There can be great advantage in designing multiple-choice questions that allow us to make good inferences about what students know, as well as giving us a good idea of their ability to reason and think critically. But however well designed, this type of question struggles to assess students’ ability to synthesise ideas, consider evidence and pursue an analytical line of reasoning at any length. Consequently, many school subjects assess – at least in part – through extended written answers. Evaluating the quality of an essay is a difficult job, so we produce rubrics – mark schemes – to indicate how a student might be expected to respond at different mark boundaries. Teachers then read through each essay and attempt, as best they can, to match its content to that indicated in the mark scheme.

The problem is that we are very bad at doing this. The psychologist Donald Laming says, “There is no absolute judgement. All judgements are comparisons of one thing with another… comparisons are little better than ordinal”. What this means is that we can reliably put things into a rank order, but that’s about it. Mark schemes give the appearance of objectivity, but in actual fact, when teachers mark a set of essays they often find that halfway through they come across an essay that is much better or worse than all the ones they’d marked to that point. This results in them going back to change earlier marks so that the new essay can be ranked according to its merits.

Understanding students’ performance depends on huge amounts of tacit knowledge. Because it’s tacit it’s very hard to articulate – even (maybe, especially) for experts. In our attempts to break down what experts do we spot superficial features of their performance and make these proxies for quality. For instance, it may well be that a good writer is able to use fronted adverbials and embedded relative clauses, but they would never set out with this as their goal. By looking for these proxies we run the risk of limiting both our own understanding and students’ ability.

A solution is to do away with mark schemes and use instead a system of Comparative Judgement. Judging is different to marking in that it taps into our fast, intuitive modes of thinking – what Daniel Kahneman has called System 1. Marking, on the other hand, relies on the slow, analytical thinking that characterises System 2.

The idea is that a judge looks at two essays at once and makes an intuitive judgement about which is better. These judgements are then aggregated (the rule of thumb for reliable results is 5 × the number of scripts) to form a highly reliable rank order of students’ work. The advantages for teachers are that the process is not only significantly quicker than traditional marking (each judgement should take about 30 seconds) but also much more reliable, allowing us to make better inferences about students’ ability.
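To make the aggregation step concrete, here is a minimal sketch in Python of how a set of (winner, loser) judgements can be turned into scores and a rank order. The essays and judgements are invented for illustration, and the bradley_terry helper is an assumption, not any particular tool’s implementation; platforms such as nomoremarking.com fit a statistical model of broadly this Bradley-Terry/Thurstone kind, but the details below are only a sketch.

```python
from collections import defaultdict

def bradley_terry(judgements, n_iter=100):
    """Turn pairwise (winner, loser) judgements into a quality score per script.

    Uses the simple minorisation-maximisation update for a Bradley-Terry model:
    higher scores mean a script tended to win its comparisons.
    """
    scripts = {s for pair in judgements for s in pair}
    wins = defaultdict(int)    # how many comparisons each script won
    pairs = defaultdict(int)   # how often each unordered pair was compared
    for winner, loser in judgements:
        wins[winner] += 1
        pairs[frozenset((winner, loser))] += 1

    scores = {s: 1.0 for s in scripts}
    for _ in range(n_iter):
        updated = {}
        for i in scripts:
            denom = sum(
                pairs[frozenset((i, j))] / (scores[i] + scores[j])
                for j in scripts
                if j != i and frozenset((i, j)) in pairs
            )
            updated[i] = wins[i] / denom if denom else scores[i]
        total = sum(updated.values())
        scores = {s: v * len(scripts) / total for s, v in updated.items()}
    return scores

# Illustrative judgements over four essays, A-D
judgements = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "A"), ("D", "B"), ("C", "B")]
scores = bradley_terry(judgements)
for script in sorted(scores, key=scores.get, reverse=True):
    print(script, round(scores[script], 2))
```

Real tools add refinements (choosing which pairs to show next, pooling several judges, estimating reliability), but the core idea is the same: many small comparisons add up to a single scaled rank order.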

There are some common misconceptions to consider. First, there are concerns about accuracy. It’s understandable why we might feel sceptical that a fast, intuitive judgement can tell us as much as slow, analytical marking. Surely spending 10 minutes poring over a piece of writing, cross-referencing it against a rubric, has to be better than making a cursory judgement in a few seconds? On one level this may be true: reading something in detail will obviously provide a lot more information than skim reading it. There are, however, two points to consider. First, is the extra time spent marking worth the extra information gained? That depends. What are you planning to do as a result of reading the work? What else could you do with the time? Second, contrary to our intuitions, the reliability of aggregated judgements is much greater than that achieved by expert markers in national exams. The reliability of GCSE and A level marking for essay-based examinations is between 0.6 and 0.7, which indicates a 30–40% probability that a different marker would award a different mark. Hence so many papers have their marks challenged every year. But if we aggregate a sufficient number of judgements (5 × n), we can end up with reliability above 0.9. Although any individual judgement may be wildly inaccurate, in aggregate they produce much more accurate marks than an expert examiner.
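To put some illustrative numbers on the time trade-off: the 30-second and 10-minute figures come from the paragraph above, but the class size of 30 is a hypothetical assumption.

```python
scripts = 30                       # hypothetical class of 30 essays
judgements_needed = 5 * scripts    # the 5 x n rule of thumb -> 150 judgements
judging_minutes = judgements_needed * 30 / 60   # ~30 seconds each -> 75 minutes
marking_minutes = scripts * 10                  # ~10 minutes per essay -> 300 minutes
print(f"{judgements_needed} judgements take about {judging_minutes:.0f} minutes, "
      f"versus {marking_minutes} minutes of rubric marking")
```

The judging load can also be spread across several judges, which is harder to do fairly with rubric marking.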

For subject areas where this kind of comparison can easily be made, it puts an end to conversations revolving around sub-levels of progress, or predicted grades, as if they actually meant something concrete. All assessments provide us with a proxy; the point is whether or not it’s a good proxy. CJ allows us to make better inferences about learning as an abstract thing because it’s so focussed on the concrete. The absence of rubrics means we are one step nearer the thing itself. Not having a rubric also means we are likely to get a more valid sample of students’ ability within a domain. Because a rubric depends on attempting to describe indicative content, it warps both teaching and assessment: teachers use mark schemes to define the curriculum, and examiners search for indicative content while ignoring aspects of great work that didn’t make it into the rubric.

Another concern is that the presence of a systematic bias might reduce the accuracy of the process (after all, a measure can be reliable but still be invalid). However, traditional teacher assessment is at least as vulnerable: as investigations into the ‘Halo effect’ have consistently shown, teachers are more likely to judge the child than their work. We are all inadvertently prone to biases which end up privileging students based on their socio-economic background, race and gender. Whilst concerns that seemingly irrelevant aspects of students’ work – such as the quality of handwriting – might affect comparative judgement are fair, these biases also affect every other form of marking. If anything, comparative judgement is less unfair than marking.

In an ideal world, maybe teachers would put the same effort into reading students’ work as students put into creating it. Sadly, this thinking has led to a steady rise in teachers’ workload and mounting feelings of guilt and anxiety. No teacher, no matter how good, will ever be able to sustain this kind of marking for long. But maybe we’ve been asking the wrong question. Maybe instead we should ask: if students have put all this effort into their work, is it fair that we then assess it so unfairly and unreliably?

It’s worth noting that the 30-second intuitive judgement is only desirable during the judging process. Once a rank order has been obtained, teachers can use the time saved to explore much more interesting and personal aspects of the writing, especially where judges have made different judgements of the same piece of work.

In essence, comparative judgement is a way of making quick and reliable summative assessments. To provide meaningful formative feedback, of course, you still have to spend time actually reading the work.

Tips for using Comparative Judgement

  • The low-tech version is to spread out essays on a table and shuffle them until you are happy that they are in the correct rank order. This is still hugely quicker than using a mark scheme.
  • The hi-tech version is to use the free software at nomoremarking.com to upload your essays, which allows you not only to arrive at a rank order but also to assign scores.
  • If this all sounds a bit laborious, you can pay someone else to do all the admin for you by signing up here.
  • By doing a baseline assessment at the beginning of Year 7 and using some of the scripts from this assessment as ‘anchors’ for a second assessment, you can objectively demonstrate how much progress students have made over the year.
  • You can even use comparative judgement to assess students’ ability in subjects like maths by asking open questions like “What are prime numbers for?” or “Why are brackets useful?” and then comparing students’ answers.
  • To find out more about how to use Comparative Judgement in your school, see these blogs: Proof of Progress Part 1 and Part 2.