Who’s better at judging? PhDs or teachers?

In Part 1 of this series I described how Comparative Judgement works and the process of designing an assessment to test Year 5 students’ writing ability. Then in Part 2 I outlined the process of judging these scripts and the results they generated. In this post I’m going to draw some tentative conclusions about the differences between the ways teachers approach students’ work and the way other experts might do so.

After taking part in judging scripts with teachers, my suspicion was that teachers’ judgements might be warped by the long habit of relying on rubrics to assess students’ work. All too often we end up teaching what’s on the rubric and missing out on other, tacit, features of performance; we sometimes end up rewarding work which meets the mark scheme’s criteria even if we think it’s a bit ropey. Likewise, some students may be penalised because, although they write well, their work doesn’t obviously display the features a marker is primed to look for. I wanted to see whether other ‘experts’ – in this case PhDs in creative writing – would judge our scripts differently.

For the most part, the results were fairly consistent:

A comparison of teachers’ judgements with PhD students’ judgements of Yr 5 students’ writing
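The post only eyeballs the scatter graph, but if you wanted to put a number on how consistent the two panels were, a rank correlation would do the job. Here’s a minimal sketch, assuming the two rank orders are available as lists; the ranks below are invented purely for illustration and are not the real data:

```python
# Illustrative only: quantify agreement between two judging panels' rank orders
# using Spearman's rank correlation. These ranks are made up, not the study's data.
from scipy.stats import spearmanr

# Hypothetical positions of the same six scripts in each panel's rank order
teacher_ranks = [1, 2, 3, 4, 5, 6]
phd_ranks     = [2, 1, 3, 6, 4, 5]

rho, p_value = spearmanr(teacher_ranks, phd_ranks)
print(f"Spearman's rho: {rho:.2f}")
# A rho close to 1 means the two panels produced very similar orderings;
# outliers like the four students discussed below are what drag it down.
```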

As you can see, there are a number of outliers, four of which I’ve selected to examine in more detail. Here is a breakdown of their scores:

[Score breakdown for the four selected students]

Let’s first compare Students 1 and 2. Both sets of judges ranked these pupils near the bottom, but Student 1 was ranked more highly by the PhDs while Student 2 did better with the teachers.

[Scripts from Students 1 and 2]

Student 1’s writing is hard to read and, as a result, they may have fallen victim to a handwriting bias, but what was it that the PhDs saw that the teachers missed? For ease, here’s a typed version of Student 1’s answer:

it was(?) the goblet lay close by, so close he simply could not resist it. He snatched it before ran for his life. And even as he ran an idea came into his head. This gift for his master, the(?) goblet would be a perfect (?)

But the dragon through his perfect(?) scales had felt the loss of the treasures.

Conversely, Student 2’s script is much easier to read and, despite the poor sentence construction and lack of subject-verb agreement, there’s a nice, easy-to-spot simile. Were the teachers seduced by its very superficial charms? On closer inspection it does seem that while neither answer is of an acceptable standard, Student 1’s is definitely better.

Now let’s have a look at the two top-end answers:

[Scripts from Students 3 and 4]

At a glance we can see an obvious lack of paragraphing in Student 3’s answer, which may have biased the teachers against it. Are the PhDs right to look beyond this and reward the content? The paragraphing in Student 4’s answer is a little weak. The ellipsis means the first paragraph trails off with a disappointing lack of drama. Then, there’s a missing break between the slave escaping a flogging and the reappearance of the dragon. Both answers have some nice touches and both are let down by moments of clunkiness. Student 3’s opening clause is well-written but undercut by “he stood up in a hurry, but silently.” Likewise, Student 4’s piece opens very strongly – “cold, selfish eyes” is excellent – but falls flat with “humongous”.

Moving from the judging mindset (fast, intuitive, tacit) to the marking mindset (slow, analytical, looking for indicative content), it’s really hard to split these two. And that’s the point: a mark scheme may give us a comforting sense of objectivity, but deciding which of these two is better is a completely subjective process. On balance I prefer Student 4. Interestingly, as you can see from the scatter graph, Student 3 was ranked top by the PhD students, with 4 other scripts placed above Student 4, whereas the teachers ranked Student 4’s script as the best, with about 9 scripts ranking higher than Student 3.

Rather than revealing anything about the different judges, this might suggest that all these scripts are of similar quality and would end up being awarded a similar grade. The problem would come if one or other script was near a boundary. We hadn’t chosen to grade the work – we just decided on one boundary – good enough/not good enough – and made sure we were happy that the scripts above and below were in the right place.

So, what can we learn from this process? Without systematically going through the entire sample, probably nothing. Apart from the outliers above, the rest of the scripts were close enough not to make much difference to the rank order. Maybe the one thing this does throw up is the problem of reading very messy handwriting. Judgements are meant to be made in about 30 seconds, and this simply wasn’t enough time to read Student 1’s response. (It took me almost 5 minutes to puzzle it out!) While we can see at a glance that it will rank near the bottom, this kind of intuitive judgement may miss important aspects of the writing. The solution might be to type up any scripts which are likely to cause problems.

My suggestion is that if judgements are to be in any way high stakes, we need to analyse the scripts that fall at any boundaries or cut-offs that may be applied to a rank order. In this way we’ll get the best of both worlds: fast, intuitive judgements for all, with slower, analytical processing of the scripts most likely to present challenges.
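To make that concrete, here’s a minimal sketch of what ‘analyse the scripts at the boundary’ might look like in practice, assuming you have the final rank order (best first) and a rough idea of where the cut-off falls. The function name, script IDs and window size are all hypothetical, not anything we actually used:

```python
# Hypothetical helper: flag the scripts whose rank puts them close to a cut-off,
# so they can be re-read slowly and analytically rather than judged in 30 seconds.
def scripts_to_review(ranked_scripts, cutoff_index, window=3):
    """Return the scripts sitting within `window` places of the cut-off."""
    start = max(0, cutoff_index - window)
    end = min(len(ranked_scripts), cutoff_index + window)
    return ranked_scripts[start:end]

# Invented rank order, best first; the good enough / not good enough boundary
# is taken to fall after the fourth script.
ranked = ["S07", "S12", "S03", "S01", "S09", "S15", "S04", "S02"]
print(scripts_to_review(ranked, cutoff_index=4, window=2))
# -> ['S03', 'S01', 'S09', 'S15']
```

Everything else keeps its fast, intuitive judgement; only the handful of scripts either side of the line get the slower treatment.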