The thing that seems to rile people most about testing is that it puts children under stress. A certain amount of stress is probably a good thing – there’s nothing as motivating as a looming deadline – but too much is obviously a bad thing.

Martin Robinson writes here that

… a teacher needn’t pass undue exam stress onto her pupils, and a Headteacher needn’t pass undue stress onto her teachers. People work less well under a lot of stress; by passing it down the chain, each link ceases to function so well. Therefore if a school wants to perform well, they should do a lot to take the pressure off. This is not done by telling children they needn’t be stressed by tests … it is done by letting the tests come and go with as little rancour as possible. How the tests have been introduced by the DfE and the content of the said tests is open to question but increasing panic throughout the system doesn’t help. How to not panic so much? Well, maybe more testing, low stakes, as part of regular teaching and learning could help.

All this is, of course, true. I wrote here that it’s not testing itself but the stakes attached to tests that cause stress. Another problem is that a test is only as good as the purpose for which it is designed. The Key Stage 2 SATs pose a particular problem: what are they for? Are they to assess how well children have mastered the curriculum, or to hold schools to account for how well they’ve covered it? Or both?

We all know that tests should be both valid and reliable, right? It’s even more important to understand that validity is not actually a property of a test, but of the inferences we make from test outcomes. Dylan Wiliam tells us that “test users have a responsibility to determine whether the validity evidence produced by the developer does in fact support the chosen use of the test.” In other words, are we using tests for the purpose for which they were designed? If not, it’s our own fault if we get meaningless results. If we use a test designed to assess children’s knowledge in order to hold schools to account, we’re bound to end up with invalid inferences. And the higher the stakes, the less valid those inferences are likely to be.

The reliability of a test is just as important as its validity. We need to trust that results allow us to make meaningful inferences about students’ knowledge, skills and abilities, and the higher the stakes, the more important precise measurement becomes. This is particularly important when test results are presented as grades, bands or levels. Generally speaking, we don’t seem sufficiently alert to ‘spurious precision’. If a student gets 88% on a test we’re likely to believe they’re more able than someone who gets 87%, but this might not be true: such small differences tell us relatively little about a person’s ability. Equally problematic is ‘spurious accuracy’. We’re likely to infer that a student with a B is more able than a student with a C, yet this difference in grades might represent a difference of one mark or less. As long as this information isn’t being used to blight anyone’s employment prospects there’s no problem, but the more the results matter, the more likely we are to draw mistaken conclusions. If a cohort of students gets 87% instead of 88% we’re unlikely to notice; if they all get Cs instead of Bs, suddenly someone’s job is on the line.
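
To make the ‘spurious accuracy’ point concrete, here’s a minimal sketch (the grade boundaries and marks are invented for illustration, not taken from any real exam) of how a one-mark difference in raw score becomes a whole-grade difference once boundaries are applied:

```python
# Hypothetical grade boundaries: minimum raw mark (out of 100) -> grade.
GRADE_BOUNDARIES = [(70, "A"), (60, "B"), (50, "C"), (40, "D"), (0, "U")]

def grade(raw_mark: int) -> str:
    """Convert a raw mark into a grade using the invented boundaries above."""
    for minimum, label in GRADE_BOUNDARIES:
        if raw_mark >= minimum:
            return label
    return "U"

student_a, student_b = 60, 59   # one mark apart - well within measurement error

print(student_a, grade(student_a))  # 60 B
print(student_b, grade(student_b))  # 59 C - a whole grade apart on one mark
```

The raw scores are effectively indistinguishable, but the grades invite a conclusion the underlying measurement can’t support.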

The answer is not to use SATs (or GCSEs) to hold schools to account. Tim Oates suggests intelligent sampling may be the way forward: “The Finnish State has a history of testing too: tests from the centre, not to all children but to a sample, for the state to make judgements about the quality of schooling in the country.” There’s no need to test every student, just a representative sample. And because the test is only taken by a sample, the results are meaningless at an individual level; students are unlikely to be anywhere near as bothered by them as they would be by a test with high individual stakes. Of course, the stakes might still be high for teachers and schools, but as that pressure can’t be passed on to students it won’t warp the curriculum in the way the current system does.
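
As a rough illustration of what ‘a representative sample’ might mean in practice, here’s a sketch of drawing a stratified sample from a cohort; the 5% figure and the grouping variable are my own assumptions, not anything Oates prescribes:

```python
import random

def representative_sample(cohort, group_of, fraction=0.05, seed=1):
    """Draw roughly `fraction` of the cohort, stratified by `group_of`
    so the sample mirrors the make-up of the whole year group."""
    rng = random.Random(seed)
    groups = {}
    for student in cohort:
        groups.setdefault(group_of(student), []).append(student)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# e.g. stratify by pupil-premium status (an invented field):
# sample = representative_sample(cohort, group_of=lambda s: s["pupil_premium"])
```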

If I were leading a school, this is what I would do:

  1. Identify a representative sample of students – maybe about 5% of the cohort.
  2. Give them a benchmark assessment in September, using comparative judgement to produce a ‘true score’ for the students in the sample (a sketch of how the judging might be scored follows this list).
  3. Identify a second, similar, sample of students.
  4. Give them a similar test in July using some of the September scripts to anchor the judgements and show whether progress is being made across the cohort.
  5. Act on the results! (This is the hard bit.)
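
For steps 2 and 4, here’s a rough sketch of how the judging might be scored. Judges compare pairs of scripts and pick the better one; a simple Bradley-Terry model then turns the win/loss record into a scaled score per script, and mixing some September scripts into the July judging pool is what anchors the two windows. The judgement data and script names below are invented, and this is not the method of any particular comparative judgement tool – just the general idea:

```python
import math
from collections import defaultdict

def bradley_terry(judgements, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict mapping each script id to a log-strength 'true score'.
    Assumes every script wins at least one comparison.
    """
    wins = defaultdict(int)                         # total wins per script
    games = defaultdict(lambda: defaultdict(int))   # comparisons per pair
    for winner, loser in judgements:
        wins[winner] += 1
        games[winner][loser] += 1
        games[loser][winner] += 1

    strength = {script: 1.0 for script in games}
    for _ in range(iterations):
        new = {}
        for i in strength:
            denom = sum(n / (strength[i] + strength[j])
                        for j, n in games[i].items())
            new[i] = wins[i] / denom
        mean = sum(new.values()) / len(new)         # stop the scale drifting
        strength = {i: s / mean for i, s in new.items()}
    return {i: math.log(s) for i, s in strength.items()}

# Invented judging data: (winner, loser). 'sep_*' scripts anchor the July window.
judgements = [
    ("jul_2", "sep_1"), ("sep_1", "jul_1"), ("jul_1", "sep_2"),
    ("sep_2", "sep_1"), ("jul_2", "jul_1"), ("sep_2", "jul_2"),
]
for script, score in sorted(bradley_terry(judgements).items(),
                            key=lambda kv: kv[1], reverse=True):
    print(script, round(score, 2))
```

In practice you’d want many more judgements per script and several judges, but even this toy run shows the core idea: the anchor scripts let September and July performances sit on the same scale, so the gap between the two samples gives a rough measure of progress across the cohort.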

All this would take half an hour of, say, 30 students’ time to sit the assessments and less than an hour to make the judgements of progress. It would provide data considerably more reliable, and probably a good bit more valid, than that produced by the current system. Acting on the results is the bit that could take considerably longer.

Read this for a taste of what comparative judgement feels like.

UPDATE: I’ve just read this from Jack Marwood, which is a much more thoughtful proposal for how we might use sampling for accountability.