The thing which most seems to rile people about testing is that it puts children under stress. A certain amount of stress is probably a good thing – there’s nothing as motivating as a looming deadline – but too much is obviously a bad thing.
Martin Robinson writes here that
… a teacher needn’t pass undue exam stress onto her pupils, and a Headteacher needn’t pass undue stress onto her teachers. People work less well under a lot of stress; by passing it down the chain, each link ceases to function so well. Therefore if a school wants to perform well, they should do a lot to take the pressure off. This is not done by telling children they needn’t be stressed by tests … it is done by letting the tests come and go with as little rancour as possible. How the tests have been introduced by the DfE and the content of the said tests is open to question but increasing panic throughout the system doesn’t help. How to not panic so much? Well, maybe more testing, low stakes, as part of regular teaching and learning could help.
All this is, of course, true. I wrote here that it’s not testing, but the stakes which cause stress. Another problem is that a test is only as good as the purpose for which it is designed. The Key Stage 2 SATs pose a particular problem: what are they for? Are they to assess how well children have mastered the curriculum, or to hold schools to account for how well they’ve covered the curriculum? Or both?
We all know that tests should be both valid and reliable, right? It’s even more important to understand that validity is not actually a property of a test, but rather of the inferences we make from test outcomes. Dylan Wiliam tells us that “test users have a responsibility to determine whether the validity evidence produced by the developer does in fact support the chosen use of the test.” In other words, are we using tests for the purpose for which they were designed? If not, then it’s our own fault if we get meaningless results. If we’re using a test designed to assess children’s knowledge to hold schools to account, we’re bound to end up with invalid inferences. And the higher the stakes, the more invalid those inferences are likely to be.
The reliability of a test is just as important as its validity. We need to trust that results allow us to make meaningful inferences about students’ knowledge, skills, and abilities. Essentially, the higher the stakes, the more important precise measurement becomes. This is particularly important when test results are presented as grades, bands or levels. Generally speaking, we seem all too susceptible to ‘spurious precision’. If a student gets 88% on a test we’re likely to believe they’re more able than someone who gets 87%. But this might not be true: such small differences tell us relatively little about a person’s ability. Equally problematic is ‘spurious accuracy’. We’re likely to infer that a student with a B is more able than a student with a C, but this difference in grades might represent a difference of just one mark. As long as this information isn’t being used to blight anyone’s employment prospects there’s no problem, but the more the results matter, the more likely we are to draw confused conclusions. If a cohort of students get 87% instead of 88% we’re unlikely to notice. But if they all get Cs instead of Bs, suddenly someone’s job is on the line.
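To make the point concrete, here’s a minimal simulation – my illustration, with made-up numbers rather than anything from an exam board. It assumes a 100-mark paper with a B/C boundary at 60 marks and a standard error of measurement of about 3 marks, and asks how often a student whose ‘true score’ is 61 would nevertheless walk away with a C:

```python
import random

random.seed(1)

B_BOUNDARY = 60  # hypothetical boundary: 60+ marks is a B, 59 or below a C

def one_sitting(true_score, sem=3.0):
    """Simulate one administration of a 100-mark paper: the observed mark is
    the 'true score' plus random measurement error (SEM of ~3 marks assumed)."""
    return round(random.gauss(true_score, sem))

# A student whose true score sits just one mark above the boundary...
sittings = 10_000
c_grades = sum(one_sitting(61) < B_BOUNDARY for _ in range(sittings))
print(f"Graded C in {100 * c_grades / sittings:.0f}% of {sittings} simulated sittings")
```

On these made-up but plausible numbers, the ‘B student’ comes away with a C in roughly a third of sittings – which is exactly why treating a one-mark grade boundary as an accurate measure of ability is so risky.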
The answer is not to use SATs (or GCSEs) to hold schools to account. Tim Oates suggests intelligent sampling may be the way forward: “The Finnish State has a history of testing too: tests from the centre, not to all children but to a sample, for the state to make judgements about the quality of schooling in the country.” There’s no need to test all students, just a representative sample. And because a test is only taken by a sample, the results are meaningless at an individual level. Students are unlikely to be anywhere near as bothered by them as they would be when taking a test with high individual stakes. Of course, for teachers and schools the stakes might be high, but as this can’t be passed on to students it won’t warp the curriculum in the way the current system does.
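For what it’s worth, here’s a rough sketch of how a representative sample might be drawn – stratifying by prior attainment band so the sample mirrors the cohort’s make-up. The field names and bands are my assumptions, purely for illustration:

```python
import random
from collections import defaultdict

def stratified_sample(pupils, band_of, fraction, seed=0):
    """Draw roughly `fraction` of pupils from each stratum (here, a prior
    attainment band) so the sample mirrors the cohort's composition."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for pupil in pupils:
        strata[band_of(pupil)].append(pupil)
    sample = []
    for group in strata.values():
        take = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, take))
    return sample

# Hypothetical cohort of 180 pupils spread across three attainment bands
rng = random.Random(42)
cohort = [{"id": i, "band": rng.choice(["low", "middle", "high"])}
          for i in range(180)]
sample = stratified_sample(cohort, band_of=lambda p: p["band"], fraction=0.05)
print(len(sample), "pupils selected")  # roughly 5% of the cohort
```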
If I were leading a school, this is what I would do:
- Identify a representative sample of students – maybe about 5% of the cohort.
- Give them a benchmark assessment in September using comparative judgement to produce a ‘true score’ for students in the sample (there’s a sketch of how comparative judgement works after this list).
- Identify a second, similar, sample of students.
- Give them a similar test in July using some of the September scripts to anchor the judgements and show whether progress is being made across the cohort.
- Act on the results! (This is the hard bit.)
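Comparative judgement replaces mark schemes with repeated pairwise decisions: judges simply pick the better of two scripts, and a statistical model converts those decisions into a scale score for each script. The model usually used is Bradley–Terry. Purely as an illustration of the idea – the toy data and bare-bones fitting routine below are mine, and real comparative judgement software does considerably more:

```python
from collections import Counter

def bradley_terry(judgements, items, iters=100):
    """Fit a basic Bradley-Terry model to pairwise judgements.
    `judgements` maps (winner, loser) pairs to counts.
    Returns a strength score per item; higher means judged better."""
    strength = {i: 1.0 for i in items}
    wins = Counter()
    pairs = Counter()  # how often each pair of scripts was compared
    for (winner, loser), count in judgements.items():
        wins[winner] += count
        pairs[frozenset((winner, loser))] += count
    for _ in range(iters):
        updated = {}
        for i in items:
            # Zermelo's iterative update: wins divided by 'expected exposure'
            denom = sum(pairs[frozenset((i, j))] / (strength[i] + strength[j])
                        for j in items if j != i)
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {i: s * len(items) / total for i, s in updated.items()}
    return strength

# Toy data: judges saw pairs of scripts and picked a winner each time
judgements = Counter({("A", "B"): 2, ("A", "C"): 1, ("B", "C"): 1, ("C", "B"): 1})
scores = bradley_terry(judgements, items=["A", "B", "C"])
for script, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(script, round(score, 2))
```

The September-to-July comparison in the plan above would come from including some of the September scripts in the July judging pool, so that both sets of scripts end up on the same scale.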
All this would take half an hour of, say, 30 students’ time to sit the assessments, and less than an hour to make the judgements of progress. This would provide data considerably more reliable, and probably a good bit more valid, than that produced by the current system. The acting on it bit could take considerably longer.
Read this for a taste of what comparative judgement feels like.
UPDATE: I’ve just read this from Jack Marwood – a much more thoughtful proposal on how we might use sampling for accountability.
GOSH, use a Stratified Random Sample? Apply a scientific approach? What a silly notion! We might ACTUALLY save money and time, reduce teachers’ workload and student stress, and learn something meaningful! Certainly, this would never work. Not enough profit for big test publishers.
With the SATs (and I would argue the AP tests as well), there is a question of what they actually measure. See the item from today’s Inside Higher Education by Ben Paris on the new SAT here: https://www.insidehighered.com/views/2016/05/17/implications-changes-sat-essay
The US SAT has always been disconnected from the high school curriculum and is supposed to test college preparedness, which is why (as Ben Paris says) vocabulary questions have always been prominent. Now, with the Common Core and the revised SAT, the US is edging towards a national curriculum and national testing, but the result is still nothing like our system … and the changes there are opposed by both left and right!
Generally, testing should be for a clearly defined purpose and should assess knowledge in a way that is appropriate to the subject and to the student. Many of us feel that the UK SATs, especially in English, fail on most if not all of these criteria.
I can’t see Tick, Box and Miss Management at the ministry buying into it. This common-sense approach to reducing teacher workload is far too radical. This is the sort of thing that a free, independent thinker might come up with and heavens – if it were instituted there would be a danger of encouraging actual thought and analysis across the system. Children might even be encouraged to enjoy learning. Finland? What could those pesky Europeans ever tell us about education when we have the US’s charter school model to follow, with all its bankrupt schools and unscrupulous entrepreneurs trousering the profits?
I can certainly understand taking a sampling approach to measure the system at a national level. This is the way international tests work: PISA, for example, first selects schools by stratified sampling and then randomly selects pupils from within sampled schools. But because pupils within a school are likely to be similar to one another, you would need to test a much larger number of pupils than you would if you selected pupils at random from the whole population in order to achieve a reasonably precise national measure of attainment (the logic is that it’s easier to collect data on a larger number of pupils from a smaller number of schools than on a smaller number of pupils from a larger number of schools). And of course you would need larger samples still if you wanted decent measures for sub-groups of the population (e.g. disadvantaged pupils).
What I’m less clear about is sampling within a school. Let’s imagine you sample 30 kids from an average-sized secondary cohort of 180. There are a lot of different samples of 30 you could draw (I appreciate you talk about samples of ‘similar’ pupils, but this is easier said than done). Unless you test the same kids in September and July, there could be fairly sizeable differences purely as a result of sampling variation. Why not just test the whole cohort?
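Both halves of this comment can be put into rough numbers. In the sketch below the intra-class correlation, the test’s standard deviation and the sample sizes are all illustrative assumptions, not figures from the comment:

```python
import math

# National level: pupils within a school resemble one another, so each extra
# pupil sampled from the same school adds less new information. The standard
# design-effect formula captures this: deff = 1 + (m - 1) * rho, where m is
# pupils sampled per school and rho is the intra-class correlation (assumed).
m, rho = 30, 0.2
deff = 1 + (m - 1) * rho
print(f"Design effect: {deff:.1f} (clustered sample must be {deff:.1f}x bigger)")

# Within-school level: how much can a sample of 30 from a cohort of 180 drift
# by chance? Standard error of the sample mean with the finite-population
# correction, for a test with an (assumed) standard deviation of 15 marks:
N, n, sd = 180, 30, 15
se = (sd / math.sqrt(n)) * math.sqrt((N - n) / (N - 1))
print(f"SE of the sample mean: {se:.1f} marks")
```

On those assumptions, a clustered national sample needs several times as many pupils as a simple random sample to be equally precise, and two different within-school samples of 30 would typically produce means a few marks apart through chance alone.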
You might be right, Dave – why not test the whole cohort? I think the argument against is that there’s little at stake for students in non-certificated exams, and so it seems unfair to spend significant periods of school time preparing students for a test that is only used for accountability purposes. Sampling might prevent this kind of perverse incentive.