Like an ultimate fact without any cause, the individual outcome of a measurement is, however, in general not comprehended by laws. This must necessarily be the case.
Wolfgang Pauli
A month or so back I met Professor Steve Higgins from Durham University’s Centre for Evaluation and Monitoring. He presented at researchED’s primary literacy conference in Leeds and what he had to say was revelatory. His talk was on the temptations and tensions inherent in the EEF’s Pupil Premium Toolkit. As most readers will know, the toolkit is a bit of a blunt instrument and presents interventions in terms of how many months of progress teachers can expect to add if they have a crack at them.
This leads to all sorts of misunderstandings and mistakes. Well-intentioned school leaders leap on the top-scoring interventions and confidently conclude, “Yay! If we do feedback and metacognition our students will make a whole 16 months of extra progress!” Sadly, it’s all a bit more complicated than that.
The reported impact for an intervention is an average. Research on each of the different interventions is aggregated to show a normal distribution of effects. So for an effect size of 0.8 we might get a distribution a bit like this:

So, what does this actually tell us? Well, for the headline figures to be meaningful, we really have to look at the shape of the distribution to see just how good our implementation of an intervention would have to be to get an average effect. Consider this example of one of the ‘best bets’ like feedback or metacognition:
The wide distribution tells us that some studies will have shown the intervention to have fairly poor impact whereas other studies will have demonstrated extraordinary impact. The area shaded in mauve indicates the sort of impact we would have to aim at in order to get anywhere near the +8 months reported by the Toolkit. In a best bet, our intervention only has to be of average effectiveness in order to reap rewards. This helps to explain why the possible negative impacts of feedback are so powerful. As Hattie says, “Feedback is one of the most powerful influences on learning and achievement, but this impact can be either positive or negative.”
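To make that concrete, here is a minimal sketch (my own illustration, not EEF code or data) of how a headline figure relates to the spread of studies behind it. The mean effect size of 0.8 comes from the example above; the standard deviation of 0.4 and the rough conversion of 0.1 of effect size per month of progress are purely illustrative assumptions.

```python
# A minimal sketch, not EEF code: how a headline "+8 months" average relates
# to the spread of study effect sizes, assuming a normal distribution.
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    """Probability that a normally distributed effect size falls below x."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

MEAN_ES = 0.8        # headline effect size for a 'best bet' (from the post)
SD_ES = 0.4          # assumed spread across studies -- illustrative only
ES_PER_MONTH = 0.1   # rough conversion implied above: 0.8 is roughly +8 months

share_at_or_above_headline = 1 - normal_cdf(MEAN_ES, MEAN_ES, SD_ES)
share_negative = normal_cdf(0.0, MEAN_ES, SD_ES)

print(f"Headline figure in months: +{MEAN_ES / ES_PER_MONTH:.0f}")
print(f"Studies at or above the headline average: {share_at_or_above_headline:.0%}")
print(f"Studies showing a negative effect: {share_negative:.1%}")
```

Change the assumed spread and the headline number stays exactly the same while the tails change dramatically, which is precisely why the shape matters.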
However, if we turn our attention to interventions with good, but more modest impacts, like, say, digital technology or small group tuition, both reported as providing +4 months progress, the bell curve will look more like this:
What this demonstrates is that our intervention will have to be slightly better than the average implementation of this approach in order to be as worthwhile as we might want. And if our intervention goes badly, there’s actually a risk it might have a negative impact on progress.
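Plugging illustrative numbers into the same picture shows why. In the sketch below (again my own toy model, not Toolkit data), both interventions are given the same assumed spread across studies; halving the average effect from 0.8 to 0.4 makes the negative tail several times fatter.

```python
# Toy comparison of two interventions with the same assumed spread of
# study effect sizes but different averages -- illustrative only.
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    """Probability that an effect size falls below x under a normal model."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

SD_ES = 0.4  # assumed spread of study effect sizes (not an EEF figure)

for label, mean_es in [("+8 months best bet", 0.8),
                       ("+4 months intervention", 0.4)]:
    p_negative = normal_cdf(0.0, mean_es, SD_ES)
    print(f"{label}: {p_negative:.0%} of studies would show a negative effect")
```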
Which brings us to some of the riskier approaches. Somewhat controversially, the EEF reports that there’s fairly robust evidence (indicated by the 3 padlocks) that implementing Learning Styles offers +2 months of progress for a very modest outlay of time and resources. That’s not too shabby, is it?
But surely Learning Styles has been thoroughly debunked and dismissed? What’s going on? Let’s have a look at the bell curve:
What this tells us is that it may actually be possible to implement Learning Styles in a way that benefits pupils’ progress. Maybe your school will be one of the lucky few. But the average effects are fairly negligible and probably not worth even modest outlays. And there, on the left-hand side of the distribution, is why implementing a strategy like Learning Styles is so risky: 50% of the studies will show impacts of less than +2 months, and an unacceptably high number will have reported negative impacts that actually impede pupils’ progress.
I should add that the bell curves shown in this post are from Steve’s presentation and don’t represent the actual distributions for the particular interventions I’ve discussed. Apparently feedback has one of the widest distributions of effects whereas other interventions have much sharper peaks with less leeway either side. Dylan Wiliam provided this distribution of the 607 effect sizes of feedback found by Kluger & DeNisi in their seminal 1996 meta-analysis.
As you can see, the distribution is anything but normal: the effects average around 0.41, but 38% of the effect sizes were negative. If this is anything to go by, it tells us our attempts to give students feedback must be very carefully thought out indeed. Doing feedback averagely well looks to be a waste of time! There are, however, some very intriguing outliers, which is where further research and experimentation ought to be focussed.
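A toy simulation with invented numbers (not Kluger & DeNisi’s actual data) shows how a distribution can average a healthy-looking positive figure while a large minority of effects are negative: mix a cluster of studies where feedback backfired with a larger cluster where it helped, add a couple of extreme outliers, and the mean on its own tells you very little about the shape.

```python
# Invented mixture, purely for illustration -- not the real feedback data.
import random

random.seed(1)

# A cluster of studies where feedback backfired, a larger cluster where it
# helped, and a couple of extreme positive outliers.
effects = ([random.gauss(-0.3, 0.3) for _ in range(240)] +
           [random.gauss(0.9, 0.5) for _ in range(360)] +
           [5.0, 12.0])

mean_effect = sum(effects) / len(effects)
share_negative = sum(e < 0 for e in effects) / len(effects)

print(f"Mean effect size: {mean_effect:.2f}")
print(f"Share of negative effect sizes: {share_negative:.0%}")
```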
After listening to the presentation, I asked Steve if the actual distributions are available. He explained that when the Toolkit was being put together it was decided that displaying the bell curves would be too complicated and would just over-burden poor, unsophisticated teachers.
I don’t know about you, but I said then and still think this is nonsense! Maybe it would be too complex for some, but then, as I used to tell my students, nobody ever rises to low expectations. Hell, I’m only an English teacher! If I can get it, so can you! If the information were available, some of us would make an effort to understand it. If it isn’t available, then we guarantee a lowest common denominator is achieved. (This is very much the problem with differentiating resources, by the way.) Happily, Steve agreed and committed himself to doing something about it. I saw him again yesterday and lo! he’s in the process of making the bell curves available on the EEF website. Hopefully soon, we’ll be able to click through from the headline figures and actually examine the shape of the curve for any intervention we’re considering implementing. This is a minor triumph for teachers and might be a small step along the road to greater professionalism.
One other thing which might have added a layer of interest to the EEF material would have been to know the proportion of pupils within each study, broken down by age and sector. From a Special school perspective, having read the technical report, knowing that there were only two distinct SEN journals used, equating to approximately 0.7% of the total number of studies and 0.14% of the number of pupils, helped me to consider the extent to which the information may relate well to my sector. That doesn’t mean it is right or wrong, but it leads to further enquiry rather than blind-faith acceptance. There are, I am sure, additional sources of SEND evidence within the data, but not knowing the breakdown makes it harder to engage with the toolkit rather than easier.
I take your point Simon. I’m not defending this sort of statistical approach – just explaining it. You’re right that the Toolkit summaries are an interesting starting place for further investigation and most definitely not a final answer.
Yes, but…
You’ll have a bell curve – but it won’t be based on the subject you are teaching or the age group you are teaching, but on some averaged data from a mixture of disciplines & ages. What’s more, the named intervention will differ hugely in its implementation in each experiment. I’m still underwhelmed by this entire approach to education research.
Oh, of course. That’s why Higgins describes them as ‘bets’. This kind of statistical research allows us to estimate probabilities but we always need to go back to that Wolfgang Pauli quote at the beginning.
I think the Pauli quote suggests that an individual student’s outcome is not comprehended by the ‘laws’. This would hold even if we had an experiment measuring the effects of a tightly defined intervention on a class of a particular age within a particular discipline. The meta-analyses provide nothing like that. So I can’t take the laws seriously, let alone the individual outcomes. Simonknight100 (above) is interested in the effects of an intervention in a SEN environment; I am (mostly) interested in the effects on GCSE & A level maths. I suspect there is very little there to guide either of us.
There is plenty to guide us but nothing that can be relied on to provide definitive answers. And this is as it should be, I think 🙂
This has to be the point for so many aspects of ‘overall findings’ in educational research. Yes – it doesn’t allow us to confidently predict individual outcomes – but – once we’ve seen it and (intelligently) absorbed what it appears to depict… does it not, nevertheless, shape our overall expectations and judgements of what might work in a slightly more astute way…?
Thanks for this David. Kevan Collins presented at our education conference in April and I thought he did a good job of both exploring the potential of the EEF toolkit and also the pitfalls of reading too much into the headline data. It is also one of the issues that we face when looking at Hattie’s work. I think that’s why Hattie’s core message of ‘know thy impact’ is important.
Your point about learning styles is a good one. I get the feeling that meta-analysis is probably most effective when doing two things: disproving an intervention or strategy, or showing the potential proportional impact between a variety of interventions.
An example of the first might be repeating a year. In Brazil, where I currently work, being asked to repeat a year of school is a common and accepted practice (though not in our school). The toolkit tells me that on average it has a negative impact. In itself this is noteworthy, as not many things actually have a negative average! So, when we discuss our approach as a school, it is fair to say that making a student repeat a year would be a risky gamble. Essentially, it would almost always be a risky bet, and we don’t want to take that sort of bet without an overwhelming reason. In the second case I think this issue of proportionality is why everybody looks to formative assessment. On average it has a high impact compared to other strategies we could implement, so ‘doing it well’ is probably an effort worth taking. But ‘doing it well’ is the key to every new development.
I agree with you that the graphs are worth presenting. I also hope that over time EEF will find a way to produce more case study type materials, both of the successes and more importantly the failures, where we learn more. How long do you have to sustain an initiative before it becomes part of the culture, not the climate? What pattern of CPD works best to support development? How do we vary implementation dependent on the subject, the students, the teachers, the type of school?
It is easy to be either overexcited or skeptical about the growing influence of research. As somebody smarter than me said recently the point is not that all this research answers questions but that it allows us to ask better questions. A good school is a place of knowledge and inquiry.
Hi Thom
Hattie’s ‘know thy impact’ is a nice line, but it’s completely impossible. How could you ever control for all the variables involved in teaching to know with any degree of certainty what impact individual actions were having? Of course you’re right to say that anything can be done well, but when the available evidence suggests grade retention has such a negative impact in most contexts, one has to wonder about statements like “on average it has a high impact compared to other strategies we could implement”. How do you know? Have you ever conducted a properly controlled trial?
The effect size graph highlights something that has bothered me about the application of meta-analysis in education; are the “baskets” that studies are put in too broad to be useful? Looking at the graph, there are two ways I can see of interpreting the data…
One is that some sorts of feedback have minimal or negative effect, but others have huge positive effect. So, if we can work out what all the hugely positive studies have in common (which will be more specific than “do feedback”) and do that, we will all be on the gravy train to outstanding outcomes.
The other possibility is that the approaches in the positive and negative studies aren’t all that different. It’s just that outcomes are a bit random, and the studies were a bit underpowered to tell the difference between luck and genuine long-term improvement…
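That second possibility is easy to demonstrate with simulated data. In the sketch below (entirely invented, with an assumed true effect of 0.4 and 20 pupils per group), every study measures exactly the same intervention, yet small samples alone scatter the estimated effect sizes widely and push a noticeable share of them below zero.

```python
# Simulated underpowered studies: same true effect everywhere, wide scatter
# in the estimates -- an illustration, not real trial data.
import random
import statistics

random.seed(2)

TRUE_EFFECT = 0.4   # the same true standardised effect in every study (assumed)
N_PER_GROUP = 20    # deliberately small, i.e. underpowered, studies
N_STUDIES = 500

estimates = []
for _ in range(N_STUDIES):
    control = [random.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
    treated = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(N_PER_GROUP)]
    pooled_sd = statistics.pstdev(control + treated)  # crude pooled SD, fine for illustration
    estimates.append((statistics.mean(treated) - statistics.mean(control)) / pooled_sd)

print(f"Mean estimated effect: {statistics.mean(estimates):.2f}")
print(f"Range of estimates: {min(estimates):.2f} to {max(estimates):.2f}")
print(f"Share of negative estimates: {sum(e < 0 for e in estimates) / len(estimates):.0%}")
```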
I think it’s a bit of both. There are certainly ways of doing feedback which are better or worse and we can learn what to avoid from feedback interventions that produce negative outcomes. But, I think you’re right to suggest that “outcomes are a bit random” – context will have a lot to do with ‘what works’.
Pleased to hear that the explanation of the strategies in the EEF Toolkit is to be published. When choosing an intervention we must first think about our pupils and what works well for them before jumping on the intervention-strategy bandwagon.
I don’t follow the argument about “Doing feedback averagely well looks to be a waste of time”. If we take Kluger and DeNisi’s average effect size of 0.41, and assume (a big assumption, granted) that this would be the effect size of learning with feedback over a sustained period, then this is a big effect size, equivalent to approximately doubling the rate of learning (i.e., learning in six months what learning without feedback would require a year to learn). The problem with Kluger and DeNisi’s analysis is that a mean effect size is hard to interpret when you include extreme effect sizes in your calculation. To understand how large the largest effect size of 12 found by Kluger and DeNisi actually is, it would be like taking the least able 11 year old in the country, and making her or him into the most able 11 year old in the country. But even if we remove these extreme values, the average effect of feedback is very substantial…
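For what it’s worth, a toy example with invented effect sizes illustrates the arithmetic: a single extreme value can drag the mean far above what most studies found, while trimming the extremes leaves an average that is still substantial.

```python
# Invented effect sizes, chosen only to illustrate the point about extremes.
effect_sizes = [-0.6, -0.2, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.1, 12.0]

raw_mean = sum(effect_sizes) / len(effect_sizes)

# Drop the single most extreme value at each end before averaging.
trimmed = sorted(effect_sizes)[1:-1]
trimmed_mean = sum(trimmed) / len(trimmed)

print(f"Mean including the extreme value: {raw_mean:.2f}")
print(f"Trimmed mean: {trimmed_mean:.2f}")
```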
A fair point. Some lazy assumptions on my part. Sorry.