The Unit of Education


If you cannot measure it, you cannot improve it.

Lord Kelvin

A lot of education research is an attempt to measure the effects of teaching (or teachers) on learning (or pupils). But is this actually possible?

Let’s first think about measurement in a very practical sense. Schools limit admission based on a sometimes very strict catchment area – if you want to make sure that your children attend a particular school you need to live within the catchment. For some very oversubscribed schools this can be a radius of less than a mile. If I measure the distance between my front door and the school I would like my daughter to attend I need some agreed unit of measurement for my reckoning to mean anything; the local authority won’t be interested in, “It’s quite close.”

In order to work out how close, we agree on a measurement system and measuring devices which enable us to define the criterion of being within or outside the catchment area. However, when it comes to measuring concepts such as progress, or learning, or teacher effectiveness, things become much more complicated. We still feel the urge to convert things into numbers, but often there is little agreement. We think we’re being precise when bandying about such numbers, but really they’re entirely arbitrary.

Remember the scene from the film This Is Spinal Tap where guitarist Nigel Tufnel proudly demonstrates a custom-made amplifier whose volume control is marked from zero to eleven, instead of the usual zero to ten? Nigel is convinced the numbering increases the volume of the amp: “It’s one louder.” When asked why the ten setting is not simply set to be louder, Nigel is clearly confused. Patiently he explains, “These go to eleven.”

And how often have you heard an over-enthusiastic school leader exhort teachers to give 110%?

Now I say all this because it strikes me that we have no agreed measure of impact or progress in education. Because there is such a heavy emphasis on ‘progress’ it has become desirable to find ways of measuring it. One such measure, which has been widely gobbled up, is the effect size (ES). This is a way of quantifying the magnitude of the difference between two groups, allowing us to move beyond simply stating that an intervention works to the more sophisticated consideration of how well it works compared to other interventions. John Hattie has done much to popularise the effect size, analysing thousands of studies by imposing the same unit of measurement on them. This raises the question: does the effect size give an accurate and valid measure of difference?

In order to answer this question we need to know what an ES actually corresponds to, i.e. what is the unit of education? An ES of 0 means that the average treatment participant outperformed 50% of the control participants. An effect size of 0.7 means that the average participant will have outperformed the average control group member by 70%. The baseline is that a year’s teaching should translate into a year’s progress and that any intervention that produces an ES of 0.4 is worthy of consideration.[i] [See Dylan Wiliam’s correction below.]
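For anyone who wants to check what an effect size corresponds to, here is a minimal sketch in Python. It assumes test scores in both groups are normally distributed with the same spread, which is the usual textbook assumption rather than anything guaranteed by real classroom data. The function name `percentile_of_control` is my own invention for illustration.

```python
from statistics import NormalDist


def percentile_of_control(effect_size: float) -> float:
    """Percentage of the control group scoring below the average treated pupil.

    Assumes normally distributed scores with equal standard deviations
    in both groups (an assumption, not a property of real test data).
    """
    return NormalDist().cdf(effect_size) * 100


for d in (0.0, 0.4, 0.7, 1.0):
    print(f"ES {d:.1f}: average treated pupil beats "
          f"{percentile_of_control(d):.0f}% of the control group")
```

Under those assumptions an ES of 0 puts the average treated pupil at the 50th percentile of the control group, 0.7 at roughly the 76th, and 1.0 at roughly the 84th.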

Australian education professor John Hattie went about aggregating the effects of thousands of research studies to tell us how great an impact we could attribute to the various interventions and factors at play in classrooms.

This is what he found:

[Image: Hattie’s table of effect sizes]

So now we know. Giving feedback is ace, questioning is barely worth it, and adjusting class size is pointless. You might well have a problem with some of those findings but let’s accept them for the time being.

Hattie then goes on to make this claim:

An effect-size of d=1.0 indicates an increase of one standard deviation… A one standard deviation increase is typically associated with advancing children’s achievement by two to three years, improving the rate of learning by 50%, or a correlation between some variable (e.g., amount of homework) and achievement of approximately r=0.50. When implementing a new program, an effect-size of 1.0 would mean that, on average, students receiving that treatment would exceed 84% of students not receiving that treatment.[ii]

Really? So if feedback is given an effect size of 1.13, are we really supposed to believe that pupils given feedback would learn over 50% more than those who are not? Is that controlled against groups of pupils who were given no feedback at all? Seems unlikely, doesn’t it? And what does the finding that Direct Instruction has an ES of 0.82 mean? I doubt forcing passionate advocates of discovery learning to use DI would have any such effect.

At this point it might be worth unpicking what we mean by meta-analysis. The term refers to statistical methods for contrasting and combining results from different studies, in the hope of identifying patterns, sources of disagreement, or other interesting relationships that may come to light from poring over the entrails of qualitative research.

The way meta-analyses are conducted in education has been nicked from clinicians. But in medicine it’s a lot easier to agree on what’s being measured: are you still alive a year after being discharged from hospital? Lumping the results from different education studies together tricks us into assuming different outcome measures are equally sensitive to what teachers do. Or to put it another way, that there is a standard unit of education. Now, if we don’t even agree what education is for, being unable to measure the success of different interventions in a meaningful way is a bit of a stumbling block.

And then to make matters worse, it turns out the concept of the ‘effect size’ itself may be wrong. There are at least three problems with effect sizes. Dylan Wiliam points to two: the range of children studied and the issue of ‘sensitivity to instruction’. Ollie Orange suggests another: the problem of time.

Firstly, the range of achievement of pupils studied influences effect sizes.

An increase of 5 points on a test where the population standard deviation is 10 points would result in an effect size of 0.5 standard deviations. However, the same intervention when administered only to the upper half of the same population, provided that it was equally effective for all students, would result in an effect size of over 0.8 standard deviations, due to the reduced variance of the subsample.[iii]
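The figure of “over 0.8” can be reproduced with a little arithmetic. This is a sketch, assuming scores are normally distributed and that the upper half is selected by a clean cut at the population mean; the variable names are mine. The standard deviation of one half of a normal distribution is the population SD multiplied by sqrt(1 - 2/pi), so the same 5-point gain is divided by a smaller number.

```python
from math import pi, sqrt

population_sd = 10.0  # SD of the whole population, in test points
gain = 5.0            # the same gain for every pupil, in test points

# SD of the upper half of a normal distribution (cut at the mean).
upper_half_sd = population_sd * sqrt(1 - 2 / pi)

es_whole_population = gain / population_sd
es_upper_half = gain / upper_half_sd

print(f"SD of upper half:    {upper_half_sd:.2f} points")
print(f"ES, whole population: {es_whole_population:.2f}")
print(f"ES, upper half only:  {es_upper_half:.2f}")
```

Nothing about the teaching has changed between the two calculations; only the spread of the pupils being measured.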

Older children will show less improvement than younger children because they’ve already done a lot of learning and improvements are now much more incremental. If studies are comparing the effects of interventions with six-year-olds and sixteen-year-olds and are claiming to measure a common impact, their findings will be garbage.

The second problem is how do we know there’s any impact at all? To see any kind of effect we usually rely on measuring pupils’ performance in some kind of test. But assessments vary greatly in the extent to which they measure the things that educational processes change. Those who design standardized tests put a lot of effort into ensuring their sensitivity to instruction is minimised. A test can be made more reliable by getting rid of questions which don’t differentiate between pupils – so if all pupils tend to get particular questions right or wrong then they’re of limited use. But this process changes the nature of tests: it may be that questions which teachers are good at teaching are replaced with those they’re not so good at teaching. This might be fair enough except how then can we possibly hope to measure the extent to which pupils’ performance is influenced by particular teacher interventions?

The effects of sensitivity to instruction are a big deal. For instance, it’s been claimed that one-to-one tutorial instruction is more effective than average group-based instruction by two standard deviations.[iv] This is hardly credible. In standardised tests one year’s progress for an average student is equivalent to one-fourth of a standard deviation, so one year’s individual tuition would have to equal 9 years of average group-based instruction! Hmm? The point is, the time lag between teaching and testing appears to be the biggest factor in determining sensitivity to instruction. Outcome measures used in different studies are unlikely to have the same view of sensitivity to instruction.
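The nine years can look odd at first, since 2 divided by 0.25 is 8. Here is one reading of the arithmetic that gives nine, sketched in Python: the tutored pupil makes the ordinary year’s progress plus the two standard deviation advantage. That reading is my assumption about how the figure was reached, and the function name is made up for illustration.

```python
SD_PER_YEAR = 0.25           # one year's progress for an average pupil, in SDs
TUTORING_ADVANTAGE_SD = 2.0  # claimed advantage of one-to-one tuition, in SDs


def years_of_group_instruction(advantage_sd: float,
                               sd_per_year: float = SD_PER_YEAR) -> float:
    """Years of average group teaching needed to match one tutored year.

    Assumes the tutored pupil gains a normal year's progress plus the
    claimed advantage over the group-taught pupil.
    """
    total_gain_sd = sd_per_year + advantage_sd
    return total_gain_sd / sd_per_year


print(years_of_group_instruction(TUTORING_ADVANTAGE_SD))
```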

The third problem is one of the time it takes to teach. Let’s say we decide to compare two teachers using identical teaching methods, teaching two classes of children of exactly the same age. We test both classes at the start of a unit of work and at the end to see what impact the teaching has had. If children in both classes made identical gains, what would such a comparison tell us? Superficially it appears we’re comparing like with like but if it takes the first class one week to learn the material and the second class two weeks to learn the material, then any such comparison is meaningless. The effect size would calculate both teachers as equally effective, but if the results are the same, one class learned twice as fast as the other. Any proper unit of education would need to account for the time it takes for students to learn a thing.
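To make the point concrete, here is a small Python sketch of the two classes. The effect size of 0.4 is a made-up number for illustration, and the class and method names are hypothetical; the only thing taken from the example above is that the gains are identical and the durations are one week and two weeks.

```python
from dataclasses import dataclass


@dataclass
class ClassResult:
    name: str
    effect_size: float  # gain from pre-test to post-test, in SDs
    weeks: float        # how long the unit took to teach

    def gain_per_week(self) -> float:
        """A rate rather than a distance: effect size per week of teaching."""
        return self.effect_size / self.weeks


# Illustrative numbers only: identical gains, different durations.
first = ClassResult("first class", effect_size=0.4, weeks=1)
second = ClassResult("second class", effect_size=0.4, weeks=2)

for result in (first, second):
    print(f"{result.name}: ES {result.effect_size:.2f}, "
          f"{result.gain_per_week():.2f} SD per week")
```

The effect sizes match, but the rates do not: the first class is learning at twice the pace of the second.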

In Hattie’s meta-analysis there’s little attempt to control for these problems. This doesn’t mean we should distrust the idea that the things he puts at the top of his list have greater impact than those at the bottom, but it does mean we should think twice before bandying about effect sizes as evidence of potential impact.

When numerical values are assigned to teaching we’re very easily taken in. The effects of teaching and learning are far too complex to be easily understood, but numbers are easily understood: this one’s bigger than that. This leads to pseudo-accuracy and makes us believe there are easy solutions to difficult problems. Few teachers (and I certainly include myself here) are statistically literate enough to properly interrogate this approach. The table of effect sizes with its beguilingly accurate-seeming numbers has been a comfort: someone has relieved us of having to think. But can we rely on these figures? Do they really tell us anything useful about how we should adjust our classroom practice?

A mix of healthy scepticism and a willingness to think is always needed when looking at research evidence, but assigning numerical values and establishing a hierarchy of interventions is probably less than useful.


[i] Robert Coe, It’s the Effect Size, Stupid: What effect size is and why it is important

[ii] John Hattie, Visible Learning

[iii] Dylan Wiliam “An integrative summary of the research literature and implications for a new theory of formative assessment.” In H. L. Andrade & G. J. Cizek (Eds.), Handbook of formative assessment (2010)

[iv] Benjamin S Bloom, “The Search for Methods of Group Instruction as Effective as One-to-One Tutoring”, Educational Leadership (1984)

2015-01-10T21:12:36+00:00 | January 8th, 2015 | Featured


  1. […] Read more on The Learning Spy… […]

    • mollyailsa January 9, 2015 at 7:57 pm - Reply

      I’m from Ireland, and there are a lot of problems with the education system here – mainly the short-sightedness of the Department of Education. For example, I’m in 6th year (final year) and am all ready to sit my Leaving Certificate (final exams in Ireland) but have just been told that I have to repeat a year on the grounds that I am too young, having only done my Junior Cert last year (3rd year exams). I’ve done all of the coursework and will be 16 by the time of the exams – I see this as being ageist. I set up a petition to let me sit my exams this summer – I’d be really grateful if you could take a look.

  2. Michael Rosen January 8, 2015 at 9:50 am - Reply

    This is really important. Thanks so much for putting this up. I listen to that stats programme on Radio 4. There is an argument for saying that we live in an era of bullshit maths. Government and any old think tank can produce some figures and people in the media will recycle them without putting them through the kind of analysis you’ve done here. People seem to have forgotten that at the heart of all research on human beings are the problems of a) controlling all the variables bar the one you’re looking at b) really comparing like with like c) making tests so ‘reliable’ they cease to be ‘valid’ – part of which is that the more you narrow down the field the less able the test can be taken to prove very much – and so on.

    As for testing ‘effects’ – it isn’t even that easy clinically. I take thyroxine for an under active or non-existent thyroid gland. I go for blood tests. The tests test what my pituitary gland does when it ‘notices’ that x amount of thyroxine (which comes into my body from pills, not from my thyroid) is circulating in my blood stream. The docs have a ‘window’ which delineates what is ideal pituitary function. Too high, the pituitary output is depressed. Too low, the pituitary is over-active. All fine. However, the one thing that this test doesn’t test is how thyroxine is being taken up or not being taken up in the muscle tissues – which is actually all that matters! So clinicians are testing one effect (basically a biofeedback effect) and not the effect which is the ‘purpose’ of having thyroxine in your body in the first place!

    I often see this as an analogy with the way some educational research works!

    • David Didau January 8, 2015 at 10:42 am - Reply

      This is my problem with education research – it’s not reliable or valid, and what it finds is neither meaningful nor measurable. I’m much more interested in research conducted in psychology labs – variables can be controlled to produce meaningful and reliable results. We can then use these results to make meaningful and (perhaps) measurable predictions about what is likely to work in classrooms.

      Thanks for your comment Michael

  3. ollieorange2 January 8, 2015 at 9:59 am - Reply

    The ‘Effect Size’ doesn’t take time into account either.
    Say you look at two teachers, teaching two classes. You test both classes at the start of a unit of work and at the end. Both classes make exactly the same progress. However, it takes the first class one week to learn the material and the second class two weeks to learn the material. The ‘Effect Size’ would calculate them both as equally effective. However, I would say that the first class was learning twice as fast.
    Any proper unit of education would need to be ‘you learned this much per this length of time’, much like miles per hour.

    • David Didau January 8, 2015 at 10:39 am - Reply

      Excellent point Ollie – of course! Thanks – I’ve now amended the post to include this in the main body.

  4. Frederick Sandall January 8, 2015 at 11:08 am - Reply

    The issue with measuring impact of teaching is that it is never going to be an exact science. In addition to what Michael Rosen states it is also so dependent on contextual factors which will vary from minute to minute and from student to student. Even when we think we have ironed out all the other factors on the list above, for some students the factor with the least impact, which is class size, will be the one factor that matters most etc.

    • Beccy January 10, 2015 at 6:04 pm - Reply

      I agree. The problem with any research is that it only solves your problems if you sit in the middle of the bell curve. So somebody with high levels of side effects from medications may be pressurised to take medications by their doctor because evidence says they work, but for them the side effects may outweigh the benefits. For example stopping them from working when the outcome they were hoping treatment to achieve was staying in work. Many teachers are really good at accommodating the needs of students on the edges of different spectrums. For example Introversion/ extroversion, fine motor skills, Autism/emotional sensitivity, gaining understanding from one source/multiple sources or reading/talking it through. The research will not allow for any of this subtlety and an inexperienced leader may enforce its use against the best interests of some people. And whilst the mantra about all people all of the time may have some weight the people on the edges of the bell curve have often created work of incredible value to society and that is not something we can afford to lose. So once again we need to be careful not to lose sight of the importance of complexity.

  5. Dylan Wiliam January 8, 2015 at 11:12 am - Reply

    A couple of points. First, when you say “An effect size of 0.7 means that the average participant will have outperformed the average control group member by 70%” this is not quite correct. An effect size of 0.7 means that the experimental group outperformed the control group by an average of 0.7 standard deviations. To put this in context, an average person in the experimental group would be at the 76th percentile of the control group, or, to put it another way it would take an average student just into set 1 of 4.

    Ollie Orange quite correctly points out that the effect size does not take into account the time interval. In measurement terms, it’s a distance measure, not a rate. However, this is not a problem with meta-analysis per se, because careful researchers can, and do, include the duration of the intervention as a moderating variable. It is a problem when the research studies don’t make clear the duration of the experiment, or people do meta-analysis without including the duration of the experiment as a moderating variable.

    Finally, in case anyone’s interested here is the full Kelvin quote:

    “I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be.”

  6. David January 8, 2015 at 12:10 pm - Reply

    Another fantastic post. In addition to Kelvin’s quote at the top, I would add that you can’t measure something without changing it (Goodhart’s Law IIRC). Specifically, when you compare test outcomes of a class which is totally unobserved by researchers – a class in ordinary time – and compare that to the outcomes of a class which is involved in a research trial, you can’t pretend that the quality of teaching is the same, save for one thing such as the amount of feedback given. Everything about a class will be better if there are strangers observing. Ever had a lesson as part of a job interview with the board of governors sitting at the back?

    Another point worth stressing is that medical research in the form of clinical trials – absolute best practice evidence based research – requires a control group who take a placebo and this seems to be too sensitive a decision for school managers to take – to deliberately not improve the learning for a control group of young people.

    • Dylan Wiliam January 9, 2015 at 2:35 pm - Reply

      David’s point about placebos doesn’t go far enough. If you wanted to emulate a medical double-blind placebo-controlled randomized trial in education you would need to get half of the teachers to implement your chosen intervention, and get the other half of the teachers to implement a different innovative programme that was known to be ineffective. And you would also make sure that those using the ineffective programme believed that it was effective…

      • Christian Moore January 21, 2017 at 8:32 am - Reply

        Dylan you are correct, but for emphasis it’s also worth mentioning that in a clinical trial, a sugar pill really will be ineffective at treating a disorder and will also be ineffective at causing any other change in the body that isn’t the placebo effect.

        But in an education setting any intervention is likely to have some sort of effect, be it negative or positive.

  7. Gerald Haigh (@geraldhaigh1) January 8, 2015 at 8:49 pm - Reply

    David what you’ve done here is pin down beautifully what to me has always been the elephant in the room when it comes to educational research, which is that there are simply too many intervening variables for any of it to be either valid or reliable. The best we can hope for is that there’ll be a general indication that this or that seems to work most of the time. I sometimes wonder whether researchers really know what teachers and children and classrooms are actually like.

  8. Dominic Salles January 9, 2015 at 12:06 am - Reply

    The bigger your sample size, the more likely your variables are to cancel each other out. If I know it is three times further to get to York than London, but I don’t know how many miles it is to either, does that matter? With limited fuel, I set off for London. Does it matter that I might only get to Ealing instead of Leicester Square? No, you started out from Bristol, and you’ve done pretty well with the information available. It was very useful, even if not complete. That’s the worst case scenario of effect sizes.

  9. Dylan Wiliam January 9, 2015 at 2:38 pm - Reply

    I don’t understand this point. The effect size will be exactly the same with a bigger sample, but your chances of getting a significant result will be higher. Or, to put it another way, your confidence interval for the effect size will be narrower.

    • Dominic Salles January 10, 2015 at 4:54 pm - Reply

      That’s essentially my point Dylan. Because research is aggregated in meta studies by the likes of Hattie and the Sutton educational trust, we can have a lot more confidence in the effect sizes they propose. David Didau’s position seems to be that because we cannot control all the variables in measuring, in my analogy, the distance from Bristol to London or Bristol to York, then we cannot trust any of the measurements. My point is that we can invest a lot of trust in the measurements – the effect sizes – because the confidence interval will be small – the equivalent of the distance from Ealing to Leicester Square on a journey from Bristol to London. I thought David would enjoy the metaphor from a fellow English teacher.

      • David Didau January 10, 2015 at 5:03 pm - Reply

        I’ve said nothing in this post about our inability to control variables. My problem, to use your analogy is that if the distance from Bristol to London is measured as 118 miles and the distance from Bristol to York is measured as 221 miles then we can know that the difference between these distances is, roughly, 103 miles.

        If the distance between B and L is measured using my stride and B to Y is measured using your stride we can still infer that the distance to Y is greater but we have a much more imprecise measure. (This speaks to the pseudo-accuracy of effect sizes.)

        But if the first distance is measured using biffins and the second distance uses pellets, we’re buggered.

        • Dominic Salles January 15, 2015 at 7:15 pm - Reply

          Yes, I agree with your second paragraph. That is my point. Meta studies give us a best guess. I don’t see that we are measuring using completely different scales, as your final paragraph implies.

          Besides, the alternative is to argue that we can therefore measure nothing of value unless we can have a precise control group. Therefore we cannot distinguish effective teaching from ineffective. Might as well leave it all to the politicians and Ofsted and the man in the street, they will have just as much chance of being right as anyone else?

      • Dylan Wiliam January 15, 2015 at 4:43 am - Reply

        Actually, the effect sizes proposed by Hattie are, at least in the context of schooling, just plain wrong. Anyone who thinks they can generate an effect size on student learning in secondary schools above 0.5 is talking nonsense.

        • ingotian January 15, 2015 at 10:16 am - Reply

          Well I’m sure you could get an effect size of 0.5 in an individual under certain circumstances but it is very unlikely to be achievable in a sufficiently large population to draw generalised conclusions about education. I think stating effect size to two decimal places gives away the lack of rigour in thinking about precision 🙂

        • David Didau January 15, 2015 at 5:23 pm - Reply

          Really? Can you explain why?

        • Dominic Salles January 16, 2015 at 9:10 am - Reply

          I’d like to know more about this. Take a school that does GCSE media studies, beginning and ending in year 9. The cohort is made up of students entering on level 4 and 5, in approximately 1:1 ratio. There are 100 of them, who opt for this. They take the GCSE at the end of year 9 and 90% of them get C to A* – the full range of grades. Their grades exceed those gained nationally by students taking Media studies in year 11, when they have had two years of teaching and are two years older.

          What is a likely effect size?
          Hattie says Acceleration = d 0.88
          You say 0.5 is a ceiling in a real life setting of the whole school year.

          Which is closer to the likely effect size in the real life scenario above?

          • ingotian January 16, 2015 at 11:07 am

            This demonstrates the problem. If we started teaching GCSE in Year 7 targeting the exam contents and moved the exam down to Y10 and got the same final results, that would be a significant effect size gain but are the pupils better educated? Are there things lost in this process? Stuff they would have been taught in KS3 that now they are not because it’s all focused on the GCSE syllabus and what the exam tests. Where do we take the measurements? From Y7 (or before) to Y11 or from the normal course beginning? Of course we would expect in any scientific study that there would be control groups set up to factor out these things but is this the case with all the data gleaned in metastudies? I very much doubt it. Then there is the assumption that the uncertainties are random and cancel each other out. Systematic errors are not uncommon. For example you wouldn’t need too many qualitative research projects riddled with confirmation bias in the researchers to skew the data upwards. And as I stated earlier, quoting measurements of effect size to 2 decimal places implies accuracy in the measurement to 1% or better and that is just unbelievable.

            You might be able to measure the effect on specific exam outputs in this sort of way but my understanding is these studies are a lot broader than that. They might give some broad idea that a) is more effective than b) but I’d be very cautious about putting numbers to it without having a very good idea of the precision and accuracy of the measurements and the validity of all the contributed data.

          • Dominic Salles January 16, 2015 at 2:57 pm

            Thanks Ingotian,
            I can accept all you say about the lack of absolute precision in measurement. I guess what I am also saying is, does that invalidate the comparative effect sizes?

            To me it is just as likely that all research projects will suffer from the same temptations towards confirmation bias – they would therefore also cancel each other out – it does not need to be randomness that does this.

            In the example I gave, I would want to know that the d effect size meant something – and then I could judge if it was worth trying to replicate. I think it would be. If a meta-analysis looked at 100 similar pieces of research I would be very comfortable accepting what they found as an overall effect size. The lack of precision is simply not that important in my coming to this decision – it would show the impact was positive, and large.

            The ideological point – should students take a GCSE in one year just because they can, and taken early because GCSEs might be inherently easy – is a separate one.

          • ingotian January 16, 2015 at 6:08 pm

            I think the point about randomness and cancelling is that we don’t know that an instance of confirmation bias does not work in the same direction for some measures and not others. If the thing being studied was very tightly controlled with double blind techniques etc there might be virtually no confirmation bias at all. If it was very subjective there might be a lot and in such cases the bias would likely be to confirming the effectiveness of a method and artificially inflating its effect size. That would then exaggerate any difference between this and the first example. Point is we really don’t know how much of this might be random or systematic so we can’t assume any cancelling out process. 100 pieces of work all affected by the same weak methods might make that element look a lot more effective than another when it wasn’t.

            Doing a GCSE in a year depends on what you are making room to do. If it was a bright child who could go on to A level work usefully, why not?

  10. @petenealon January 9, 2015 at 3:24 pm - Reply

    Interesting post. Within education of course there are variables – also the moral implication of applying the placebo effect above to a student’s learning means that it will never take place in the same manner as it could within medical trials.

    Research ultimately could be used to justify anything, it could be used by poor leadership to create poor working environments or poor teachers to develop ineffective practice. However, what I think research, when done as thoughtfully as it can be, does offer is the opportunity for the teaching profession to look at itself with a rigour and reflection that should be beneficial. The evidence produced should then be the start of the conversation as to what it might offer for the learners that the teacher is working with. It can’t ever be the ultimate signpost or solution to what we do and I don’t think anyone rational would expect it to.

  11. Jane Smith January 9, 2015 at 4:41 pm - Reply

    You will find many people too SCARED to comment on this for FEAR of their jobs but they wholeheartedly agree and have experienced what you’re saying from WHERE you write about – and yes, I am talking from experience

  12. […] The first post is for David Didau and his recent post “The Unit of Education”, linked here. […]

  13. ingotian January 10, 2015 at 4:41 pm - Reply

    All it means is not everything that counts can be counted with precision – I’m sceptical that these effect sizes are accurate to 2 decimal places. Especially since the “Effect” is not clearly defined independently of context. I’d be happy if someone wanted to demonstrate the analysis to show that they are that precise and that the precision is constant across the range of measurements in all learning contexts. Very, very unlikely.

    Are the measurements entirely useless? As with all information from empirical data, there is use to be had, but you really need to understand context and uncertainty or the numbers can do more harm than good. One thing this seems to show is we really do need good maths and science education for all because clearly it is a very small proportion of the population that really can use these types of measurements sensibly. Most are likely to hi-jack them for a political agenda and claim it is “scientific”.

  14. Piers Young (@piersyoung) January 17, 2015 at 6:50 pm - Reply

    Great post – thank you, David.

    It highlighted a couple of things for me. First, it made me think again about teachers’ longing for a silver bullet. Hattie’s work is trotted out in lots of places and that’s fine. It’s accepted on face value in lots of those places, which is less fine, but understandable because – as the above thread shows – the maths behind some of this research is not 100% straightforward. I share your preference for the psychology research but am hugely aware a) in terms of understanding the research I’m an amateur, possibly like many teachers, and b) I’m likely to be swayed by mentions of words like neuroscience (e.g.

    The second follows on from the first in that, if we can’t measure things sensibly, if your average teacher is likely to veer – with the best intent – from one piece of groundbreaking research to the other, and if there is no silver bullet, then it would seem to beg the question how do we actually improve as a profession? My guess is that the best way to please Kelvin’s ghost is to make sure we pay more attention to the research on what doesn’t work.

  15. […] learning as likely to accelerate students’ learning by 5 months. This equates to an effect size of 0.36 – 0.44 which is just around the hinge point (0.4) at which Hattie declared an […]

  16. […] D. 2015. The Unit of Education. [ONLINE] Available at: [Accessed 24 January […]

  17. Assessment | Pearltrees May 9, 2015 at 8:55 am

    […] The Unit of Education. If you cannot measure it, you cannot improve it. Lord Kelvin. A lot of education research is an attempt to measure the effects of teaching (or teachers) on learning (or pupils.) […]

  18. […] *I’ve critiqued the idea of effect sizes here. […]

  19. Dylan Wiliam January 21, 2017 at 8:39 pm

    David: I’ve just read your post again, and I realized I should have pointed out that meta-analysis was not borrowed from medicine. In fact, in one of the most delicious ironies, medical research borrowed it from education and psychology. It was Gene V Glass, in his presidential address at AERA in 1976, that first introduced the term, and the methods. The paper is available here: And while I’m posting, I realize that all I want to say in response to the various responses above is in chapter 3 of my most recent book, “Leadership for teacher learning”. Right now, meta-analysis is an unsound guide to effective action in education.

    • Geoff Petty January 24, 2017 at 1:50 pm

      Dylan, we all know you focus on effective formative assessment, which happens to have a high average effect size, especially well-designed formative assessment. I’m sure you are right to focus on formative assessment, but why not choose another factor, of which there are very many – team teaching say? Are you in any way affected by the high average effect size of formative assessment?

      I agree with David when he writes “A mix of healthy scepticism and a willingness to think is always needed when looking at research evidence”.
      But I can’t agree that “assigning numerical values and establishing a hierarchy of interventions is probably less than useful.” If we can’t compare the relative effects of interventions, however crudely, all we have to replace it is our own very limited experience, and our subjective evaluation of factors that affect achievement. Useful of course, but incomplete and biased.

      I think hierarchies can help crudely sift factors to find those we might best experiment with in order to improve our teaching, and to find those factors less likely to help. To throw out all effect size research because it’s crude seems immoderate, if there’s nothing better we can use to compare the factors that might affect student achievement.

      • George LILLEY June 23, 2017 at 9:10 am

        It is interesting that in David’s recent post he presents Hattie’s latest model –

        and Hattie, the king of hierarchies, is now arguing against using them –
        “There is much debate about the optimal strategies of learning, and indeed we identified >400 terms used to describe these strategies. Our initial aim was to rank the various strategies in terms of their effectiveness but this soon was abandoned. There was too much variability in the effectiveness of most strategies depending on when they were used during the learning process …”

  20. […] The Unit of Education – Measuring Teaching […]

  21. […] accounting for the real problems with taking such an approach. I’ve outlined my reservations here. Second, the picture is partial. Some of the most robust, well-replicated findings from cognitive […]

  22. […] of which I acknowledge is highly problematic and has been critiqued in numerous posts such as here, here, here, here, here, and here) that I will refer to in this post. To be entirely honest (even […]

  23. […] the intervention result in an effect size of above 0.4? I’ve written before of my scepticism about effect sizes, but Hattie’s point that everything has an effect is an important one. If pretty much […]

  24. […] For a more general critique of effect sizes and meta analyses, see this post. […]

  25. steluta bageag October 5, 2017 at 7:58 am

    There is another “parameter” that should be considered: the willingness to learn. More and more pupils (students) are not interested in learning anything. They read nothing but Facebook pages or similar networks. They have real problems reading a text at first sight, and to them it is not much of a problem. Is that “parameter” quantifiable? I know it is a result of modern civilized society with its aggressive marketing politics. It is a problem of the soul. There are countless proofs (these are quantifiable) that learning does not offer a good life in terms of material goods, starting with the status of teachers.
