Yesterday I saw a thread on Twitter from headteacher Stuart Lock on the pros and cons of the new inspection framework:
I realised in reading the discussion on twitter today what my problems are with the new @Ofstednews framework but why I think those issues were inevitable – and why it's still the right bet.
— Stuart Lock (@StuartLock) January 12, 2020
In it he discusses the idea that because the previous inspection framework relied heavily on schools’ results in national exams when making judgements, it managed to be fairly reliable. That is to say, an inspection team inspecting two schools with similar results, or two different inspection teams inspecting the same school, would arrive at broadly similar judgements. In 2015 Ofsted conducted some research on the reliability of its judgements (the report can be found here). Two independent inspectors carried out short inspections in 24 primary schools and, in “22 of the 24 short inspections, the inspectors agreed on their independent decision about whether the school remained good or the inspection should convert to gather more evidence.” Although the report found evidence that inspectors interpreted the available evidence subjectively, the overall judgements were very reliable (0.92).
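For what it’s worth, the headline figure looks like simple arithmetic: 22 agreements out of 24 paired inspections gives a raw agreement rate of roughly 0.92. A minimal check, on my assumption that the reported 0.92 is this raw proportion of agreement rather than a more sophisticated statistic:

```python
# Raw agreement: the proportion of paired inspections in which both
# inspectors independently reached the same decision.
agreements, total = 22, 24
print(f"raw agreement: {agreements / total:.2f}")  # -> 0.92
```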
Tucked away at the end of the report was a short section on validity. Essentially, a measurement is valid if it measures what it purports to measure. In the case of inspections, are inspectors looking at things that actually reveal the quality of the education provided by a school?
The report highlights the fact that reliability without validity is less than useful: “the positive findings from this current study will be largely irrelevant if the components of current inspection processes are found to have little association in determining school quality.” As Stuart Lock points out in his thread, if a set of scales reliably provides an incorrect weight, we’re not only ignorant but convinced we know something which, it turns out, is untrue.
In a blog post from August 2019, Amy Finch, Ofsted’s Head of Strategic Evaluation, suggested “there is very often a trade-off between reliability and validity.” As Stuart Lock points out, “Ofsted was over-emphasising the published outcomes in the framework, and hence Ofsted judgements were, to a greater or lesser extent, an invalid measure of the quality of education.” Therefore, one of the aims of the new inspection framework, launched in September 2019, was to increase the validity of inspections. As Amy Finch describes it, one of the principles of the new framework is that:
Exam data gives the appearance of precision but is in fact not perfectly valid or reliable, for many reasons. For one thing, exams can generally only test pupils on a small sample of what they know about a subject, which may or may not be representative of everything they know. And of course, putting too much weight on exam data can lead to undesirable behaviours in schools.
She goes on to say that while perfect reliability will always be impossible when making complex judgements in the real world, “an inspection system that was ‘perfectly’ reliable would not be looking at the right things.” This is a fascinating idea: the more reliable an inspection is, the less valid inspectors’ inferences are likely to be. As Finch says, “The things that are hard to measure are often the things that matter most.”
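To make ‘reliable but not valid’ concrete, here’s a toy simulation (entirely my own construction, nothing from Ofsted or Finch). If two inspectors both lean heavily on the same proxy for quality – say, exam results – they will agree with one another almost perfectly while both remaining only weakly related to the thing they actually care about:

```python
# Toy model: two 'inspectors' who both read the same noisy proxy
# (exam data) agree with each other (high reliability) while neither
# tracks true quality well (low validity). Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

true_quality = rng.normal(size=n)                # what we'd like to measure
proxy = 0.3 * true_quality + rng.normal(size=n)  # exam data: weakly related to quality

# Each inspector reads the proxy with a little personal noise.
inspector_a = proxy + 0.1 * rng.normal(size=n)
inspector_b = proxy + 0.1 * rng.normal(size=n)

print("reliability (A vs B):   ", np.corrcoef(inspector_a, inspector_b)[0, 1])
print("validity (A vs quality):", np.corrcoef(inspector_a, true_quality)[0, 1])
```

Run it and the inspector-to-inspector correlation comes out around 0.99, while the correlation with ‘true quality’ sits near 0.3: high reliability, low validity.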
So, with the new inspection framework’s attempt to measure the quality of education provided by a school by widening the focus of inspections to look at the curriculum while discounting the overarching importance of exam data, is there bound to be a trade-off between validity and reliability?
Ofsted certainly seem to think so, and in her blog post Finch sets out how Ofsted are trying to ensure the reliability of inspections remains high. One effort has been to study the reliability of some of the smaller judgements that make up the overall judgement, such as lesson observations and work scrutinies. Ofsted published the results of these studies last June, broadly finding that “judgement reliability improved significantly when book scrutiny and lesson observation reliability also improved.”
Trading off some reliability in favour of greater validity is probably worth the risk, but that’s small comfort to schools that find themselves on the wrong side of reliability.
Yesterday I published a post on the furore around whether secondary schools are being downgraded for having a two-year Key Stage 3. This was prompted by the case of Harris Academy St John’s Wood which, despite being graded as ‘good’, would, according to various commentators, have been judged ‘outstanding’ under the previous framework. Obviously, it’s impossible to know whether this is true as we can never see the counterfactual, but what we do know is that the only criticism of the school in the published report concerned the breadth of curriculum students experience in Year 9. My attention has also been drawn to Castleford Academy – a school recently judged to be ‘outstanding’ but which, according to its website, offers a very similar curriculum to that of Harris St John’s Wood. Interestingly, the inspection report makes no mention at all of the school’s curriculum structure, saying only, “Leaders map out carefully what pupils need to know and remember. This is done across key stages 3 and 4.”
Now, we have no way of knowing if the education offered by these two schools is broadly similar or if one is clearly better than the other, but we can compare the two schools’ Progress 8 data:
Is Castleford better than Harris St John’s Wood? You tell me. One thing that’s for sure is that Ofsted’s inspection reports cast no light on the matter. As Stuart ends his thread by saying, “the new reports are appalling, and if it’s because they’re directed at parents, I think it is patronising! This is something I wish they’d review urgently.” This is a sentiment with which I can only agree.
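As an aside, for anyone wanting to make that comparison properly: the DfE publishes Progress 8 scores alongside 95% confidence intervals, and a reasonable rule of thumb is that where the intervals overlap, the data alone can’t tell you one school is better. A minimal sketch with entirely made-up numbers (school_a and school_b are hypothetical placeholders, not the two schools above):

```python
# Rough rule of thumb: if two schools' Progress 8 confidence intervals
# overlap, the data alone can't separate them. All values are invented.

def intervals_overlap(lo_a, hi_a, lo_b, hi_b):
    return lo_a <= hi_b and lo_b <= hi_a

# (P8 score, lower 95% bound, upper 95% bound) -- hypothetical figures
school_a = (0.55, 0.35, 0.75)
school_b = (0.70, 0.50, 0.90)

if intervals_overlap(school_a[1], school_a[2], school_b[1], school_b[2]):
    print("Intervals overlap: the P8 data can't separate the two schools.")
else:
    print("Intervals don't overlap: a genuine difference is more plausible.")
```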
When it comes to measuring reliability, Ofsted’s way of working has always presented a particular challenge. Whilst the framework for school inspections has varied over time, one enduring feature of inspection practice has been a commitment to enabling inspectors to exercise professional judgement when evaluating the data they have collected. Ofsted have frequently emphasised the importance of expert judgement in applying the inspection handbook, often stating that it should be regarded not as a set of inflexible rules but as an account of the procedures that govern inspection.
The challenge, therefore, is to determine precisely what is meant by the reliability of Ofsted inspections when it is wholly feasible for two inspectors to use their expert judgement to reach legitimately different views on the overall outcome of an inspection.
Ofsted has, of course, tried to equate consistency of judgement with reliability. A test of the consistency of inspectors’ decisions and judgements during short inspections was carried out by the inspectorate in March 2017. Although Ofsted proclaimed that the results of this test showed a high level of consistency, and hence reliability, the veracity of this assertion has been shown to be weak (see the link below).
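A standard statistical caveat here (mine, not necessarily the review’s) is that raw agreement ignores the agreement two inspectors would reach by chance alone, which is considerable when nearly every school ‘remains good’. Cohen’s kappa corrects for this. Taking the 22-of-24 agreement quoted earlier, with a hypothetical split of the cases (the headline figure doesn’t give the full breakdown):

```python
# Cohen's kappa: agreement corrected for the agreement two raters would
# reach by chance. The 2x2 counts below are hypothetical; only the
# 22-of-24 total agreement is taken from the published figure.

def cohens_kappa(both_good, a_only_good, b_only_good, both_convert):
    n = both_good + a_only_good + b_only_good + both_convert
    p_observed = (both_good + both_convert) / n
    # Marginal proportion of 'good' calls for each inspector.
    a_good = (both_good + a_only_good) / n
    b_good = (both_good + b_only_good) / n
    p_chance = a_good * b_good + (1 - a_good) * (1 - b_good)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical split: 21 'both good', 1 'both convert', 2 disagreements.
print(cohens_kappa(21, 1, 1, 1))  # ~0.45
```

On that (invented) split, a raw agreement of 0.92 shrinks to a kappa of roughly 0.45, conventionally only ‘moderate’ agreement.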
Much progress is therefore unlikely to be made until there is a clear definition of what reliability means in relation to the complex evaluations made during Ofsted inspections.
https://www.researchgate.net/publication/327894743_A_review_of_Ofsted's_test_of_the_reliability_of_short_inspections