This is the 20th and final post in my series on the Top 20 Principles From Psychology for Teaching and Learning and the third of three posts examining how to assess students’ progress: “Making sense of assessment data depends on clear, appropriate, and fair interpretation.”
“I wish we had more assessment data!” said no sane school leader ever. We’re awash with data produced by oceans of assessment. As with so much else in life, the having of a thing is not its purpose. Analysing spreadsheets and graphs becomes like gazing, dumbly, into a crystal ball. We need to know how to interpret what these data tell us. And, perhaps more importantly, we need to know what they can’t tell us. We need to know how to interpret data clearly, appropriately and fairly.
As the Top 20 report tells us, “Scores from any assessment should generally be used only for the specific purposes for which they were designed.” Easy to say, and you might well be nodding along, but look at what we routinely do in practice. What’s the purpose of Key Stage 2 tests? Is it to assess the extent to which the Key Stage 2 curriculum has been covered? Is it to infer how much progress students have made across a Key Stage? Or is it perhaps to see if schools are ‘coasting’? How then should we use the data produced? Can we use it to decide how much to pay teachers based on their students’ performance? Can we use it to decide how to set students in secondary school? Or maybe we could use it to predict students’ GCSE results at the end of Key Stage 4?
Here’s a list of a few things assessment data might be expected to do:
- Set or group pupils according to ability
- Diagnose what individual pupils know
- Make promotion or retention decisions
- Share information with parents or government
- Measure the effectiveness of instruction (or learning)
But expecting the same test to produce data fit for all these purposes at once is unfair and inappropriate.
The point is, the way we interpret data warps and distorts not only the assessment process, but every other aspect of education it touches. Few tests are bad in and of themselves; it’s what’s done with the data that determines whether they are useful.
Samuel Messick argued that “Test validation is a process of inquiry into the adequacy and appropriateness of interpretations and actions based on test scores” and designed a framework to help us consider how we ought to inquire into the validity of tests. The framework crosses two bases of validity (evidential and consequential) with two things we do with a test score (interpret it and use it), giving the four cells worked through below:
Let’s look at how it works in practice. Messick suggested that one of the consequences of how tests are interpreted is that those aspects of a domain which are explicitly assessed are judged to be of more value than those which are not. Teachers are busy people. When push comes to shove we only have time to teach what’s assessed, and so students may not even be aware that there are aspects of a subject which may in the past have been considered important.
Grammar is an interesting case in point. When I studied English in school, grammar was not assessed and so it wasn’t taught. I had no awareness of most of the meta-language used to describe and think about the relationships between words, with the result that my understanding of the subject was impoverished. Now, as the pendulum swings, we’ve decided not to place any value on our assessment of speaking and listening. What will be the consequences?
A – Arguably, if we don’t assess speaking and listening, the domain of English will not be adequately represented. But parents’ or employers’ interpretation of students’ performance in English may assume that the test does, in fact, cover speaking and listening, and students’ ability in this area may be judged unfairly. The construct of ‘English’ becomes less valid.
B – If we omit speaking & listening from GCSE grades, does this reduce the ability of the assessment to predict a student’s likely success in further study or future employment? This is the evidential basis of result use – we use the GCSE result to provide evidence of something which it is not actually measuring.
C – The consequential basis of result interpretation is fairly obvious: if we leave out speaking & listening then we send the message that if it’s not worth assessing it’s not worth studying. Students will be less likely to see the point in activities which involve such skills and, maybe, become less skilled in these areas. Is this what we want?
D – The consequential basis of result use has social implications. I’m of a generation of English teachers who valued speaking and listening because it formed 20% of a student’s GCSE grade. We committed curriculum time to teaching and assessing the skills covered by the specification. Some of us may decide that this is so important that we will continue to do so despite the changes in assessment, but I doubt it. And future English teachers will be unaware that these were even skills we once valued.
This is all very well, but teachers rarely get a lot of choice about how the data they produce is consumed. The Top 20 report would have it that “Effective teaching depends heavily on teachers being informed consumers of educational research, effective interpreters of data for classroom use, and good communicators with students and their families about assessment data and decisions that affect students.” We are urged to “weigh curriculum and assessment choices to evaluate whether those resources are supported by research evidence and are suitable for use with diverse learners.” But what if we decide the curricula and assessments we’re judged against don’t meet these lofty criteria? We can’t just opt out, can we?
Whilst it’s certainly true that teachers need to know how to interpret assessments effectively, it’s even more crucial that school leaders are assessment savvy. It’s always been the case that the outcome of an inspection is largely dependent on assessment outcomes, but if you know how to interpret data then you can make it sing.
The report suggests that an effective teacher or school leader ought to be able to answer these questions:
- What was the assessment intended to measure? This is harder to answer than you might think. You may have an intuition about what a test is supposed to be assessing, but sometimes it’s helpful to go to the horse’s mouth and ask.
- What comparisons are the assessment data based on? It’s useful to know how assessments are referenced. Are students being compared to one another (norm- or cohort-referenced)? Or are students’ responses being directly compared to samples of acceptable and unacceptable responses that the teacher or others have provided (criterion-referenced)? Why is this important? As Dylan Wiliam says, “Norm-referenced assessments (more precisely, norm-referenced inferences arising from assessments) disguise the basis on which the assessment is made, while criterion-referenced assessments, by specifying the assessment outcomes precisely, create an incentive for ‘teaching to the test’ in ‘high-stakes’ settings.”
- What are the criteria for cut-points or standards? As we saw in Principle 19, how results are reported matters. Raw percentages result in spurious precision, while grades or levels result in spurious accuracy; the sketch after this list illustrates both. If we don’t know about these things, we will be unable to interpret results meaningfully.
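To make those two bullet points concrete, here’s a minimal sketch in Python. The pupil names, scores, the cut-point of 60 and the grade boundaries are all invented for illustration; the point is simply that the same raw marks support quite different inferences depending on whether we reference them to the cohort, to a fixed criterion, or to grade boundaries.

```python
# A minimal sketch (hypothetical scores and cut-points, not from any real
# assessment) of how the same raw marks yield different inferences
# depending on how they are referenced and where grade boundaries fall.

raw_scores = {"Aisha": 61, "Ben": 59, "Carla": 74, "Dev": 45, "Ella": 60}

def percentile_rank(score, all_scores):
    """Norm-referenced: where a pupil sits relative to the cohort."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

def meets_standard(score, cut_point=60):
    """Criterion-referenced: has the pupil cleared a fixed cut score?"""
    return score >= cut_point

def to_grade(score, boundaries=((70, "A"), (60, "B"), (50, "C"), (0, "D"))):
    """Grade boundaries: one mark either side of 60 becomes a whole grade."""
    for boundary, grade in boundaries:
        if score >= boundary:
            return grade

cohort = list(raw_scores.values())
for name, score in raw_scores.items():
    print(f"{name:5} raw={score:3} "
          f"percentile={percentile_rank(score, cohort):5.1f} "
          f"meets standard={meets_standard(score)!s:5} "
          f"grade={to_grade(score)}")

# Ben (59) and Ella (60) differ by one mark but fall either side of the
# cut-point and the B/C boundary: the 'precision' of the raw mark and the
# 'accuracy' of the grade both overstate what we actually know.
```

Ben and Ella are separated by a single mark, yet one ‘meets the standard’ and the other doesn’t, and they receive different grades; the percentile rank, meanwhile, tells us where each sits in the cohort but nothing about what either can actually do.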
We must also remain mindful of the suitability of assessment data “for addressing specific questions about students or educational programs, their appropriateness for individuals from a variety of different backgrounds and educational circumstances”. However we assess students, the assessment will have both intended and unintended consequences. Whether a test is high or low-stakes will fundamentally affect the way students, and teachers, prepare for and perform on it. It’s important that we keep this in mind when interpreting the data such tests produce. As a general principle, the higher the stakes of the assessment, the less likely it is to produce unpolluted information. Any decision of consequence should be made using as many different sources of information as it’s reasonable to collate.
References cited by the report
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing
- American Psychological Association. (n.d.). Appropriate use of high-stakes testing in our nation’s schools
I also found this paper by Dylan Wiliam very useful and plundered all the stuff about Messick from it.
Although Wiliam gets much citation, quotation and respect, his association with the discredited effect size attributed to formative assessment practices in the classroom (Bennet Educational Testing Services, 2009, http://www.iaea.info/documents/paper_4d5260ae.pdf) notches down his cred. Living in Manitoba, Canada, and working as a learning assessment specialist, I can tell you that Wiliam was seen as a kind of saviour and quoted far and wide at conferences for a study based on a meta-analysis that made claims to huge gains in learning. The problem is that no such study exists. I don’t know that Wiliam has tried hard enough to get the message out that the effect sizes traced back to his publications in 1998 are not from any data analysis.
Wiliam’s assertions became the stuff of ‘urban myth’ and did a disservice.
Hi SheriO – I share your scepticism about the effect size but think you might be doing Dylan a disservice. Here’s what he’s actually said publicly: https://www.learningspy.co.uk/myths/things-know-effect-sizes/
Anyone can make a mistake and everyone should be respected for admitting error.
I don’t see Wiliam clarifying his writing in Phi Beta in your link with regard to the effect size in a meta-analysis. Suffice it to say an ‘urban legend’ sprang up (Bennet, 2007) around the promised effect size, and Wiliam’s claims penetrated deep into decisions about spending and time in Manitoba and Alberta.
A charlatan does a world of good sometimes. Stirring excitement and then realizing the claims will not pan out brings out a wiser consumer of educational claims. I like the linking of meta-analysis studies to sub-prime mortgages… be wary.
Wiliam did a world of good, though, for bringing attention to formative assessment, and came through the false claims virtually unscathed, maybe due to the unfortunate cordoning off of discussions about academic rigour into academic journals. Open access, educational journalism and lay scholars, such as you and me, could tip off the unsuspecting masses waiting to buy the next batch of snake oil.
We all make mistakes – uncharitable to think these are deliberate efforts to deceive. Top tip: avoid certainty.