No.
After my last blog on how to get assessment wrong, various readers got in touch to say, OK smart arse, what should we do?
Well, I’m afraid the bad news is that we’ll never get assessment right. Or at least, it’s impossible for assessment to give us anything like perfect information on students’ progress or learning. We can design tests to give us pretty good information about students’ mastery of a domain, but as Amanda Spielman, chair of Ofqual, said at researchED in September, the best we can ever expect from GCSEs is to narrow student achievement down to + or – one whole grade! We could reduce this huge margin of error, but only if we were willing for students to sit 114-hour exams.
As Daniel Koretz makes clear in Measuring Up, a test can only ever sample what a student is able to do. And while psychometricians work very hard to design tests which provide as representative a sample as they can, they are confronted by seemingly insurmountable confounding factors, particularly the wording of questions and rubrics, and the attitudes of test takers and teachers.
So, we can never design a perfect assessment system. Of course we can (and should) try, but we need to know we’re using a metaphor to map a mystery. The mystery is the imponderable complexity of what’s going on in students’ brains; the metaphor is our feeble attempts to define and describe what learning looks like.
So, with this in mind, can we design assessment systems that are at least useful?
Again, that depends. It mainly depends on what we intend to use assessment for. For all their faults, national tests are useful in that they give a broadly understandable and comparable approximation of a student’s effort and ability to master a domain of knowledge. This is a necessarily blunt measure, but we all agree to ignore all but the grossest errors because it’s the best we can do. We might not like ’em, but exams are a pretty reliable proxy for how well a student has done at school. To that end, SATs, GCSEs and A levels are flawed but workable means of comparing how schools are performing.
But we also want to use assessment systems for other, more formative purposes. We use them to report progress to parents, share information with students and inform decisions about teaching and curriculum design. Can we design assessment systems that help us do these things?
We can, but they’re not going to be very good. The only real choice we get to make is how bad they’re likely to be. One way to go is to follow in the footsteps of Trinity Academy’s Mastery Pathways. This is a system of assessment inextricably intertwined with curriculum and teaching, in which basic foundational skills and knowledge are taught first and then repeated until students have proved they have mastered them. I can really see how this works in a subject like mathematics: students learn an elementary foundation, take a test, then either repeat the stage or progress to slightly more advanced content which depends on mastery of the previous stage.
So, after the first block of content in Year 7 – Elementary 1 – students have to prove they can:
- Multiply and divide numbers by 10, 100 and 1000 including decimals.
- Recognise and use square and cube notation.
- Long multiplication of up to 4-digit numbers by 1- or 2-digit numbers.
- Short division, up to 4 digits by 1 digit (including remainders).
- Short division, up to 4 digits by 2 digits where appropriate.
- Understand different ways to represent remainders.
- Recognise mixed numbers and improper fractions.
- Convert between mixed numbers and improper fractions.
- Read, write, order and compare numbers up to three decimal places.
- Read and write decimal numbers as fractions.
- Use common factors to simplify fractions; use common multiples to express fractions in the same denomination.
- Multiply proper and improper fractions by an integer.
- Round a number with two decimal places to a whole number and 1 decimal place.
- Work with co-ordinates in all 4 quadrants.
- Express one quantity as a fraction of another, where the fraction is less than one or greater than one.
Once they’ve proved they have mastered this, they move on to Elementary 2. And so on, all the way to Advanced 3. Each stage is a predefined programme of study which students either pass or fail. This provides real clarity about what students know and don’t know at a given point in time, and it can be used to identify students who aren’t making progress. They’ve done something similar for science and, apparently, English. I’ve not been able to look at any of their English resources and can’t really understand how it might work for anything other than grammar. Clearly, you wouldn’t want students repeating predefined stages if the content is a body of literature. I’d be very interested to see how they get around this.
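For what it’s worth, the progression mechanic itself is simple enough to sketch in a few lines of code. This is only an illustration, not Trinity’s actual scheme: the intermediate stage names, the pass mark and the test scores are all invented.

```python
# A minimal sketch of mastery-pathway progression: each stage is a
# predefined block of content which a student either passes or repeats.
# Only Elementary 1, Elementary 2 and Advanced 3 are named in the post;
# the middle stages, pass mark and scores below are hypothetical.

STAGES = [
    "Elementary 1", "Elementary 2", "Elementary 3",
    "Intermediate 1", "Intermediate 2", "Intermediate 3",
    "Advanced 1", "Advanced 2", "Advanced 3",
]
PASS_MARK = 0.8  # assumed mastery threshold

def next_stage(current: str, score: float) -> str:
    """Progress to the next stage only if the mastery test is passed;
    otherwise the student repeats the current stage."""
    i = STAGES.index(current)
    if score >= PASS_MARK and i + 1 < len(STAGES):
        return STAGES[i + 1]
    return current  # repeat until mastered (or already at the final stage)

# Example: a student passes Elementary 1, fails Elementary 2 once,
# then passes it on the second attempt.
stage = "Elementary 1"
for score in (0.85, 0.65, 0.90):
    stage = next_stage(stage, score)
    print(stage)  # Elementary 2, Elementary 2, Elementary 3
```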
One way around this is to follow Michaela’s model, where students follow a sequenced curriculum in which English and humanities support each other by beginning in the ancient world, with each subsequent topic building on the knowledge already gained. In this model students take weekly knowledge tests where the aim is to achieve 100%. Progression is determined by whether or not students have retained lesson content. Take a look at Joe Kirby’s blog for more details.
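The mechanics of that weekly-quiz model are also easy to sketch. Here’s a hedged illustration with invented questions and results; the point is that anything short of 100% flags content to re-teach rather than a licence to move on.

```python
# A sketch of the weekly knowledge-test model: the target is 100%, so
# anything a student gets wrong is flagged for re-teaching before they
# progress. Question labels and results are invented for illustration.

weekly_results = {
    "week 1": {"q1": True, "q2": True, "q3": False},
    "week 2": {"q1": True, "q2": True, "q3": True},
}

for week, answers in weekly_results.items():
    gaps = [q for q, correct in answers.items() if not correct]
    if gaps:
        print(f"{week}: re-teach {', '.join(gaps)} before progressing")
    else:
        print(f"{week}: 100% - progress to the next topic")
```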
Another way to go is to use threshold concepts in an attempt to anticipate where students are likely to get stuck and to help guide them through liminal space. Here’s my example of a Key Stage 4 English curriculum designed along these lines. This is all very well, but how can you ever hope to assess how students are progressing through these thresholds? My current answer is to describe what the journey might look like. This is a path fraught with danger: Daisy Christodoulou makes clear that performance descriptors are, in the main, a nonsense. As she says,
The problem is not a minor technical one that can be solved by better descriptor drafting, or more creative and thoughtful use of a thesaurus. It is a fundamental flaw. I worry when I see people poring over dictionaries trying to find the precise word that denotes performance in between ‘effective’ and ‘original’. You might find the word, but it won’t deliver the precision you want from it. Similarly, the words ‘emerging’, ‘expected’ and ‘exceeding’ might seem like they offer clear and precise definitions, but in practice, they won’t.
And of course, she’s right. You can never hope for precision from performance descriptors, but then precision will always be impossible to achieve. Maybe what you can get from performance descriptors is narrative and meaning. Here’s the metaphor I’ve been using to map the mysteries of learning English as an academic subject:
Of course, this is not precise. Although some care has been put into making the descriptors as meaningful as possible, it gives us only the vaguest hints at what students might be able to do as they ‘progress’ through the six thresholds. What it might provide, though, is a means of discussing what it means to be stuck, and of describing what students might be able to do in the future. It has been designed with a particular curriculum in mind and so should not be taken as something able to stand alone; even so, it should be seen not so much as a map as a travel guide, pointing out interesting sights and potential places of interest along the way. My idea is that students can be given a ‘performance graph’ at various points within the curriculum to provide a snapshot of some of the things they can do now. Maybe over three different assessment points you might get something like this:
If you prefer, we could try to show all this in one table. These graphs might be useful for students and parents as a starting point in a conversation about how the journey seems to be going. As long as you prevent yourself from being fooled into thinking they provide proof of anything, or that you can ever be certain about what a student is learning, I think this system ‘works’.
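For anyone minded to try this, here is a rough sketch of how such snapshots might be generated. It is only an illustration: the six thresholds are labelled generically and the judgements are invented. It simply plots one panel per assessment point.

```python
# A sketch of the 'performance graph' idea: a snapshot of descriptor
# judgements at three assessment points. Threshold names, the 0-4
# judgement scale and the scores are all placeholders.
import matplotlib.pyplot as plt

thresholds = ["T1", "T2", "T3", "T4", "T5", "T6"]
snapshots = {                       # hypothetical judgements per point
    "Autumn": [2, 1, 1, 0, 0, 0],
    "Spring": [3, 2, 1, 1, 0, 0],
    "Summer": [3, 3, 2, 1, 1, 0],
}

fig, axes = plt.subplots(1, len(snapshots), sharey=True, figsize=(9, 3))
for ax, (point, scores) in zip(axes, snapshots.items()):
    ax.bar(thresholds, scores)      # one bar per threshold
    ax.set_title(point)
    ax.set_ylim(0, 4)
axes[0].set_ylabel("descriptor reached")
plt.tight_layout()
plt.show()
```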
To sum up, I think there are three main points I’d like to make about assessment:
- It will always be imprecise.
- Assessment should never be a substitute for the actual work students produce.
- Assessment should never be used to make judgements on anything as complex as learning or progress, and should only be used to make judgements on achievement or attainment when left to the experts at the end of Key Stages.
Worth looking up, if you haven’t seen it: the work done by Alex Ford on assessment/progression in history, using historical second-order concepts – a lot of shared territory re: threshold concepts, but it removes the idea that progression through them is linear.
Ah, thank you. Do you have any links?
http://www.andallthat.co.uk/blog/progression-in-historical-thinking-updated
Love these articles, David; can’t wait for #wrongbook.
If we accept the Hirsch/Willingham argument that comprehension is dependent on knowledge, I wonder if the only solution is to abandon attempts to find a single level/grade/metric to describe attainment in reading, and assess on two levels: data & ‘metaphors’.
We could restrict the D-word to single-word reading tests, or multiple-choice ‘spot the metaphor’ tests (I can’t help but wonder if one of the logical implications of the Hirsch/Willingham school is that we should really be reporting ‘no. of books your child read this academic year’, too).
Any assessment of comprehension using your prose criteria could be recorded, tracked and discussed as a metaphor, a one-shot inference. No more, no less. No SLT members saying ‘I see Hiba’s metaphor has gone down 6 months since you’ve been teaching her…’
To my mind, a system clearly dividing reading assessment into data & metaphors would satisfy the complaint DC raised about any prose descriptors falling foul of the ‘adverb problem’: here’s what we know; this is what we suspect.
Admittedly, this adds a degree of ambiguity and complexity to reporting/accountability, but wouldn’t parents accept it if we said ‘Well, we can say with some certainty that your son has a single-word reading age of 12 years 2 months. We believe he’s read 13 books since September, and we can say with some degree of confidence that he has improved at explaining how and why authors use figurative language. Now we need to get him reading something other than football magazines…’
Hi David, thanks again for a really interesting blog. I was wondering: with the labels left out of the assessment grid, and no ‘levels’ to be read off the graph, how do you provide students, teachers, parents & schools with the data each wants for differing reasons (e.g. a student moving to the next year group) without teachers having to spend considerable time analysing the metaphor (in this case, the graph)?
Without “considerable time on analysis”, any judgements will necessarily be poor. We might want data to perform all sorts of functions, but the truth is, it can only ever give us the illusion of what we want. The message is: resist the lure of easy certainty; it is always false.
Thank you, yes, I would certainly agree, and I recognise that change within institutions is a difficult battle regardless of what you’re changing. I do wonder, though, how such a shift in attitudes can be made without smaller steps in between. Within my own context, for instance, the trend has been to ‘stick to what we know’ and simply adapt levels to new objectives. Even being given the opportunity to make small changes has seemingly been met with resistance.
You’re right – most schools will ‘stick with what they know’. This can only be viewed as a missed opportunity.
It’s a shame, but I think you’re right. I suppose the task for those of us who wish for something different is to find ways within our own contexts to make it so. Thanks for your responses, much appreciated.
First, I don’t think looking for a perfect assessment system is a good starting point. An assessment system will be an optimised compromise between the information it provides on learning outcomes and the cost of obtaining it.

Take the computing baseline testing as a concrete example: 50 multiple-choice questions to find out what children know and do not know about computing, with follow-up tests to gauge progress. It’s free to all schools and feeds back statistical information (https://theingots.org/community/baseline_test_statistics) confidentially to the school, contextualised in national data (nearly 70,000 tests and 3 million data points). Now, we could say MCQs are not perfect – they aren’t – but the real issue is whether it is worth doing this or not. It takes up a lesson every 6 months, so there is a cost even though the service itself is free. Information such as the great majority of Y7 not knowing what the word ‘integer’ means is useful; so is knowing that boys and girls start off performing identically, or that Y7 do better on some questions than Y10. It’s not perfect information, but it is still useful information for a teacher who wants to use evidence from outside their own classroom as a basis for improving teaching.

We can arbitrarily map test scores to levels, or we can use formative assessment matching criteria to performance – again, there are free web tools to do this and make management of marking easier. In the end, the teacher can decide how much or how little of this optimises the assessment cost-benefit equation, and perfection needs to be replaced with best fit and good enough for the desired outcome.
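To make the item-level point concrete, here is a minimal sketch of the sort of analysis involved: the ‘facility’ of each question, i.e. the proportion of students answering it correctly. The questions and responses below are invented for illustration, not real baseline data.

```python
# Compute the facility (proportion correct) of each MCQ item - the kind
# of statistic that reveals, say, that most Y7s don't know what
# 'integer' means. All data here is hypothetical.

responses = [   # one dict per student: question -> answered correctly?
    {"integer": False, "binary": True,  "loop": True},
    {"integer": False, "binary": True,  "loop": False},
    {"integer": True,  "binary": False, "loop": True},
]

for q in responses[0]:
    facility = sum(r[q] for r in responses) / len(responses)
    print(f"{q}: {facility:.0%} correct")
```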
Thanks Ian – as I said in the post – MCQs are one way to successfully and meaningfully assess what students can do. Have a look at Joe’s blogs on what they do at Michaela.
We can’t reliably infer learning from this, but it’s a good starting point.
There is plenty of robust research evidence for recognition as a motivator – mostly in business but I doubt that means it is somehow different for children. http://www.forbes.com/sites/dailymuse/2013/03/19/the-secret-to-motivating-your-team/
I would say displaying examples of good work from day-to-day efforts is far more valuable than doing things specially for display (there will be exceptions). And it is less work, because it is just part of what is being done anyway. It’s the same principle as we at TLM use for coursework for Tech Awards in IT: get the evidence from general day-to-day operations, because it minimises the diversion of teachers from teaching and is a better indicator of practical competence.
I agree with your views on this, David, and lately I have begun to doubt the point of this type of assessment (to indicate attainment/progress, if you like) entirely. I see two problems: the abiding but misguided belief in the numbers, once they are generated, and a passive, unquestioning acceptance of the ‘criteria’, once these are written in black ink. Add to that a complete lack of consensus on meanings and gradings, and it’s all rather frustrating and depressing. I have to confess to a growing hatred for the use of ‘teacher assessment’ as a catch-all panacea whenever it’s obviously too difficult to design something more effective.
It’s down to teacher professionalism to understand the limitations of assessment and apply it accordingly. Probably it is impossible to design an assessment system better than a good and well-informed teacher’s observations. We can just provide tools to make it easier to do that better and more consistently.
Your thoughts seem more about politics than assessment. Teachers assess effectively all over the place; we moderate work from hundreds of them and it is generally of a good standard – assessment that is good enough for what it is intended to do, and there are quite a few different purposes to consider. Part of the professional competence of a teacher is to be able to establish priorities, like a doctor making a diagnosis.
Understanding the limitations of assessment and applying accordingly is very, very rare indeed, not just among teachers but much more worryingly among school leaders who are seduced by data. I don’t think there is a teacher ‘good and well-informed’ enough to overcome the unreliability of ‘teacher assessment’ nor deal with the detrimental influence of the high stakes uses of the data.
[…] have read David Didau’s recent two posts on assessment – here and here – with interest. David is rightly sceptical about the efficacy of assessment rubrics and has […]
Nice blog – (grade) descriptors will never be perfect – but they should try to respond to the statement: ‘this is what I think they can do given the evidence I have seen’…
Is there any room for computing to be involved?
I mean, if learners somehow did their work on computers, it would be possible for the computer to track all sorts of information at the same time, and then we could produce a snapshot of the learner at that point in time.
There would be no need for assessments in that case, as the computer would monitor everything the learner does and factor it into the snapshot.
I know it seems a bit pie in the sky, and I’m not suggesting this could be done tomorrow, but from a theoretical view, is it even worth exploring?
[…] response, David Didau wrote this post, in which he agreed with a lot of the things I said. I was pleased by this, because I really admire […]
There is no difficulty in setting 114 hours of assessment – if your assessment is embedded within the classroom and this routine, tracked practice forms the fundamental part of your routine teaching. And why should it not? Practice is how we learn.
I don’t think the problem with assessment is the “confounding factors”, in the strict sense of the term, but the unreliability of the assessment process. And this is precisely why you need the 114 hours: to compensate for that unreliability by performing repeated sampling and analysing the reliability of that sampling through analytics which show the correlation (or lack of correlation) between your samples. This “consistency of results” is what reliability, when applied to assessment, *is*.
So we can never design a perfect assessment system – but we can get damn close, as statistical confidence tends towards 100% with repeated sampling.
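The arithmetic behind this claim is easy to demonstrate. A hedged simulation, with an invented ‘true score’ and noise level, shows the uncertainty of the estimate shrinking roughly as one over the square root of the number of samples:

```python
# Simulate repeated sampling of a noisy assessment. The 'true score'
# and the per-test noise are assumptions; the point is only that the
# standard error of the mean falls as 1/sqrt(n).
import random
import statistics

random.seed(1)
TRUE_SCORE, NOISE = 62.0, 8.0   # assumed: marks out of 100, SD per test

for n in (1, 4, 16, 64):
    samples = [random.gauss(TRUE_SCORE, NOISE) for _ in range(n)]
    mean = statistics.mean(samples)
    sem = NOISE / n ** 0.5       # standard error of the mean
    print(f"n={n:2d}: estimate {mean:5.1f}, 95% CI +/- {1.96 * sem:4.1f}")
```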
It is true that assessment will always be imprecise to some degree, but that level of imprecision is quantifiable. All assessment should therefore be accompanied by a statistical confidence level. The most significant thing that Amanda Spielman said (you will remember I was sitting just in front of you, and I think you felt the same, as you asked a shocked question on this point) is that the only reason current assessments are not accompanied by such confidence ratings is that it would be politically unacceptable to parents for little Johnny to come out of school with a piece of paper that said he had got a C in Maths but no one was really sure this was deserved. In other words, we are in a Catch-22: our assessment system is so unreliable that we cannot afford to take the remedial action required, because that would mean admitting how unreliable it is, which would cause a scandal. So we keep the can of worms unopened.
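To show what such a confidence rating might look like, here is a sketch using the classical-test-theory standard error of measurement, SEM = SD × √(1 − reliability). The reliability and standard deviation figures are plausible-looking assumptions for illustration, not real exam statistics.

```python
# Attach a confidence interval to a reported mark using the
# classical-test-theory SEM. The sd and reliability values below are
# invented, not taken from any actual qualification.

def score_interval(score: float, sd: float, reliability: float,
                   z: float = 1.96) -> tuple[float, float]:
    """95% interval (by default) around a mark, given test reliability."""
    sem = sd * (1 - reliability) ** 0.5
    return score - z * sem, score + z * sem

low, high = score_interval(score=58, sd=15, reliability=0.90)
print(f"Reported mark 58; the 'true' mark plausibly lies "
      f"between {low:.0f} and {high:.0f}")
```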
In my view, you take this in the wrong direction.
First, I don’t understand the significance of your point 2. The work that students produce is meaningless unless you assess it. All that e.g. student portfolios do is to delegate the job of assessment to the person looking at the portfolio, which might be a useful model in some circumstances (e.g. when the subject matter is very subjective, as in art) but which is problematic in most circumstances.
Second, you say “leave it to the experts” but we know that the experts get it wrong, mainly because they do not have access to enough samples. The guarantor of reliability is not assessment expertise but statistically demonstrated reliability through repetition. This can only be achieved by embedding assessment within the teaching cycle.
The only way this will be achieved, in my view, will be through digitally mediated and automatically tracked practice.
Ah yes – that means I agree with Alex!
[…] into something workable. Daisy Christodolou has interesting thoughts here and David Didau has some here. I’m very fond of the ‘hornets/butterflies’ metaphor that Joe Kirby discusses […]
[…] unpicking why children do or don’t make progress is hard. As David Didau says: we’re using a metaphor to map a mystery whenever we make such […]
[…] instance, I wrote some months back about how I thought we might get assessment right in a post-levels world. I shared a system I had developed which some schools have adopted and of which I felt rather […]
Hello David. I really enjoyed reading this blog. Do you happen to have a reference point for the Amanda Spielman quote? Was the researchED presentation you refer to recorded at all? Thanks very much.