Tuesday, 24 February 2015

On single malt and mastery

You may be surprised to hear that I don't know a great deal about whisky making. OK, maybe not, but I'd wager that if you were in the business of ageing a 12-year-old single malt, you probably wouldn't expect to notice much difference if you tasted it every 6 weeks. You might think you can, but that's probably just because you'd made up your mind before tasting it. 

In recent weeks I've blogged about the dubious approaches of certain tracking systems towards assessment without levels; how they are attempting to shoehorn the new curriculum into the old methodology of points scores whilst trying to convince us they've done something new. I've also written about the pressure many primary schools are under - particularly the more vulnerable ones - to continue quantifying small steps of progress in 'APS'; to 'prove' that pupils are making 'good' progress over short periods of time even though such measures are meaningless, irrelevant, and at odds with the new curriculum. 

My big fear is that many people involved in school improvement are still defining 'good' progress in terms of extension in a curriculum that focuses on depth; and that some tracking systems, with their adherence to the old orthodoxy of points - albeit with a bit of window dressing - are not exactly helping schools escape the gravitational pull of levels.

Senior leaders are understandably anxious. When I discuss these issues in schools - my main topic of conversation these days - someone will always say:

"But we have to show progress"

This tells us a lot about the primary role of data in many schools. It exists not so much to inform teachers and senior leaders about pupils' learning; but rather for the purposes of accountability. A defence shield against scrutiny and judgement. The data is becoming disconnected from the classroom as a consequence and, worse still, may not mean anything at all. 

Eventually, after some debate, we move away from "but we have to show progress" and arrive at the inevitable question:

"So, how do we show progress?"

My original answer of "I don't know", which was usually met with howls of derision and gnashing of teeth, has now been replaced by the slightly more definitive "I'm not sure we can". This needs qualifying. I'm not saying we can't measure progress; I'm saying we can't measure progress in the way we've become accustomed to. A curriculum in which most pupils are expected to learn at broadly the same pace, and where extension into the next year's content is the exception rather than the rule, limits our options for measuring progress in terms of coverage. The depth of pupils' understanding - the degree to which they can use and apply what they've learnt - is the key now, but it's not something that is easily measured. And assigning values to such an abstract concept is most likely a fallacy, especially if we begin to try to show changes in depth of understanding over short periods just because "we have to show progress".

Essentially, we must resist the temptation to quantify something that can't be quantified just to satisfy demands based on some obsolete notion of progress. When the data is without foundation and is at odds with reality, it is akin to knowingly navigating with an out-of-date map and still expecting to arrive at your destination.

To put it bluntly, don't make it up.

So, let's return to the original question: "How do we show progress?"

I've touched on tracking systems above, and blogged on the subject extensively. There are clearly issues with certain approaches and schools will no doubt be finding these out for themselves now that we are halfway through the year. I found myself discussing one such approach with a headteacher I met in the street recently and after trying to recall and explain its complexities for 10 minutes or so I had a slight out-of-body experience. To anyone listening in, it must have sounded like we were talking total nonsense. 

Surely what we want is simplicity. If levels, and their sublevel and APS offspring, are too complicated, then whatever replaces them needs to be more straightforward. Most systems are settling on an approach involving 3 broad steps of learning, often with further subdivisions. Michael Tidd's key objectives model sticks to 3 simple categories of 'below', 'meeting', and 'exceeding'. Target Tracker's system involves 6 steps across the year, from 'below'/'below+', through 'working within'/'working within+', to 'secure'/'secure+'. OTrack have opted for a 7-point scale, which includes 3 steps for 'working towards' and a final category of 'exceeding' that sits above 'mastery' (which makes sense to me). Focus Education have categories of 'emerging', 'expected', and 'exceeding', each subdivided into 3 steps to give a 9-point scale, the 9th point being reserved for those pupils who have shown mastery of all main objectives and completed all extension activities. Insight Tracking, who aren't tied to any particular assessment model, have developed a flexible system that accepts pretty much any metric. 

Another system with a 6-point scale, similar to that employed by Target Tracker, is Chris Quigley's Milestones. There is, however, one crucial difference: the Chris Quigley model assesses a pupil's depth of understanding across a 2-year period, the expectation being that the pupil reaches point 4 by the 2-year milestone, with points 5 and 6 indicative of mastery.

Such a system, which eschews attempts to quantify small steps of progress in favour of a long-term mastery approach, is daunting to some, especially those who "have to show progress" since the start of term, but that's what makes it compelling. 

So, ask yourself this: who are you assessing for? Does your data actually tell you anything about your pupils' learning? Or is it for another purpose?

It's time to entertain the possibility that maybe it's just not possible to quantify small steps of progress. And if that's the case, why bother continuing to do so? The sooner we move away from this deeply entrenched progress fallacy, the sooner we can start to assess in a meaningful way, one that actually informs us about pupils' learning.

This is, after all, why levels were removed in the first place.

Wasn't it?

Tuesday, 10 February 2015

50 shades of blue and green

I have a confession: I am not a statistician.

There, I said it.

(Feels good to get that out in the open.)

I actually have a Ph.D. in geology. I'm not bragging; it's more of an admission. As anyone who knows anything about scientists will testify, we geologists don't live on the brightest side of the academic village. I spent 4 years wandering around remote, rocky outcrops of Scotland and Newfoundland with only a sledgehammer for company. That sledgehammer did come in handy during one particularly awkward encounter on a Channel Island headland, but I digress.

My wife is a statistician. She has a degree in statistics and teaches GCSE and A-Level Maths. The rest of this blog is actually written by her.

I'm kidding.

But it probably should be.

So, I thought I'd have a break from ranting about levels; and new levels that aren't supposed to be levels but are really; and tracking systems that don't work; and people who don't really understand how assessment has changed; and have a little look at statistical significance.

I am now standing slightly outside my comfort zone. 

Or should I say my confidence interval?

One of the things I regularly do in my work with schools is isolate the issue. By this I mean I remove the 'causes' of statistical significance from the data set. My aim is to turn the eye of Sauron away from the data as a whole, towards individual pupils who have, for whatever reason, underperformed; to demonstrate that the underperformance of just a few pupils can have an extreme impact on school data. Prepare some case studies on those few pupils and you are on more solid ground.

Obviously it swings both ways and high performance of a few pupils can have a positive effect. 


I recently did some work with a very large junior school in London. The school was RI, due an inspection, and had the dreaded blue boxes for VA. This is a school with over 100 pupils in a cohort, but even so, recalculating the VA with just one 'underperforming' pupil removed was enough to push the overall VA score above the significance threshold. No more blue box. Just 1 pupil out of 100 - that's all it took. And this was a pupil with a very interesting case study. 

If you want to visualise statistical significance and confidence intervals, as used in RAISE, then turn to the first VA page in the RAISE report. You will see the VA score, e.g. 99.3, and a confidence interval of say +/- 0.6. Add that number onto the VA score (99.3 + 0.6) and if the result is still below 100 then the cohort is deemed to be significantly below average, as it is in the example above. If the result is greater than or equal to 100 then it's below but not significantly below. At the other end, when the score is above 100, subtract the confidence interval. So, for a score of 100.7 with a confidence interval of 0.6, the lower result (100.7 - 0.6) will be above 100 and therefore significantly above. If the original VA score was 100.6 then it won't be. A difference of 0.1 is all it takes to be 'significant'. The dreaded blue box or the celebrated green one.

1/10th of a point. 

1/20th of a sublevel.

1 pupil having a bad day and falling 2 points short of their VA estimate in a cohort of 20. 
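To spell out that last bit of arithmetic: in a cohort of 20, one pupil finishing 2 points short of their VA estimate moves the cohort average by 2/20 = 0.1. A trivial sketch (numbers invented purely for illustration; real VA is calculated per pupil against prior-attainment estimates):

```python
# (actual - estimated) VA points for 20 pupils: everyone bang on estimate
diffs = [0.0] * 20
print(100 + sum(diffs) / len(diffs))  # 100.0

# Now one pupil has a bad day and falls 2 points short
diffs[0] = -2.0
print(100 + sum(diffs) / len(diffs))  # 99.9
```

One bad morning, and the cohort's score has moved by the full 'significant' margin.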


But it is the misinterpretation of these data, and how much importance is placed on them, that is even scarier. We commonly hear people saying things like 'results are significantly below average' or, even worse, that 'pupils are making significantly less progress than average', which is just plain wrong.

So what does it mean? Well, in the case of RAISE, where a 95% confidence interval is used, it means that the confidence interval of the sample (i.e. the cohort) will contain the population mean in 95% of cases. If a sample's confidence interval does not contain the population mean (i.e. it falls in the other 5%), then the implication is that the deviation from the mean did not happen by chance. 
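Where does that 5% come from? A toy simulation makes it concrete. This is purely my own illustration - random cohorts drawn from a known population, nothing to do with the actual RAISE calculation, and the numbers (population mean 100, standard deviation 10, cohorts of 30) are invented:

```python
import random

random.seed(42)
POP_MEAN, POP_SD = 100, 10
COHORT_SIZE, TRIALS = 30, 10_000

misses = 0
for _ in range(TRIALS):
    cohort = [random.gauss(POP_MEAN, POP_SD) for _ in range(COHORT_SIZE)]
    sample_mean = sum(cohort) / COHORT_SIZE
    # 95% confidence interval: +/- 1.96 standard errors (known SD)
    half_width = 1.96 * POP_SD / COHORT_SIZE ** 0.5
    if abs(sample_mean - POP_MEAN) > half_width:
        misses += 1  # interval misses the true mean: 'significant' by chance

print(f"{100 * misses / TRIALS:.1f}% of purely random cohorts come out 'significant'")
```

Run it and roughly 5% of perfectly ordinary random cohorts get flagged - and that's under ideal assumptions that real school cohorts don't come close to satisfying.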

And there are of course false positives and false negatives where the supposedly significant deviation from the mean has happened by chance. 

There are so many things wrong here:

1) It assumes that cohorts of children are random samples of the population. Anyone who knows anything about schools, demographics, catchments and the shenanigans that go on to secure school places knows this is complete fantasy.

2) Significance suggests that the deviation from the mean has not happened by chance, but does not tell us the cause. Is it the school's fault? Is it really down to quality of teaching? Or are external factors having an impact on pupil performance? This is why it's so important to isolate the issue and get those case studies prepared. 

3) False negatives and positives - significant deviations from the mean that happen by chance - do occur. These would be relatively rare if cohorts were drawn at random from the population, but they're not, so who knows what we're looking at. It's a mess.

4) Pupils are not retested. A simple analogy: you throw a dice 6 times and the average score is 1.3 against an expected average of 3.5. Is this significant? Do you assume the dice to be faulty? Perhaps there is something wrong with the way you are throwing it. Or perhaps it just happened by chance. Maybe throw it a few more times and test it again. But this is just dealing with a simple thing like a dice, not something complex like a cohort of children with all their inherent variables.

5) A confidence interval is all about uncertainty. There is too much certainty being placed on uncertainty by some people.
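The dice analogy in (4) is easy to test for yourself. A quick simulation (again, just my own illustration; an average of 1.3 over 6 throws corresponds to a total of 8 or less):

```python
import random

random.seed(0)
TRIALS = 100_000

# Count experiments where 6 throws of a fair dice total 8 or less
# (an average of roughly 1.3, as in the analogy above)
low = sum(
    1 for _ in range(TRIALS)
    if sum(random.randint(1, 6) for _ in range(6)) <= 8
)

print(f"{100 * low / TRIALS:.3f}% of fair dice gave an average of about 1.3 or less")
```

In theory that's 28 of the 46,656 possible outcomes, about 0.06% - rare, but a perfectly fair dice will do it occasionally, and one set of 6 throws can't tell you whether the dice or chance is to blame.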

I recently attended a small meeting with some people from FFT and it was heartening to hear one of their senior statisticians talk about these issues. He was uneasy about the phrase 'statistical significance' because educational significance was all too often inferred from it. We discussed alternatives and it's a discussion that needs to continue. I don't have a huge problem with particularly high or low data being highlighted in some way but users need to understand its limitations, and absolutely must realise that statistically significant does not necessarily mean educationally significant.

Perhaps we should move away from the on/off switch of statistical significance; this apparent exactitude of uncertainty.

Perhaps what we need is 50 shades of blue and green.