The public debate about the assessment of school pupils’ literacy and numeracy has neglected valid and reliable evidence. There are plenty of anecdotes told by people with a political point to make about the first year of testing in 2017-18, notably when, on 19 September 2018, the Scottish Parliament voted for the tests in Primary 1 to be halted. There has also been the evidence put forward by the teachers’ trade union, the EIS, to the Scottish Government’s routine review of the first year of the tests. This evidence appeared to show widespread concern by teachers and anxiety among children, but it was not based on a scientifically conducted survey, rather on a consultation within the EIS that attracted (in the Scottish Government’s estimate) around 460 responses from over 54,000 union members. And, on the other side, there have been stories told by Government politicians, in reply to these criticisms, of teachers who have found the tests useful, and of children who have enjoyed doing them. Faced by all this controversy, the Government’s Education Secretary, John Swinney, has announced what he has called a new ‘independent review’ of the tests in the first year of primary school, without specifying what kinds of new evidence the review will collect.
What is really puzzling about these Government responses is that they could have referred to much stronger evidence which the Government itself had already commissioned. That evidence was not routinely made public, and has been obtained by Reform Scotland only in response to several Freedom of Information Requests (with code numbers 18-02228, 18-02327, and 18-02535 in the Scottish Government FoI web pages). Why that route to the evidence was required is itself odd, but is not the main point here. The evidence relates to almost all the issues that have been raised during the recent controversies. On the whole, the conclusions tend to vindicate the Government’s position except as against those critics who reject testing altogether. So if you accept that standardised tests are a pedagogically valid way of understanding the progress of individual pupils and the quality of the education system as a whole, then this evidence ought not to be ignored.
It should be acknowledged first that the evidence was collected by the contractor which manages the tests, ACER. Cynics might be concerned about that, but there would be two replies. One is that the criticisms of the tests have not been about ACER itself, the quality and integrity of whose work is not in doubt. ACER – which is the Australian Council for Educational Research – is as respectable internationally as, for example, the National Foundation for Educational Research in England, or the former Scottish Council for Research in Education.
The other reply would be that the evidence is presented by ACER with attention to much of the detail that is required for the reader to evaluate the quality of the evidence. We can form our own judgement on the trustworthiness of the findings. This does not mean that the research is flawless or that more detail would not be desirable, as we will see; but it itself gives us the means to judge its quality, as good science always ought to do.
There are three main bodies of statistical evidence – evaluation of the workings of the tests in their first year, constructing norms by which the results for individual pupils might be understood, and the first stages in the creation of scales of attainment by which the progress of pupils from the first year of primary school to the third year of secondary might be tracked. There is also non-statistical evidence on how the individual tests were matched to the details of the school curriculum (the benchmarks in the Curriculum for Excellence), and – though unfortunately with less information – on how teachers responded to the tests during their development and during the first year. All this evidence allows us to comment on three broad aspects of the current controversies.
Validity of the tests
The first point is whether the tests are relevant to the curriculum. Claims that they distort the curriculum by forcing attention onto a narrow range of criteria, or interfere with teachers’ capacity to teach effectively, or get in the way of pupils’ capacity to learn at a pace that suits them, all come back to essentially the same point – that the tests are an intrusion that cannot be reconciled with the curriculum’s aims.
In fact, the evidence shows that the tests were developed paying close attention to specific details of the curriculum. The overall contractual requirement is that ‘the content of the Assessments will reflect the knowledge, skills, understanding, and standards embedded within the Curriculum for Excellence experiences and outcomes for reading, writing and numeracy across the CfE Levels.’ This terminology of ‘experiences and outcomes’ is the way in which the curricular details have been described in Scotland since 2010. The curriculum is grouped into ‘levels’: the early level is what most children should learn by the end of Primary 1, first level is by Primary 4, second level is by Primary 7, and third and fourth by Secondary 3.
For example, for the early level in numeracy and mathematics, children are expected to learn under various headings, such as ‘number and number processes’, ‘money’, and ‘time’. Examples of achievement which the curriculum specifies under these headings are ‘recalls the number sequence forwards within the range 0-30, from any given number’, ‘identifies all coins to £2’, and ‘engages with everyday devices used to measure or display time, including clocks, calendars, sand timers and visual timetables’. The evidence obtained through the FoI requests shows that the exact same headings are used to group items in the tests, and that specific test items were based on similar examples to these (although, regrettably, detailed examples of items are not given).
Critics have further claimed that testing Primary 1 children is particularly reprehensible because it might contradict the supposedly ‘play based’ principles of the early years. This has been one of the main arguments from the Scottish Conservatives in their opposition to the Primary 1 tests, in contrast to their support for testing at older ages. In fact, there is no such systematic philosophy in any of the curricular documents (as critics of Scotland’s relatively early starting age for school point out). There is selective attention to ‘structured play’ in the early-years guidance, but as a means to the end of the beginnings of literacy. For example, children at these ages are encouraged to ‘share stories’ through ‘imaginative play’. The literacy assessments are not able to investigate this because, in Primary 1, they do not assess writing, a restriction which itself was presumably intended to be sensitive to the unavoidable reality that not all children can write at that age. So the tests look only at somewhat passive activities – reading and listening. But the purpose of developing these in the curriculum is enabling children’s linguistic creativity. Although imaginative literary play is not assessed, its necessary precursor, the skilled use of language, is. That approach seems quite consistent with a goal of ‘imaginative play’, given the probably reasonable premise of not assessing writing at this young age.
The validity of assigning specific assessment tasks to one of these curricular headings was judged by the expert panels on literacy and numeracy that had been set up by Education Scotland, which is the Scottish Government agency in charge of the curriculum. It would have been more satisfactory if more public information about the composition of these panels had been published. Nevertheless the process of implementing the tests was overseen by ‘user assurance groups’ that were constituted in the way that all such implementation groups are designed in Scotland – ‘with representation of teachers, head teachers, professional associations, local authority officials, academics and specialists in Additional Support Needs and accessibility’. Thus the relevance of the tests to the curriculum was judged by the same kinds of professional committees as constructed the curriculum in the first place. If the tests are suspect because of how they were developed, then so is the curriculum.
Moreover, ACER has also analysed how difficult pupils found individual questions in the tests to be. A valid test should have questions with a range of difficulties so as to be able to record the full range of pupils’ capacities. The conclusions of the analysis were that the tests were broadly satisfactory in that respect, except perhaps having fewer difficult questions than would be desirable. This evaluation found no evidence of the concern about excessive difficulty that was expressed by teachers in the EIS canvass of its members.
Thus when critics of the tests have claimed that they are cruel – reducing pupils to tears, provoking parents to indignation, or frustrating teachers with their educational irrelevance – then they are in effect saying that the curriculum itself has that potential built into it. To put that bluntly, if it is right to expect a five-year-old child to be able to read a calendar, then why is it cruel to ask them to do so?
Reliability of the tests
The second point relates to whether it is possible to assess pupils at the specified ages by means of tests. Usually this has been expressed as a particular concern about the Primary 1 tests, and indeed that was the basis of the motion that was passed by the Scottish Parliament to halt these tests. Some organisations have expressed more general doubts about all standardised testing of this kind, for example the Scottish Liberal Democrats and the Scottish Greens. The EIS, though not officially opposed to the principle of standardised tests, nevertheless has said they are concerned about any kind of testing if it is used for purposes other than contributing to teachers’ professional judgement.
Here, again, the evaluation obtained through the FoI request by Reform Scotland is extensive. Nearly 15,000 Scottish pupils in November 2017 and March 2018 were involved in studies specifically designed to establish appropriate norms (quite separate from the routine administration of tests to all pupils of the appropriate ages in that school year). This matters, and is a straightforward corollary of basing the tests on the Scottish curriculum. To interpret the test results for a specific pupil in the tests it is necessary to know what the range of results across all pupils is likely to be. That is indeed the essence of what is meant by ‘standardised’, and its purpose is to try to make sure that pupils are being judged by standards that might reasonably be expected of children of that age who are following this curriculum. The tests that were previously used by 29 of the 32 Scottish local authorities were not based on the Scottish curriculum, and were based on norms established with populations outside Scotland.
The results from these evaluations are reported mainly in terms of measures of reliability, specifically what is known as Cronbach’s alpha. This is a widely used index of the extent to which a batch of individual test items are giving stable information about a child’s capacity in the specific domain that these items are intended to assess – asking essentially whether, if the child was tested again, they would get broadly the same result. The results are reproduced in the table.
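For readers who want to see what the index measures, the following is a minimal sketch of the standard Cronbach’s alpha calculation. It is not ACER’s code, and the pupil responses are entirely made up for illustration; the function name `cronbach_alpha` and the data are assumptions, since the FoI material does not include item-level results.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of item scores.

    scores: one row per pupil, one column per test item.
    Uses the standard formula
        alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)).
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item (column) across pupils.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Variance of each pupil's total score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical right/wrong (1/0) answers from six pupils on four items.
example_scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]

alpha = cronbach_alpha(example_scores)
print(round(alpha, 3))  # about 0.70 for this made-up data
```

The closer the items move together across pupils, the closer alpha gets to 1; items that behave independently of one another pull it towards 0.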
Reliability of assessments
| Stage and domain | Cronbach’s alpha |
| --- | --- |
| Primary 1: | |
| Numeracy | 0.840 |
| Literacy | 0.849 |
| Primary 4: | |
| Numeracy | 0.868 |
| Reading | 0.880 |
| Writing | 0.882 |
| Primary 7: | |
| Numeracy | 0.889 |
| Reading | 0.860 |
| Writing | 0.820 |
| Secondary 3: | |
| Numeracy | 0.880 |
| Reading | 0.887 |
| Writing | 0.780 |
Source: response to Q1 in FoI request 18-02228
The general rule of thumb invoked when interpreting reliabilities is that values above 0.8 are ‘good’ and above 0.9 are ‘excellent’. By this criterion, ten of the eleven values in the table are ‘good’, the exception being Secondary 3 writing at 0.780, and none is yet ‘excellent’. That is a respectable result, especially for the first year of an assessment system that has built into it a deliberate intention to improve.
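As a quick check, the rule of thumb can be applied mechanically to the figures in the table. The sketch below simply copies the published values; the function name `rule_of_thumb` and the label ‘below good’ are illustrative choices, not terms from the FoI release.

```python
# Reliabilities copied from the table (response to Q1 in FoI request 18-02228).
reliabilities = {
    ("Primary 1", "Numeracy"): 0.840,
    ("Primary 1", "Literacy"): 0.849,
    ("Primary 4", "Numeracy"): 0.868,
    ("Primary 4", "Reading"): 0.880,
    ("Primary 4", "Writing"): 0.882,
    ("Primary 7", "Numeracy"): 0.889,
    ("Primary 7", "Reading"): 0.860,
    ("Primary 7", "Writing"): 0.820,
    ("Secondary 3", "Numeracy"): 0.880,
    ("Secondary 3", "Reading"): 0.887,
    ("Secondary 3", "Writing"): 0.780,
}


def rule_of_thumb(alpha):
    """Label a reliability using the 0.8 / 0.9 convention quoted in the text."""
    if alpha > 0.9:
        return "excellent"
    if alpha > 0.8:
        return "good"
    return "below good"


labels = {key: rule_of_thumb(a) for key, a in reliabilities.items()}
below_good = [key for key, label in labels.items() if label == "below good"]
print(below_good)  # [('Secondary 3', 'Writing')]
```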
One relevant yardstick is to compare these reliabilities with those which have been achieved in England for the National Curriculum Assessments that have been in place since the mid-1990s. There were concerns at the beginning in England, too, that the tests would be unreliable. An evaluation about a decade ago by Paul Newton of the Office of the Qualifications and Examinations Regulator (Ofqual) found that, by 2007, most of the reliabilities lay between 0.8 and 0.9. A more recent evaluation by Ofqual found reliabilities above 0.9. So, already in their first year, the Scottish tests mostly seem to have nearly reached these high levels.
Reliability is not the most intuitively appealing way of understanding the quality of tests. Perhaps a clearer way of thinking about it is to ask this question: how likely would it be that the tests would classify a pupil’s level of achievement wrongly? In the English research, this was defined to be making an error in judging which level of the national curriculum a pupil had reached. Back at the beginning of the national curriculum, in the mid-1990s, it was estimated by Professor Dylan Wiliam that there was a 30% chance that the tests would get the level wrong. This figure was widely disseminated as a sign of how unreliable the tests were. Other evidence suggested that it was unduly pessimistic, and in any case the recent evaluation showed a much lower probability of misclassification – around 10% for mathematics, 13% for science, and 15% for English.
Without more statistical information about the results of the Scottish tests than has been provided, we cannot properly estimate the probability of misclassification here. But a very crude estimate might be this. The gap of 5 percentage points in misclassification between mathematics and English in England corresponds to a gap of 0.045 in reliabilities. The average reliability of the Scottish tests in the table above is 0.86, which is 0.06 below the reliability for English in England. If the probability of misclassification rises approximately proportionately to the fall in reliability, then we might estimate the probability of misclassification from the Scottish tests to be about 7 percentage points above the probability for English in England, or about 22%. That is slightly worse than the position reported by the National Foundation for Educational Research in England about a decade ago. If refinement to the Scottish tests over the next few years could increase the average reliability to over 0.9 from 0.86, then, by the same proportional rule, that probability of misclassification would drop to about 17% or below, reaching 15% if the reliability matched the 0.92 for English in England.
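The arithmetic behind this crude estimate can be laid out explicitly. The sketch below uses only the figures quoted in the text and the table; the English reliability of 0.92 is not stated directly but is implied by the text (0.86 plus 0.06), and the straight-line relationship is the same rough assumption made above, not an established psychometric result. The function name `crude_misclassification_pct` is purely illustrative.

```python
from statistics import mean

# The eleven reliabilities from the table, in order.
table_alphas = [0.840, 0.849, 0.868, 0.880, 0.882, 0.889,
                0.860, 0.820, 0.880, 0.887, 0.780]

# English figures quoted in the text.
ENGLISH_MISCLASS_PCT = 15.0   # English (the subject) in England
MATHS_MISCLASS_PCT = 10.0     # mathematics in England
RELIABILITY_GAP = 0.045       # reliability gap between the two subjects
ENGLISH_RELIABILITY = 0.92    # assumption: implied by 0.86 + 0.06

# Percentage points of misclassification per unit of reliability lost.
PCT_PER_UNIT = (ENGLISH_MISCLASS_PCT - MATHS_MISCLASS_PCT) / RELIABILITY_GAP


def crude_misclassification_pct(reliability):
    """Straight-line extrapolation from the English figures (very rough)."""
    return ENGLISH_MISCLASS_PCT + (ENGLISH_RELIABILITY - reliability) * PCT_PER_UNIT


scottish_mean = round(mean(table_alphas), 2)           # 0.86
estimate = crude_misclassification_pct(scottish_mean)  # about 21.7, i.e. roughly 22%
print(scottish_mean, round(estimate, 1))
```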
These estimates are, however, unacceptably crude, and it would be much better if they could be replaced by proper estimates from ACER of the probability of misclassification, using the data which they have collected during the first year. Publishing these results would be a useful outcome of the Government’s new independent review.
The point of all these technicalities is that the new Scottish tests are already giving reasonably reliable information, even for Primary 1 pupils. Contrary to the fears of their critics, this psychometric evidence suggests that it is possible to assess pupils in ways that are relevant to the curriculum and that produce results that can be broadly trusted. Furthermore, the tests are likely to become more trustworthy as the new system goes through the improvement process that is built into its design.
Educational use of the tests
The controversy around the tests also raises questions about how they might be used. For example, the EIS persuaded the government early on to promise not to publish the average test results for individual schools.
Some features of the tests, as now released through Freedom of Information, show encouraging sensitivity to educational concerns, but other aspects of the reporting of the tests remain opaque.
The most promising aspect of the proposed reporting is the construction of what are called ‘long scales’. These are intended to place the results of all tests – from Primary 1 to Secondary 3 – onto a single scale so that pupils’ progression can be measured. That information would allow teachers and parents to develop an understanding of the progress which children are making as they go through school. Never before has this kind of information been available to Scottish parents, since all previous modes of reporting to parents have been vague judgements rather than specific results.
These long scales were constructed by a combination of the evidence from 15,000 pupils relating to reliability (noted above) and evidence from a further approximately 16,000 pupils in the intermediate school stages that are not included in the tests. For example, this allowed a check to be made that children in Primary 3 were closer to the results of children taking the tests in Primary 4 than to Primary 1, and that children in Primary 2 were closer to Primary 1 than to Primary 4.
The resulting scale will form the basis of the reporting of test results to parents. In the draft reporting format, parents will be given their own child’s test results, and the corresponding average results for the child’s school and nationally. It seems likely that local authorities will add also the results for the authority as a whole. It is intended that these reports will be in terms of 12 bands, covering attainment from the beginning of Primary 1 to the high end of Secondary 3. The bands will be described in language drawn from the Curriculum for Excellence, following through the mapping of the tests onto the curriculum that is described above.
This all looks quite sensible, although much piloting will be required to see how accessible these quite technical documents will be to parents. It is to be hoped, moreover, that the eventual reporting at national and local-authority levels will show rates of progress up the 12 bands, not merely the proportion at each band in each year. The most poorly explained aspect of the proposed reporting, however, is the step from the bands resulting from the tests to the assignment of pupils to levels of the Curriculum for Excellence. We are told that teachers will do this using their judgement, because, according to the Freedom of Information release, it is ‘inappropriate to simply align individual [assessment] outcomes with overall professional judgement of achievement of a level’. If the test results ought not to be mapped onto curriculum levels in this way, one wonders why all the effort has been put into doing precisely that (as noted above). More to the point, we are left wondering how teachers will carry out this mysterious exercise of ‘judgement’. Trusting teachers’ professional judgement has become a Scottish mantra, much invoked by the EIS. But a truly self-confident and expert profession would explain to society how its judgements are reached.
More controversial will be what extra information is provided alongside the test results. For example, teachers are free to test children at any time in the school year, another consequence of pressure by the EIS on the government. That makes interpreting the results of tests quite difficult unless account is taken of age. Even at Primary 4, the difference in maturity between, say, early September and late May is about one tenth of a child’s life to date; in Primary 1, that period is about one sixth. The proposal to report against norms in November and March is too crude to capture these differences.
Nothing has been said about how the reporting will take account of such matters as gender, socio-economic circumstances, or home language. On grounds of equity, it is indeed reasonable to show all children against a common standard. Otherwise, we would be implicitly having lower expectations of some children than of others. But in order to explain the achievements of particular children, some contextualising is required. For example, consider a child who has nationally-average attainment in a school which itself has below-average attainment because of social deprivation in its catchment area. The child would then appear in the report to be doing well in relation to the school but not particularly well in relation to the national average. Without explanation of why the school average is low it would not be possible for the parents to understand their child’s performance. Indeed, without that explanation, attributing the child’s merely average performance to the school’s seemingly poor quality would be an understandable but inaccurate parental response.
Further complicating these already complicated concerns about reporting is the need to explain the inevitable element of randomness even in the best-designed system of assessment – the probabilities of misclassification noted above. Incidentally, we would not avoid these problems by not testing, and by relying wholly on teacher judgement. It, too, is subject to random error, but inscrutably so unless we have objective tests.
Versions of these dilemmas will multiply at all levels of reporting, whether nationally or at the level of the local authority. They will be exacerbated by the inevitability of school-level reporting, whatever the EIS and the government might want. Because the schools (and the local authority) will have to calculate the school-average attainment, that information will be subject to a Freedom of Information request. In almost all circumstances, there will be no grounds for withholding it, because, except in very small schools, it would not reveal the identity of any pupil or teacher, and would not be covered by commercial confidentiality, since the data would by then be owned by the authority.
Conclusions
All the information which has been obtained by Reform Scotland through FoI requests ought to have been automatically in the public domain, because it answers many of the concerns that have been raised, as this blog has sought to show:
- The tests are valid, in that they have been based on the Scottish curriculum.
- The tests are acceptably reliable, though not outstandingly so, and there is a planned programme of refinement that should lead to improvement.
- The tests offer the potential of informative reporting to parents and to Scottish society. That potential is much better than anything which Scotland has ever had, but more thought has to go into how to do it effectively.
Although the EIS has shown some evidence of discontent among teachers, the representativeness of the opinions which the union gathered from those members who chose to respond to its request for comment cannot be determined without a scientifically valid survey.
The politicisation of this issue is regrettable. It is to be hoped that the debate might move to grounds that are more firmly based on psychometric evidence, and on more systematic information about the experiences of the pupils and teachers than has been available hitherto.
Lindsay Paterson is professor of education policy in the School of Social and Political Science at Edinburgh University. His main academic interests are in education policy, social mobility, civic engagement, political attitudes, and statistical methods for social science.