Why was my student’s scale score higher than the most difficult item on their test?

This article uses Item Response Theory (IRT) to explain why a student might achieve a scale score higher than the difficulty of the most difficult item on their test.


The following topics are covered here:

  • Student scale score estimates
  • Item Response Theory
  • Confidence bands

Student scale score estimates

A student’s scale score can be higher than the difficulty estimate for the most difficult item in their test because scale scores represent overall ability in a learning area, such as reading or maths, not just performance on a single question. 

We use the term ‘item’ as it is more accurate than the colloquially used ‘question’. Test items are not always questions, but instead statements or visual prompts to be responded to or instructions to follow (for example, ‘complete this sentence’ or ‘simplify the equation’). The type and format of each item is determined by the skill and knowledge being assessed.

For an assessment to produce valuable information about students’ abilities, it needs to be appropriately targeted to uncover what students can do and understand, as well as what they cannot yet do and understand. So, when a student responds correctly to approximately 50% of the items, the test is well targeted and provides the maximum information about the skills a student is demonstrating, and those they are still developing. In these cases, the student’s scale score will fall somewhere within the range of difficulty of the items on their test (see image below).

[Image: individual student report where some items were answered correctly]

Report 1

In cases where a student has responded correctly to all, or almost all, items, they are likely to receive a scale score that sits above the difficulty rating of the most difficult item they saw (see image below). 

[Image: individual student report where all items were answered correctly]

Report 2 

This is expected, because – based on the evidence elicited by the test – the student’s ability appears to exceed the level of skill required to answer the most difficult items. But with very little evidence from the test of what the student cannot do, we are less certain of the accuracy of this student’s ability estimate than we would be if they had a mix of correct and incorrect responses. This means that the student’s scale score may be an overestimate.

 

Item Response Theory

Scale scores are derived using an Item Response Theory (IRT) model – in the case of PAT, the Rasch model – which considers two key elements: the difficulty of each item the student attempts and how many of those items are answered correctly. Both the difficulty of the items on a test and student ability are expressed on the same scale.

The Rasch model uses the student’s responses to all items to estimate their position on the ability scale. If a student responds correctly to most items, the model infers that their ability is likely above the hardest item’s difficulty. This is because scale scores are continuous and can extend above (and below) the range of item difficulties, and the estimate reflects the student’s probability of success on harder items, even if those items weren’t on the test.
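As an illustrative sketch only – not ACER’s actual scoring procedure – the maximum-likelihood ability estimate under the Rasch model can be found by solving the score equation: the ability at which the expected raw score equals the observed raw score. The function names and item difficulties below are hypothetical, and difficulties are assumed to be in logits rather than PAT scale units.

```python
import math

def p_correct(ability, difficulty):
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_ability(difficulties, responses, lo=-6.0, hi=6.0):
    """Maximum-likelihood ability estimate found by bisection on the
    score equation: observed raw score == expected raw score.
    (Not defined for all-correct or all-incorrect response patterns,
    which is why such patterns need special handling in practice.)"""
    raw = sum(responses)
    for _ in range(60):
        mid = (lo + hi) / 2.0
        expected = sum(p_correct(mid, d) for d in difficulties)
        if expected < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# A student answers 3 of 4 items correctly, missing only the hardest.
difficulties = [-1.0, 0.0, 0.5, 1.0]
responses = [1, 1, 1, 0]
est = estimate_ability(difficulties, responses)
print(round(est, 2))  # ≈ 1.36, above the hardest item's difficulty of 1.0
```

Note that the estimate lands above the difficulty of every item on this hypothetical test, even though the hardest item was answered incorrectly – the continuous ability scale is not bounded by the items administered.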

Because student ability scale scores and item difficulties are expressed on the same scale, they can be used to estimate the likelihood that a student will respond correctly to a given item.

  • If an item has a difficulty rating that is higher than a student’s scale score, the probability of a correct response is low. When a student’s scale score is at the same level as an item’s difficulty, the probability of success is about 50%. For example, a student with a scale score of 140 has a 50% chance of responding correctly to an item that has a difficulty score of 140.
  • As the difference between the student’s scale score and the item difficulty increases, the probability of success changes. Where the student’s scale score is much higher than the item difficulty, the student is much more likely to succeed. For instance, if the same student faces an item with difficulty 120, their chance of success is well above 50%, and even higher for an item with difficulty 100. Where the student’s scale score is much lower than the item difficulty, they are less likely to succeed.

This relationship is modelled mathematically by the Rasch model, which uses a logistic function to link ability and item difficulty to the probability of a correct response.
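The logistic relationship above can be sketched in a few lines of Python. This is an illustrative model only: it assumes ability and difficulty are expressed directly in logits, whereas PAT scale scores are a transformation of that underlying scale, so the scale-score numbers used earlier (140, 120, 100) are not plugged in directly here.

```python
import math

def p_correct(ability, difficulty):
    """Rasch model: the probability of a correct response is a logistic
    function of the gap between student ability and item difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Ability equal to item difficulty: probability is exactly 50%
print(p_correct(1.4, 1.4))             # 0.5

# Ability one logit above the item: success is much more likely
print(round(p_correct(2.4, 1.4), 2))   # 0.73

# Ability one logit below the item: success is less likely
print(round(p_correct(0.4, 1.4), 2))   # 0.27
```

The symmetry of the last two results reflects the model: a student one logit above an item succeeds as often as a student one logit below it fails.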

 

Confidence bands

In cases where nearly every item on a test is answered correctly, the confidence band around the student's scale score will be quite wide. This is because the confidence band represents the margin of error associated with the scale score. The margin of error widens when there is a lack of evidence as to what a high-achieving student cannot do.  In contrast, the margin of error is narrower when there is a balanced mix of right and wrong answers – in other words, a solid body of evidence as to what the student can and cannot yet do.  
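The widening margin of error can also be sketched. Under the Rasch model, each item contributes p(1 − p) to the test information, and the standard error of measurement is 1 divided by the square root of that information – so items a student is almost certain to answer correctly contribute almost nothing. The difficulties below are hypothetical values in logits, not PAT scale units.

```python
import math

def p_correct(ability, difficulty):
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def standard_error(ability, difficulties):
    """Standard error of measurement: 1 / sqrt(test information),
    where each item contributes p * (1 - p) to the information."""
    info = sum(p * (1.0 - p) for p in (p_correct(ability, d) for d in difficulties))
    return 1.0 / math.sqrt(info)

difficulties = [-1.0, 0.0, 1.0, 2.0]

# Well-targeted test: ability sits within the item range -> smaller error
print(round(standard_error(0.5, difficulties), 2))  # ≈ 1.14

# Ability far above the hardest item -> much larger error
print(round(standard_error(4.0, difficulties), 2))  # ≈ 2.39
```

This is why a report with every item correct shows a wide confidence band: the items provide very little information about where, above the hardest item, the student’s ability actually lies.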

This difference can be observed in the narrow confidence band in Report 1 above, where 50% of the items were answered correctly. In contrast, the wide confidence band in Report 2 reflects the fact that all items were answered correctly.

A wide confidence band means that it is important to interpret a student’s result carefully. For example, it is quite possible that upon retesting the student, you may see the scale score decrease to a more realistic level. In such cases the conclusion would not be that the student has moved backwards, but that a clearer understanding of the student’s ability has been gained, because the student was presented with test items closer in difficulty to their actual level of ability.

However, a wide confidence band still provides useful information about the student’s range of ability. The confidence band can also be interpreted as the student’s ‘Zone of Proximal Development’ (ZPD): we would expect the student to respond correctly to items at this level about 50% of the time. Consulting resources like the PAT band descriptors relevant to the range indicated by the confidence band can help build the picture of the student’s current ability.

ACER does not recommend retesting solely on the basis of a wide confidence band. If you are considering re-testing a student who has recently completed a PAT assessment, please refer to our article below: 

Should I re-test my student based on high or low raw scores?
