Towards transparent AI grading: Entropy as a signal for human-AI disagreement

Karrtik Iyer, Shayan Mohanty, Manikandan Ravikiran and Prasanna Pendse

Published: August 06, 2025

Automated grading systems can quickly score short-answer questions, but they don't always indicate when a grading decision is uncertain or contentious. This work proposes a method called "semantic entropy" that measures how much GPT-4's explanations for the same student response differ, particularly in cases where human graders disagree. It groups similar explanations and calculates how diverse these groups are, rather than looking only at the final scores. Three research questions are addressed:

Does semantic entropy align with human grader disagreement?
Does it generalize across academic subjects?
Is it sensitive to structural task features such as source dependency?

Experiments on the ASAP-SAS dataset show that semantic entropy correlates with rater disagreement, varies meaningfully across subjects, and increases in tasks requiring interpretive reasoning. These results underscore semantic entropy’s potential as a domain- and task-sensitive signal for triaging ambiguous or contentious grading cases in educational settings.
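
The grouping-and-diversity idea described above can be illustrated with a minimal sketch. It rests on assumptions that the text does not confirm: diversity is taken to be Shannon entropy (in bits) over the share of explanations in each group, and grouping is done greedily with a caller-supplied `same_meaning` predicate. The names `cluster_by_meaning`, `semantic_entropy`, and `same_meaning` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def cluster_by_meaning(
    explanations: Sequence[str],
    same_meaning: Callable[[str, str], bool],
) -> List[int]:
    """Greedily assign each explanation to the first group whose
    representative it matches; otherwise start a new group.

    `same_meaning` is a hypothetical similarity test supplied by the
    caller (an assumption; the text does not say how similarity is judged).
    """
    representatives: List[str] = []
    labels: List[int] = []
    for text in explanations:
        for idx, rep in enumerate(representatives):
            if same_meaning(rep, text):
                labels.append(idx)
                break
        else:
            # No existing group matched, so this text opens a new one.
            representatives.append(text)
            labels.append(len(representatives) - 1)
    return labels


def semantic_entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (bits) of the group-size distribution.

    Assumption: entropy over group proportions stands in for the
    diversity measure; 0.0 means every explanation fell in one group.
    """
    if not labels:
        return 0.0
    total = len(labels)
    counts = Counter(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_explanations = [
        "correct: cites the source",
        "correct: matches the rubric",
        "incorrect: misses the key term",
        "correct: cites the passage",
    ]

    # Toy similarity test: two explanations match if they open with the same verdict.
    def same_verdict(a: str, b: str) -> bool:
        return a.split(":")[0] == b.split(":")[0]

    groups = cluster_by_meaning(toy_explanations, same_verdict)
    print(groups, semantic_entropy(groups))
```

In this sketch, a response whose explanations all land in one group scores 0.0, while an even split across two groups scores 1.0 bit. Under the assumptions above, higher values would mark responses worth routing to a human grader.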