By Karrtik Iyer, Manikandan Ravikiran, Prasanna Pendse and Shayan Mohanty
Automated grading systems can score short-answer questions quickly, but they do not always flag when a grading decision is uncertain or contentious. This work proposes a novel measure called “semantic entropy” that captures how much GPT-4 explanations for the same student answer differ from one another, especially in cases where human graders disagree. The method groups similar explanations and calculates how diverse these groups are, rather than looking only at the final scores. Three research questions are addressed:
Does semantic entropy align with human grader disagreement?
Does it generalize across academic subjects?
Is it sensitive to structural task features such as source dependency?
Experiments on the ASAP-SAS dataset show that semantic entropy correlates with rater disagreement, varies meaningfully across subjects, and increases in tasks requiring interpretive reasoning. These results underscore semantic entropy’s potential as a domain- and task-sensitive signal for triaging ambiguous or contentious grading cases in educational settings.
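As a rough illustration of the grouping-and-diversity idea described above, the sketch below computes Shannon entropy over the cluster labels assigned to a set of explanations. It assumes the explanations have already been grouped by some separate step that is not shown here. The function name `semantic_entropy`, the label format, and the base-2 logarithm are illustrative assumptions, not details taken from the submission.

```python
import math
from collections import Counter


def semantic_entropy(cluster_labels):
    """Shannon entropy (in bits) of how explanations spread across clusters.

    `cluster_labels` holds one label per explanation; explanations that were
    judged similar share a label. How the labels are produced is assumed to
    be handled elsewhere (hypothetical upstream clustering step).
    """
    n = len(cluster_labels)
    if n == 0:
        raise ValueError("at least one explanation is required")
    counts = Counter(cluster_labels)
    # Sum of p * log2(1/p) over clusters, where p is the cluster's share.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# All explanations fall in one group: no diversity.
print(semantic_entropy(["a", "a", "a", "a"]))  # 0.0

# Explanations split evenly across two groups: one bit of entropy.
print(semantic_entropy(["a", "a", "b", "b"]))  # 1.0
```

Under these assumptions, a value near zero means the explanations largely agree, while larger values mean they spread across several distinct groups, which is the kind of case one might route to a human grader.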
(Research submission here.)