Guo Zhen AI (WeChat official account: 郭震AI)

Realtime AI News

LLMs Score Close to Human Examiners on Real GCSE Mock Exam Benchmark

A new dataset of 32,534 double-marked real student GCSE responses shows top LLMs agree with examiner consensus nearly as closely as two examiners agree with each other.


A study published on arXiv introduces a large benchmark for evaluating LLMs as markers of UK GCSE exam responses. The dataset comprises 32,534 double-marked real student responses to mock exams, spanning five subjects and 328 questions and including handwritten work.

Results show that off-the-shelf large language models agree well with examiner consensus across subjects, with top models approaching the inter-examiner agreement level. This is notable because the data comes from real student responses rather than artificially constructed answer sets.
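To make the comparison concrete, here is a minimal sketch of how a model's agreement with examiner consensus could be set against the agreement between two examiners. Everything in it is an assumption for illustration: the marks are invented toy values, "consensus" is taken to be the average of the two examiners' marks, and agreement is measured as mean absolute difference in marks. This summary does not say which consensus definition or agreement metric the paper uses; quadratic weighted kappa is another common choice for marking agreement.

```python
# Toy illustration only: invented marks, not data from the paper.
# Assumption: consensus = average of the two examiners' marks.
# Assumption: agreement = mean absolute difference (lower is closer).

def mean_abs_diff(marks_x, marks_y):
    """Average absolute gap in marks between two markers on the same responses."""
    if len(marks_x) != len(marks_y) or not marks_x:
        raise ValueError("mark lists must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(marks_x, marks_y)) / len(marks_x)

# Hypothetical marks for six responses.
examiner_a = [3, 5, 2, 4, 0, 6]
examiner_b = [3, 4, 2, 5, 1, 6]
model_marks = [3, 5, 2, 4, 1, 5]

# Consensus mark per response under the averaging assumption.
consensus = [(a + b) / 2 for a, b in zip(examiner_a, examiner_b)]

inter_examiner_gap = mean_abs_diff(examiner_a, examiner_b)
model_consensus_gap = mean_abs_diff(model_marks, consensus)

print(f"examiner vs examiner: {inter_examiner_gap:.3f} marks")
print(f"model vs consensus:   {model_consensus_gap:.3f} marks")
```

In this toy case the two gaps are of similar size, which is the kind of pattern the headline result describes. Note that a model compared against an averaged consensus and two examiners compared against each other are not strictly like-for-like measurements, so a real analysis would need to follow the paper's own protocol.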

The paper, published on arXiv cs.CL on June 25, 2026, contributes real-world evidence to the debate about AI-assisted assessment in education, suggesting LLMs may be viable as evaluation assistants.

Why it matters

The study provides real-world evidence of LLM potential in educational assessment, drawn from authentic student exam data rather than constructed answer sets.

Tags: arXiv, Benchmark, LLM, Education
