Futures

Evaluating the Accuracy of Language Models in Automated Essay Grading Compared to Human Raters (from page 20260426)

Summary

This study examines the effectiveness of large language models (LLMs) in automated essay scoring compared to human grading. The evaluation found weak agreement between LLM and human scores, with the degree of agreement varying by essay characteristics: LLMs tended to assign higher scores to shorter or less developed essays and lower scores to longer essays containing minor errors. The feedback the LLMs generated was coherent with their grading, in that essays receiving more positive comments also received higher scores. These findings indicate that while LLMs produce internally consistent feedback, their scores do not align well with human evaluation, suggesting limited ability to replicate human grading behavior. LLMs can nonetheless assist in the essay scoring process.
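The source does not name the agreement metric behind the "weak agreement" finding. In automated essay scoring, rater agreement is conventionally measured with quadratic weighted kappa; the sketch below, using made-up scores chosen only to mimic the reported pattern, shows how such a result would be quantified.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, min_score=1, max_score=6):
    """Agreement between two raters on an ordinal scale:
    1.0 is perfect agreement, 0.0 is chance-level agreement."""
    n = max_score - min_score + 1
    # Observed joint distribution of the two raters' scores.
    observed = np.zeros((n, n))
    for x, y in zip(a, b):
        observed[x - min_score, y - min_score] += 1
    observed /= observed.sum()
    # Expected distribution under independence (outer product of marginals).
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic penalty: disagreements cost the squared score distance.
    i, j = np.indices((n, n))
    weights = (i - j) ** 2 / (n - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Hypothetical scores echoing the reported pattern: the model over-scores
# short, weak essays and under-scores long essays with minor errors.
human = [2, 3, 5, 5, 6, 4]
model = [4, 4, 4, 4, 4, 3]
print(round(quadratic_weighted_kappa(human, model), 3))
```

On this toy data the kappa comes out near zero, i.e. roughly chance-level agreement, which is the shape of result the summary describes.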

Signals

LLM Grading Agreement Issues
Description: LLMs show weak agreement with human grades across various essay characteristics.
Change: From reliance on human grading to the use of automated essay scoring models.
10-year outlook: Automated essay scoring could become a common tool but may lack human-like grading quality.
Driving force: Increasing demand for efficient, scalable grading solutions in education.
Relevancy: 4

Variation in LLM Responses
Description: LLMs assign different scores based on essay length and quality, diverging from human grading.
Change: Shift from uniform grading standards to variable scoring by AI models.
10-year outlook: Grading paradigms may evolve to accommodate AI-driven assessment metrics.
Driving force: Need for personalized feedback mechanisms in educational assessments.
Relevancy: 5

Feedback Consistency
Description: LLMs provide feedback that correlates with their scoring patterns, indicating coherence in their assessment process.
Change: Transition from traditional feedback methods to AI-driven, consistent feedback systems.
10-year outlook: Feedback mechanisms in education will increasingly rely on AI for consistency and reliability.
Driving force: Push for standardized, repeatable feedback processes in educational settings.
Relevancy: 4

Challenges for Complex Essays
Description: LLMs assign lower scores to longer essays with minor errors, potentially misjudging quality.
Change: From nuanced human evaluation to AI scoring that may overlook complexity.
10-year outlook: Future grading may favor shorter, error-free content over comprehensive ideas.
Driving force: AI's efficiency needing to be balanced against the need for critical, comprehensive evaluation.
Relevancy: 3

Essay Scoring Automation
Description: LLMs can reliably assist in essay scoring despite their limitations in human-like grading (see the hybrid-routing sketch after this list).
Change: Gradual acceptance of AI in essay grading as a supportive tool rather than a complete solution.
10-year outlook: AI may play a central role in education through hybrid grading systems that integrate AI and human raters.
Driving force: Advances in AI technology and constraints on educational resources.
Relevancy: 5
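The final signal anticipates hybrid grading systems that combine AI and human raters. The study does not describe such a system; the following sketch shows one plausible routing rule under assumptions not drawn from the source: accept an automated score only when several independent LLM scoring passes agree closely, and escalate the essay to a human rater otherwise. The `route` helper and its tolerance are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class Routed:
    essay_id: str
    score: float | None  # accepted automated score, or None if escalated
    needs_human: bool

def route(essay_id: str, llm_scores: list[int], spread_tol: float = 0.5) -> Routed:
    """Hypothetical hybrid rule: accept the mean of several independent
    LLM scoring passes only when their spread is small; otherwise send
    the essay to a human rater."""
    if pstdev(llm_scores) <= spread_tol:
        return Routed(essay_id, mean(llm_scores), needs_human=False)
    return Routed(essay_id, None, needs_human=True)

print(route("essay-001", [4, 4, 5]))  # spread ~0.47: auto-accepted
print(route("essay-002", [2, 5, 4]))  # spread ~1.25: human review
```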

Concerns

Misalignment in Grading Standards: LLM-generated scores do not align with human grading, leading to potential inaccuracies in academic assessments.
Overestimation of Short Essays: LLMs tend to assign higher scores to underdeveloped essays, which could misrepresent student capabilities.
Underestimation of Longer Essays: Heavier scrutiny of longer essays for minor errors may unfairly penalize students who demonstrate deeper understanding.
Reliance on Praise and Criticism Patterns: Deriving scores from the balance of positive and negative feedback may lead to biased evaluations.
Underdeveloped AI Grading Tools: The inconsistency of LLM scores suggests they are not yet reliable tools for standardized evaluation.

Behaviors

LLM Grading Patterns: LLMs exhibit distinctive scoring patterns, favoring shorter essays while penalizing longer ones with minor errors, diverging from human grading methods.
Praise-Criticism Correlation: LLM scores are consistent with their feedback; positive comments correlate with higher scores and criticism with lower scores (see the correlation sketch after this list).
Agreement Variability: Agreement between LLM and human grading varies with essay characteristics, indicating nuanced performance differences among LLMs.
Supportive Grading Tool: Despite these discrepancies, LLMs can be used as reliable support tools for essay scoring because of their coherent feedback patterns.
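The praise-criticism correlation described above can be checked mechanically. The sketch below is a minimal illustration, not the study's method: it estimates a praise ratio from feedback text using hypothetical word lists and correlates that ratio with the assigned scores.

```python
from scipy.stats import spearmanr

# Illustrative word lists; the study's feedback analysis is not published here.
POSITIVE = {"clear", "strong", "effective", "coherent", "organized"}
NEGATIVE = {"unclear", "weak", "vague", "error", "incomplete"}

def praise_ratio(feedback: str) -> float:
    words = feedback.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos / (pos + neg) if pos + neg else 0.5

feedback = [
    "clear thesis and strong evidence, well organized",
    "vague argument with unclear structure and one error",
    "coherent ideas but a weak conclusion",
]
scores = [5, 2, 3]  # made-up scores paired with the feedback above
rho, _ = spearmanr([praise_ratio(f) for f in feedback], scores)
print(f"rho = {rho:.2f}")  # a positive rho mirrors the reported coherence
```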

Technologies

Large Language Models (LLMs) for Automated Essay Scoring: LLMs are evaluated for their effectiveness in assessing essays compared to human grading, revealing strengths and weaknesses in their scoring methodologies (a hypothetical prompting sketch follows this list).
AI-Based Feedback Mechanisms: AI-generated feedback is consistent with the models' scoring patterns and can offer students meaningful insights for improving their writing.
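The study's prompting setup is not given. As an illustration of how an LLM is commonly asked for a rubric-based score plus feedback, the sketch below uses the openai-python chat-completions interface; the model name, rubric wording, and `score_essay` helper are assumptions, not the study's configuration.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder rubric; a real deployment would use the official scoring guide.
RUBRIC = "Score the essay from 1 to 6 for thesis clarity, evidence, organization, and mechanics."

def score_essay(essay: str) -> dict:
    """Ask the model for a JSON object holding a score and short feedback."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": f"You are an essay rater. {RUBRIC} "
                        'Reply as JSON: {"score": <1-6>, "feedback": "<one sentence>"}'},
            {"role": "user", "content": essay},
        ],
    )
    return json.loads(response.choices[0].message.content)

result = score_essay("Sample essay text goes here.")
print(result["score"], result["feedback"])
```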

Issues

LLM Agreement with Human Grading: The discrepancy between grades assigned by LLMs and human raters highlights a lack of alignment in grading methods.
Bias in Automated Grading Systems: LLMs may favor shorter essays or those with minimal errors, potentially leading to biased assessments.
Feedback Consistency of LLMs: LLMs generate feedback that aligns with their scoring, raising questions about the quality and reliability of their evaluations.
Dependency on Essay Characteristics: LLM grading performance varies with essay length and quality, indicating a need for targeted training (see the bias-check sketch after this list).
Role of AI in Education: Using LLMs for essay scoring may change traditional grading practices and raises ethical considerations.
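The length-dependent bias flagged under "Dependency on Essay Characteristics" can be probed directly by correlating essay length with the gap between LLM and human scores. The numbers below are fabricated purely to mirror the reported direction of the effect, with the LLM over-scoring short essays and under-scoring long ones.

```python
import numpy as np

lengths = np.array([150, 220, 300, 450, 600, 800])  # essay length in words
human   = np.array([2, 3, 4, 5, 5, 6])              # hypothetical human scores
llm     = np.array([4, 4, 4, 4, 4, 5])              # hypothetical LLM scores

gap = llm - human  # positive = LLM over-scores relative to the human rater
r = np.corrcoef(lengths, gap)[0, 1]
print(f"length-vs-gap correlation: {r:.2f}")  # negative r matches the pattern
```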