Evaluating the Accuracy of Language Models in Automated Essay Grading Compared to Human Raters
Keywords
- LLMs
- grading
- essays
- human scores
- feedback
- AI
Themes
- large language models
- automated essay scoring
- human grading
- evaluation
- feedback
Other
- Category: science
- Type: research article
Summary
This study examines the effectiveness of large language models (LLMs) in automated essay scoring compared to human grading. The evaluation reveals that LLMs exhibited weak agreement with human scores, with the degree of agreement influenced by essay characteristics. Specifically, LLMs tended to assign higher scores to shorter or less developed essays and lower scores to longer essays with minor errors. Furthermore, the feedback generated by LLMs was consistent with their grading: essays that received more positive feedback also received higher scores. These findings indicate that while LLMs can provide consistent feedback, their scoring does not align well with human evaluation, suggesting limited effectiveness in replicating human grading behavior. Nonetheless, LLMs can assist in the essay scoring process.
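The study does not name the agreement metric it used; in automated essay scoring research, quadratic weighted kappa (QWK) is the standard choice, so a minimal NumPy sketch of it is shown here as an illustration only, assuming integer scores in the range 0..n_classes-1:

```python
import numpy as np

def quadratic_weighted_kappa(human, model, n_classes):
    """Quadratic weighted kappa between two integer score vectors (0..n_classes-1)."""
    human = np.asarray(human)
    model = np.asarray(model)
    # Observed confusion matrix of human vs. model scores
    O = np.zeros((n_classes, n_classes))
    for h, m in zip(human, model):
        O[h, m] += 1
    # Expected matrix under independence, from the marginal score distributions
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    # Quadratic disagreement weights: penalty grows with squared score distance
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()

human = [3, 2, 4, 1, 3, 2]
model = [3, 3, 2, 1, 4, 2]
kappa = quadratic_weighted_kappa(human, model, n_classes=5)  # 1.0 would mean perfect agreement
```

A QWK near 0 indicates agreement no better than chance, which is how "weak agreement with human scores" is typically operationalized.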
Signals
| name | description | change | 10-year | driving-force | relevancy |
|---|---|---|---|---|---|
| LLM Grading Agreement Issues | LLMs show weak agreement with human grades across various essay characteristics. | From reliance on human grading to use of automated essay scoring models. | Automated essay scoring could become a common tool but may lack human-like grading quality. | Increasing demand for efficient, scalable grading solutions in education. | 4 |
| Variation in LLM Responses | LLMs assign different scores based on essay length and quality, diverging from human grading. | Shift from uniform grading standards to variable scoring by AI models. | Grading paradigms may evolve to accommodate AI-driven assessment metrics. | Need for personalized feedback mechanisms in educational assessments. | 5 |
| Feedback Consistency | LLMs provide feedback that correlates with their scoring patterns, indicating coherence in their assessment process. | Transition from traditional feedback methods to AI-driven, consistent feedback systems. | Feedback mechanisms in education will increasingly rely on AI for consistency and reliability. | Push for standardized, repeatable feedback processes in educational settings. | 4 |
| Challenges for Complex Essays | LLMs assign lower scores to longer essays with minor errors, potentially misjudging quality. | From nuanced human evaluation to AI scoring that may overlook complexity. | Future grading may favor shorter, error-free content over comprehensive ideas. | AI efficiency must be balanced against the need for critical, comprehensive evaluation. | 3 |
| Essay Scoring Automation | LLMs can reliably assist in essay scoring despite limitations in human-like grading. | Gradual acceptance of AI in essay grading as a supportive tool rather than a complete solution. | AI may play a central role in education through hybrid grading systems combining AI and human raters. | Advancements in AI technologies and educational resource constraints. | 5 |
Concerns
| name | description |
|---|---|
| Misalignment in Grading Standards | LLM-generated scores do not align with human grading, leading to potential inaccuracies in academic assessments. |
| Overestimation of Short Essays | LLMs tend to assign higher scores to underdeveloped essays, which could misrepresent student capabilities. |
| Underestimation of Longer Essays | Heightened scrutiny of minor errors in longer essays may unfairly penalize students who demonstrate deeper understanding. |
| Reliance on Praise and Criticism Patterns | Tying scores to the balance of positive and negative feedback may produce biased evaluations. |
| Underdeveloped AI Grading Tools | The inconsistency of LLM scores suggests they are not yet reliable tools for standardized evaluation. |
Behaviors
| name | description |
|---|---|
| LLM Grading Patterns | LLMs show distinctive scoring patterns, favoring shorter essays while penalizing longer ones with minor errors, diverging from human grading methods. |
| Praise-Criticism Correlation | LLM scores are consistent with their feedback: positive comments correlate with higher scores, and criticism correlates with lower scores. |
| Agreement Variability | Agreement between LLM and human grading varies with essay characteristics, indicating nuanced performance differences among LLMs. |
| Supportive Grading Tool | Despite discrepancies, LLMs can serve as reliable support tools for essay scoring because of their coherent feedback patterns. |
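The praise-criticism correlation described above could be checked empirically by correlating a sentiment measure of each feedback text with the assigned score. A minimal sketch, assuming a hypothetical keyword-count sentiment proxy and Pearson correlation (the study's actual sentiment measure is not specified):

```python
import numpy as np

# Hypothetical positive/negative cue words; a real analysis would use a
# proper sentiment model rather than keyword counts.
POSITIVE = {"clear", "strong", "well", "effective", "thoughtful"}
NEGATIVE = {"unclear", "weak", "errors", "lacks", "confusing"}

def sentiment_proxy(feedback: str) -> int:
    """Crude praise-minus-criticism count over lowercase tokens."""
    tokens = feedback.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

def praise_score_correlation(feedbacks, scores):
    """Pearson correlation between feedback sentiment and assigned scores."""
    x = np.array([sentiment_proxy(f) for f in feedbacks], dtype=float)
    y = np.array(scores, dtype=float)
    return np.corrcoef(x, y)[0, 1]
```

A strongly positive correlation would indicate the coherence the study reports: more praise, higher score.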
Technologies
| name | description |
|---|---|
| Large Language Models (LLMs) for Automated Essay Scoring | LLMs are evaluated for their effectiveness in assessing essays compared to human grading, revealing strengths and weaknesses in their scoring methodologies. |
| AI-Based Feedback Mechanisms | AI-generated feedback is consistent with scoring patterns, offering insights that can help students improve their writing. |
Issues
| name | description |
|---|---|
| LLM Agreement with Human Grading | The discrepancy between grades assigned by LLMs and human raters highlights a lack of alignment in grading methods. |
| Bias in Automated Grading Systems | LLMs may favor shorter essays or those with minimal errors, potentially leading to biased assessments. |
| Feedback Consistency of LLMs | LLMs generate feedback that aligns with their scoring, raising questions about the quality and reliability of their evaluations. |
| Dependency on Essay Characteristics | LLM grading performance varies with essay length and quality, indicating a need for targeted training. |
| Role of AI in Education | Using LLMs for essay scoring may change traditional grading practices and raises ethical considerations. |
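The length-related bias raised above can be quantified as a correlation between essay length and the gap between LLM and human scores. A hypothetical sketch (the data and variable names are illustrative, not from the study):

```python
import numpy as np

def length_bias(word_counts, llm_scores, human_scores):
    """Correlation between essay length and the LLM-minus-human score gap.

    A negative value would match the reported pattern: longer essays
    scored lower by the LLM relative to human raters.
    """
    gap = np.array(llm_scores, dtype=float) - np.array(human_scores, dtype=float)
    return np.corrcoef(np.array(word_counts, dtype=float), gap)[0, 1]
```

Disaggregating agreement by essay features in this way is one route to the "specific training" the Issues table calls for.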