This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Beyond BLEU: GPT-5, Human Judgment, and Classroom Validation for Multidimensional Machine Translation Evaluation
Citations: 0
Authors: 6
Year: 2025
Abstract
Recent progress in large language models (LLMs) has rekindled the promise of high-quality machine translation (MT), yet evaluation remains a bottleneck. Traditional automatic metrics (e.g., BLEU) are fast but fail to capture semantic and pragmatic nuances reflected in human judgments. We present a multidimensional framework—inspired by MQM—that augments standard metrics (Adequacy, Fluency) with three linguistic dimensions: Morphosyntactic, Semantic, and Pragmatic. We compare three Small Language Models for English→Indonesian: Qwen 3 (0.6B), LLaMA 3.2 (3B), and Gemma 3 (1B). Two controlled experiments are conducted: (i) Preliminary (1,000 translations, GPT-5-only scoring of Adequacy/Fluency + BLEU), and (ii) Final (100 translations, three human experts + GPT-5) on all five metrics. We compute inter-annotator reliability (Krippendorff’s α, weighted κ) and annotator competence (MACE). Results show consistent model ranking (Gemma 3 (1B) > LLaMA 3.2 (3B) > Qwen 3 (0.6B)) and strong GPT-5–human correlation (r = 0.822). To validate practical applicability, a classroom study with 26 translation students tested the metrics in real learning settings. Using the same multidimensional rubric, students rated MT outputs across pre-, post-, and final-test phases. Their mean absolute error (MAE) decreased from 0.97 to 0.83, while Exact Match Rate increased from 0.30 to 0.50 after rubric calibration, demonstrating that the proposed framework and GPT-5 evaluation can be effectively transferred to educational contexts for evaluator training and feedback alignment.
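The abstract reports two classroom agreement measures: mean absolute error (MAE) and Exact Match Rate between student ratings and reference ratings. As a minimal sketch of how these two quantities are computed, the snippet below applies both to a pair of score lists; the example scores are hypothetical and are not data from the study.

```python
# Illustrative sketch (not from the paper): the two classroom agreement
# measures named in the abstract, computed over paired score lists.
# All example ratings below are hypothetical.

def mae(student, reference):
    """Mean absolute error between two equal-length score lists."""
    assert len(student) == len(reference)
    return sum(abs(s - r) for s, r in zip(student, reference)) / len(student)

def exact_match_rate(student, reference):
    """Fraction of items where the student score equals the reference score."""
    assert len(student) == len(reference)
    matches = sum(1 for s, r in zip(student, reference) if s == r)
    return matches / len(student)

# Hypothetical 1-5 rubric scores for five MT outputs.
student_scores   = [4, 3, 5, 2, 4]
reference_scores = [4, 4, 5, 3, 5]

print(mae(student_scores, reference_scores))               # 0.6
print(exact_match_rate(student_scores, reference_scores))  # 0.4
```

A falling MAE with a rising Exact Match Rate, as reported after rubric calibration, indicates that student scores moved both closer to and more often onto the reference scores.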
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,496 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,386 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,848 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,562 citations