This is an overview page with metadata for this scholarly article. The full article is available from the publisher.
Comparing ChatGPT, Gemini, and Emerging LLMs in Low-Resource Educational Settings: Reasoning Quality, Consistency, and Explainability
0
Citations
4
Authors
2026
Year
Abstract
Large Language Models such as ChatGPT, Gemini, and recent open-source systems have shown strong performance in natural language tasks. However, their reliability and pedagogical suitability in low-resource educational settings—especially in high-stakes multiple-choice exams that require explicit reasoning—remain underexplored. This paper benchmarks six LLMs—ChatGPT (GPT-4), Gemini 1.5 Pro, Microsoft Copilot, DeepSeek-V2 (16B), Qwen 3 (14B), and Phi-2 (2.7B)—on Turkish and Indonesian exam questions. The evaluation uses 2,000 items in total: 1,000 Turkish graduate admission questions and 1,000 Indonesian vocational questions. We assess (i) answer accuracy, (ii) justification quality (coherence, option comparison, reasoning depth, and Bloom-level alignment), and (iii) temporal consistency under repeated prompting. Expert-written rationales are used as a reference for human alignment. Quantitative analysis combines multilingual semantic similarity, lexical overlap, and coherence deviation measures, supported by SHAP-based feature attribution and UMAP visualization to examine explanation patterns across models. The results show clear variation in both correctness and justification behavior. GPT-4 produces the most consistently high-quality and pedagogically aligned explanations, while Gemini and Qwen 3 achieve competitive but less stable performance across cognitive categories. DeepSeek-V2 performs strongly on several reasoning metrics but shows lower temporal stability, whereas Copilot and Phi-2 are weaker on higher-order reasoning. These findings provide a practical evaluation framework for multilingual exam-style reasoning and highlight risks such as fluent but incorrect explanations and instability across runs, which should be considered when deploying LLMs in formal assessment contexts.
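As an illustration of the lexical-overlap component mentioned in the abstract, the sketch below scores a model explanation against an expert-written rationale using a unigram-overlap F1 (a ROUGE-1-style proxy). This is a minimal sketch under stated assumptions: the function name, the regex tokenization, and the F1 formulation are illustrative choices, not the paper's actual implementation.

```python
import re


def tokenize(text):
    # Lowercase word tokens; a simplification of whatever
    # tokenizer the study actually used.
    return re.findall(r"\w+", text.lower())


def lexical_overlap_f1(model_rationale, expert_rationale):
    """Unigram-overlap F1 between a model explanation and an
    expert rationale, computed over unique tokens."""
    m = set(tokenize(model_rationale))
    e = set(tokenize(expert_rationale))
    if not m or not e:
        return 0.0
    overlap = len(m & e)
    precision = overlap / len(m)
    recall = overlap / len(e)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical rationale pair for demonstration only.
expert = "Option B is correct because the passage states the cause explicitly."
model = "The passage explicitly states the cause, so option B is correct."
print(round(lexical_overlap_f1(model, expert), 2))  # → 0.9
```

A score near 1.0 indicates heavy wording overlap with the expert rationale; note that such surface metrics cannot by themselves detect the "fluent but incorrect" explanations the abstract warns about, which is why the study pairs them with semantic-similarity and coherence measures.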
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,214 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,071 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,429 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,418 citations