OpenAlex · Updated hourly · Last updated: 13.03.2026, 16:32

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparing ChatGPT, Gemini, and Emerging LLMs in Low-Resource Educational Settings: Reasoning Quality, Consistency, and Explainability

2026 · 0 citations · IEEE Access · Open Access
Open full text at the publisher

Citations: 0 · Authors: 4 · Year: 2026

Abstract

Large Language Models such as ChatGPT, Gemini, and recent open-source systems have shown strong performance in natural language tasks. However, their reliability and pedagogical suitability in low-resource educational settings—especially in high-stakes multiple-choice exams that require explicit reasoning—remain underexplored. This paper benchmarks six LLMs—ChatGPT (GPT-4), Gemini 1.5 Pro, Microsoft Copilot, DeepSeek-V2 (16B), Qwen 3 (14B), and Phi-2 (2.7B)—on Turkish and Indonesian exam questions. The evaluation uses 2,000 items in total: 1,000 Turkish graduate admission questions and 1,000 Indonesian vocational questions. We assess (i) answer accuracy, (ii) justification quality (coherence, option comparison, reasoning depth, and Bloom-level alignment), and (iii) temporal consistency under repeated prompting. Expert-written rationales are used as a reference for human alignment. Quantitative analysis combines multilingual semantic similarity, lexical overlap, and coherence deviation measures, supported by SHAP-based feature attribution and UMAP visualization to examine explanation patterns across models. The results show clear variation in both correctness and justification behavior. GPT-4 produces the most consistently high-quality and pedagogically aligned explanations, while Gemini and Qwen 3 achieve competitive but less stable performance across cognitive categories. DeepSeek-V2 performs strongly on several reasoning metrics but shows lower temporal stability, whereas Copilot and Phi-2 are weaker on higher-order reasoning. These findings provide a practical evaluation framework for multilingual exam-style reasoning and highlight risks such as fluent but incorrect explanations and instability across runs, which should be considered when deploying LLMs in formal assessment contexts.
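The abstract mentions lexical overlap with expert rationales and temporal consistency under repeated prompting among its measures. The paper's exact formulas are not given on this page, so the following is a minimal illustrative sketch; the function names, the Jaccard choice for overlap, and the modal-answer proxy for consistency are assumptions, not the authors' implementation:

```python
from collections import Counter

def lexical_overlap(model_text: str, expert_text: str) -> float:
    """Token-level Jaccard overlap between a model explanation and an
    expert rationale -- one plausible lexical-overlap measure."""
    a = set(model_text.lower().split())
    b = set(expert_text.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def run_consistency(answers: list[str]) -> float:
    """Fraction of repeated runs that return the modal answer, a simple
    proxy for temporal consistency under repeated prompting."""
    if not answers:
        return 0.0
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers)
```

For example, a model that answers the same item as B, B, C, B across four runs would score 0.75 on this consistency proxy, flagging exactly the kind of instability across runs the abstract warns about.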

Topics

Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)