OpenAlex · Updated hourly · Last updated: 18.03.2026, 01:05

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Beyond structured knowledge: performance boundaries of ChatGPT in geological-hazard question answering and the need for human-in-the-loop oversight

2026 · 0 citations · Frontiers in Earth Science · Open Access

Citations: 0 · Authors: 9 · Year: 2026

Abstract

Large language models have shown promise across specialized domains, but their performance limits in disaster risk reduction remain poorly understood. We conduct a version-specific evaluation of ChatGPT-4o for geological-hazard question answering using a transparent, rubric-based design. Sixty questions spanning six task categories (C1–C6) were posed within a fixed time window under a controlled single-turn protocol, and eight evaluators with geohazard expertise independently rated each response on six capability dimensions (D1 Knowledge Coverage; D2 Comprehension and Reasoning; D3 Accuracy and Rigor; D4 Critical Thinking; D5 Application and Context Adaptability; D6 Innovation and Knowledge Expansion). Scores were assigned on a continuous 0–1 scale, with 0, 0.5, and 1 used as anchor points to guide interpretation. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC). Performance was consistently higher on structured knowledge tasks (defined here as questions with well-established concepts, factual grounding, or clearly bounded reasoning paths; C1 = 0.827, C2 = 0.797, C3 = 0.818) than on open-ended tasks (C4–C6 mean = 0.591). Across dimensions, scores were highest for D1 (0.868), D2 (0.864), and D3 (0.830), and lowest for D4 (0.578) and D6 (0.550). Overall agreement was good (ICC(3,k) = 0.8095), while agreement decreased for more subjective tasks and dimensions. The study provides (i) a baseline, version-specific appraisal of GPT-4o in geohazard-related QA, (ii) a transferable rubric-based workflow for evaluating domain LLMs, and (iii) evidence that human oversight remains essential when such systems are used to support safety-critical disaster risk reduction decisions.
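The reported reliability statistic, ICC(3,k), is the two-way mixed-effects, consistency, average-measures intraclass correlation. A minimal sketch of how it can be computed from an n-subjects × k-raters score matrix is shown below; the `demo` data are illustrative, not the study's ratings.

```python
import numpy as np

def icc_3k(ratings: np.ndarray) -> float:
    """ICC(3,k): two-way mixed model, consistency, average of k raters.

    ratings: (n_subjects, k_raters) matrix of scores.
    Formula: (MS_rows - MS_error) / MS_rows from the two-way ANOVA
    decomposition, where rows are subjects and columns are raters.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)                   # per-subject means
    col_means = ratings.mean(axis=0)                   # per-rater means
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()     # between-subjects
    ss_cols = n * ((col_means - grand) ** 2).sum()     # between-raters
    ss_error = ss_total - ss_rows - ss_cols            # residual
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows

# Illustrative data: three subjects, two raters whose scores differ only
# by a constant offset -> perfect consistency, so ICC(3,k) equals 1.0
demo = np.array([[1.0, 1.5],
                 [2.0, 2.5],
                 [3.0, 3.5]])
print(round(icc_3k(demo), 4))  # -> 1.0
```

Because ICC(3,k) measures consistency, a fixed rater offset (one rater systematically scoring 0.5 higher) does not reduce agreement; only disagreement in the *ordering and spacing* of subjects does.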
