OpenAlex · Updated hourly · Last updated: 18.03.2026, 01:05

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Beyond structured knowledge: performance boundaries of ChatGPT in geological-hazard question answering and the need for human-in-the-loop oversight

2026 · 0 citations · Frontiers in Earth Science · Open Access

Citations: 0 · Authors: 9 · Year: 2026

Abstract

Large language models have shown promise across specialized domains, but their performance limits in disaster risk reduction remain poorly understood. We conduct a version-specific evaluation of ChatGPT-4o for geological-hazard question answering using a transparent, rubric-based design. Sixty questions spanning six task categories (C1–C6) were posed within a fixed time window under a controlled single-turn protocol, and eight evaluators with geohazard expertise independently rated each response on six capability dimensions (D1 Knowledge Coverage; D2 Comprehension and Reasoning; D3 Accuracy and Rigor; D4 Critical Thinking; D5 Application and Context Adaptability; D6 Innovation and Knowledge Expansion). Scores were assigned on a continuous 0–1 scale, with 0, 0.5, and 1 used as anchor points to guide interpretation. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC). Performance was consistently higher on structured knowledge tasks (defined here as questions with well-established concepts, factual grounding, or clearly bounded reasoning paths; C1 = 0.827, C2 = 0.797, C3 = 0.818) than on open-ended tasks (C4–C6 mean = 0.591). Across dimensions, scores were highest for D1 (0.868), D2 (0.864), and D3 (0.830), and lowest for D4 (0.578) and D6 (0.550). Overall agreement was good (ICC(3,k) = 0.8095), while agreement decreased for more subjective tasks and dimensions. The study provides (i) a baseline, version-specific appraisal of GPT-4o in geohazard-related QA, (ii) a transferable rubric-based workflow for evaluating domain LLMs, and (iii) evidence that human oversight remains essential when such systems are used to support safety-critical disaster risk reduction decisions.
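The reported reliability statistic, ICC(3,k), is the two-way mixed-effects, consistency, average-measures intraclass correlation. A minimal sketch of how it can be computed from an n-subjects × k-raters score matrix is shown below; the `demo` data are illustrative, not the study's ratings.

```python
import numpy as np

def icc_3k(ratings: np.ndarray) -> float:
    """ICC(3,k): two-way mixed model, consistency, average of k raters.

    ratings: (n_subjects, k_raters) matrix of scores.
    Formula: (MS_rows - MS_error) / MS_rows from the two-way ANOVA
    decomposition, where rows are subjects and columns are raters.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)                   # per-subject means
    col_means = ratings.mean(axis=0)                   # per-rater means
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()     # between-subjects
    ss_cols = n * ((col_means - grand) ** 2).sum()     # between-raters
    ss_error = ss_total - ss_rows - ss_cols            # residual
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows

# Illustrative data: three subjects, two raters whose scores differ only
# by a constant offset -> perfect consistency, so ICC(3,k) equals 1.0
demo = np.array([[1.0, 1.5],
                 [2.0, 2.5],
                 [3.0, 3.5]])
print(round(icc_3k(demo), 4))  # -> 1.0
```

Because ICC(3,k) measures consistency, a fixed rater offset (one rater systematically scoring 0.5 higher) does not reduce agreement; only disagreement in the *ordering and spacing* of subjects does.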
