OpenAlex · Updated hourly · Last updated: 29 Mar 2026, 12:17

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Using Diagnostic Probing to Expose the Shallow Syntactic and Semantic Foundations of ChatGPT as a Large Language Model

2026 · 0 citations · Zenodo (CERN European Organization for Nuclear Research) · Open Access
Open full text at the publisher

Citations: 0
Authors: 1
Year: 2026

Abstract

The remarkable conversational fluency of OpenAI's ChatGPT often creates an illusion of deep linguistic understanding, prompting its adoption across diverse sectors. This study critically evaluates this purported knowledge by implementing a comprehensive battery of diagnostic probes grounded in theoretical linguistics. We designed a multi-phase series of controlled experiments targeting core syntactic phenomena, including hierarchical agreement, syntactic islands, and binding theory, alongside semantic phenomena such as logical operators, quantifier scope, and presupposition. The study evaluated both GPT-3.5-turbo and GPT-4 models via the OpenAI API using forced-choice grammaticality judgments, plausibility assessments, and Chain-of-Thought (CoT) analysis to measure accuracy, stability, and reasoning soundness. Quantitative results revealed significant performance degradation on complex linguistic structures, with GPT-3.5's accuracy falling to 67% on long-range dependencies and 42% on quantifier scope. While GPT-4 demonstrated quantitatively superior performance, it exhibited qualitatively similar failure patterns, indicating that scaling alone does not address fundamental limitations. Qualitative analysis of reasoning chains revealed frequent post-hoc rationalization, associative drift, and a reliance on surface-level pattern matching rather than sound logical deduction. The findings robustly demonstrate that ChatGPT's linguistic knowledge is shallow, statistically driven, and non-causal, failing to reliably implement abstract grammatical rules or compositional semantics. We conclude that a paradigm shift in large language model (LLM) evaluation is necessary, moving from broad, aggregate benchmarks to targeted, causal probes that diagnose specific architectural limitations. These findings have significant implications for AI safety, reliability, and the future development of genuinely intelligent systems, underscoring the need for architectural innovations beyond mere scaling of parameters and data.
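As a concrete illustration of the forced-choice grammaticality-judgment protocol the abstract describes, the sketch below builds minimal pairs for long-range subject-verb agreement, formats a two-alternative prompt, and scores responses. The sentence pairs, prompt wording, and scoring are illustrative assumptions, not the paper's actual materials; the model call itself is only indicated in a comment.

```python
# Minimal sketch of a forced-choice grammaticality probe for
# long-range subject-verb agreement (illustrative, not the paper's stimuli).

# Each pair: (grammatical, ungrammatical) — the verb must agree with the
# head noun across an intervening relative clause.
MINIMAL_PAIRS = [
    ("The keys that the man lost were on the table.",
     "The keys that the man lost was on the table."),
    ("The author that the critics praised writes novels.",
     "The author that the critics praised write novels."),
]

def make_prompt(pair):
    """Format a two-alternative forced-choice prompt.

    Here option A is always the grammatical sentence; a real experiment
    would randomize the order to control for position bias.
    """
    grammatical, ungrammatical = pair
    return (
        "Which sentence is grammatical? Answer with exactly 'A' or 'B'.\n"
        f"A: {grammatical}\nB: {ungrammatical}"
    )

def accuracy(responses, gold):
    """Proportion of model responses matching the gold labels."""
    correct = sum(r == g for r, g in zip(responses, gold))
    return correct / len(gold)

if __name__ == "__main__":
    for pair in MINIMAL_PAIRS:
        print(make_prompt(pair))
        print("---")
    # In a live run, each prompt would be sent to the model via the
    # OpenAI API and the returned 'A'/'B' collected into a list, e.g.:
    #   responses.append(model_choice.strip().upper())
    # Stability could then be measured by repeating each item and
    # comparing accuracy across runs.
    print(accuracy(["A", "B"], ["A", "A"]))
```

Aggregating such per-phenomenon accuracies (rather than one benchmark-wide score) is what lets a probe battery localize failures like the long-range-dependency and quantifier-scope drops the study reports.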

Topics

Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)