OpenAlex · Updated hourly · Last updated: 13.03.2026, 10:07

This is an overview page with metadata about this scholarly work. The full article is available from the publisher.

Do Large Language Models Read or Remember? Analyzing LLM Performance in Biomedical Text Mining With Progressive Content Removal and Counterfactual Results

2026 · 0 citations · Open Access
Open full text at the publisher

Citations: 0
Authors: 7
Year: 2026

Abstract

Purpose: Large language models (LLMs) can classify biomedical documents accurately, but strong performance does not prove they are using the supplied text rather than identifier-triggered parametric knowledge. We tested whether oncology trial-success classification reflects “reading” of abstract evidence or “remembering” of known trials.

Methods: We used a corpus of 250 two-arm oncology randomized controlled trials from seven major journals (2005–2023) and asked the flagship models of three commercial vendors (OpenAI, Google, and Anthropic) to output a single label indicating whether the primary endpoint was met. For each trial we created five deterministic inputs: title+abstract (baseline), title-only, DOI-only, a counterfactual title+abstract with the primary endpoint outcome minimally flipped, and the same counterfactual title+abstract paired with the original DOI to induce an identifier-text conflict.

Results: With the full title+abstract, models achieved near-ceiling performance (accuracy and F1 score 0.96–0.97) and high format adherence (97.2–100%). Performance degraded stepwise with content removal (title-only accuracy and F1 score 0.79–0.88; DOI-only 0.63–0.67), consistent with an above-chance identifier-driven signal. Under counterfactual results, models followed the edited evidence (accuracy and F1 score 0.96–0.99 against inverted labels). Adding the real DOI minimally affected GPT (accuracy and F1 score ≈ 0.99) but modestly reduced Gemini (≈ 0.97) and Claude (≈ 0.95), mainly via lower sensitivity.

Conclusion: LLMs robustly track explicit endpoint statements in abstracts, yet identifiers can support above-chance predictions and occasionally compete with textual evidence. Progressive ablations plus counterfactual conflicts provide a practical, reproducible audit for grounding in biomedical LLM evaluations.
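The audit protocol in the Methods section can be sketched in a few lines. The snippet below builds the five per-trial input conditions and computes the accuracy and F1 metrics the abstract reports; the record field names (`title`, `abstract`, `doi`, `cf_abstract`) and the `"met"`/`"not met"` labels are assumptions for illustration, not the paper's actual schema.

```python
# Hedged sketch of the five deterministic ablation conditions and the
# metrics used to score model labels. Field names are hypothetical.

def build_conditions(trial):
    """Return the five per-trial inputs of the progressive-ablation audit."""
    return {
        "title_abstract": f"{trial['title']}\n\n{trial['abstract']}",      # baseline
        "title_only": trial["title"],                                      # content removed
        "doi_only": trial["doi"],                                          # identifier only
        "counterfactual": f"{trial['title']}\n\n{trial['cf_abstract']}",   # outcome flipped
        "counterfactual_doi":                                              # identifier-text conflict
            f"{trial['doi']}\n\n{trial['title']}\n\n{trial['cf_abstract']}",
    }

def accuracy_and_f1(y_true, y_pred, positive="met"):
    """Accuracy and F1 for the binary primary-endpoint-met label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, f1
```

Scoring the counterfactual conditions then amounts to evaluating predictions against the inverted labels, as the Results section describes.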


Topics

Artificial Intelligence in Healthcare and Education · Computational and Text Analysis Methods · Text Readability and Simplification