This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Is One Run Enough? Reproducibility of Flagship Large Language Models Across Temperature and Reasoning Settings in Biomedical Text Processing
Citations: 0
Authors: 7
Year: 2026
Abstract
Purpose: To quantify run-to-run reproducibility of Gemini 3 Flash Preview and GPT-5.2 for biomedical trial-success classification across temperature and reasoning/thinking settings, and to assess whether single-run reporting is sufficient.

Methods: We used 250 randomized controlled oncology trial abstracts labeled POSITIVE/NEGATIVE based on primary endpoint success. With a fixed prompt requiring exactly "POSITIVE" or "NEGATIVE", we evaluated Gemini across thinking levels (minimal, low, medium, high) and temperatures from 0.0 to 2.0, and GPT-5.2 across reasoning-effort levels (none to xhigh), with an additional temperature sweep when reasoning was disabled. Each setting was run three times. Reproducibility was quantified with Fleiss' κ across the three replicates, performance was summarized with F1 (per run and for the majority vote across runs), and invalid-format outputs were recorded.

Results: Gemini showed near-perfect agreement across settings (κ = 0.942 to 1.000), including perfect agreement at temperature 0. Invalid outputs were uncommon (0 to 1.5%). GPT-5.2 reproducibility was similarly high (κ = 0.984 to 0.995), with no invalid outputs. Performance remained stable (mean/majority-vote F1 = 0.955 to 0.971), and majority voting offered only marginal gains.

Conclusion: For strict binary biomedical classification with tightly constrained outputs, both models were highly reproducible across common decoding and reasoning configurations, indicating that one run is often adequate, while minimal replication provides a practical stability check.
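The metrics named in the Methods can be illustrated with a minimal sketch. This is not the authors' code: the function names, the choice of POSITIVE as the F1 target class, and the convention of returning κ = 1.0 when every rating falls in a single category are all assumptions made for illustration. The abstract does not say how invalid-format outputs entered the κ computation; here any non-matching string simply counts as its own category.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`: one inner list per abstract,
    holding one label per replicate run (same number of runs each)."""
    n_items = len(ratings)
    n_runs = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_runs)
        / (n_runs * (n_runs - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    total = n_items * n_runs
    p_e = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    if p_e == 1.0:
        # Assumed convention: every rating identical, report full agreement.
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)


def majority_vote(ratings):
    """Most frequent label per abstract across the replicate runs."""
    return [Counter(item).most_common(1)[0][0] for item in ratings]


def f1_score(gold, predicted, positive="POSITIVE"):
    """Binary F1 with `positive` as the target class (an assumption)."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, predicted))
    fp = sum(g != positive and p == positive for g, p in zip(gold, predicted))
    fn = sum(g == positive and p != positive for g, p in zip(gold, predicted))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data (made up): four abstracts, three runs each.
runs = [
    ["POSITIVE", "POSITIVE", "POSITIVE"],
    ["NEGATIVE", "NEGATIVE", "NEGATIVE"],
    ["POSITIVE", "POSITIVE", "NEGATIVE"],
    ["NEGATIVE", "NEGATIVE", "NEGATIVE"],
]
gold = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]

kappa = fleiss_kappa(runs)
voted = majority_vote(runs)
voted_f1 = f1_score(gold, voted)
```

With three runs and two valid labels, a majority always exists, which is one reason an odd replicate count is convenient. A single disagreeing run on one of four toy abstracts already pulls κ well below 1, so the reported range of 0.942 to 1.000 on 250 abstracts corresponds to very few label flips.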