This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Why high‐performing psychiatric machine learning models often fail in real‐world clinical settings?
Citations: 0
Authors: 4
Year: 2026
Abstract
Psychiatry is witnessing rapid growth in machine learning models that claim to estimate risk, prognosis, and treatment response from routinely collected clinical, cognitive, neurophysiological, and biological measures. These models are often presented as clinically actionable tools. However, across the literature, many are developed and evaluated within the same datasets, remain at high risk of bias, and are rarely tested under truly independent conditions. Performance metrics in these cases reflect how well the model fits the development data, not whether it will remain reliable when applied to new patients and clinical settings.1, 2 Existing general AI standards emphasize methodological transparency, but they rarely confront the structural challenges unique to psychiatric prediction. Here, we provide a psychiatry-focused framing of common failure modes and propose minimum reporting expectations aligned with clinical claims.

A recurring concern is that model development and evaluation are not sufficiently separated. When model structure and parameter tuning are influenced by the same data used to report final performance, apparent accuracy is inflated: the reported metrics describe fit to the development data rather than reliability in new patients and settings. The appropriate response is not to abandon predictive modeling, but to establish explicit correspondence between the level of validation achieved and the strength of claims permitted: exploratory conclusions may follow internal validation; claims of generalizability within related environments require structured internal separation (site-, temporal-, or participant-level splits); and claims of clinical decision support require external validation in truly independent settings.3 When external validation is not yet feasible, structured internal strategies such as site-, temporal-, or participant-level separation can approximate deployment conditions, but they do not provide the same level of evidentiary support as fully independent external validation.4

Even models that satisfy all robustness criteria (structured validation, calibration, and external generalization) may fail to change clinical practice. A well-calibrated model can still be ignored if clinicians distrust it, if implementation disrupts workflow, or if predicted risk does not align with treatment options. This gap between statistical robustness and clinical utility is critical in psychiatry, where clinical judgment remains central to decision-making.

Three methodological practices require urgent change. First, random data splits must be replaced with structured separation that reflects real-world deployment: site-level splits test whether performance generalizes across clinical settings; temporal splits test stability over time; and participant-level splits test whether models work for genuinely new individuals. Second, model development must follow pre-specification: defining the model structure and evaluation strategy before accessing test data prevents the selective reporting and post hoc tuning that systematically inflate apparent accuracy. Third, calibration must be explicitly assessed and reported. Models frequently achieve high discrimination, typically quantified by the area under the receiver operating characteristic curve (AUC), while producing poorly calibrated risk estimates, undermining clinical decision-making.
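As a purely illustrative sketch of the first point (not taken from the Letter itself), the scikit-learn snippet below contrasts a random split with a site-level split in which whole clinical sites are held out; the synthetic data, the `site` labels, and the logistic-regression model are assumptions made only for demonstration.

```python
# Minimal sketch, assuming synthetic data: random split vs. site-level split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
site = rng.integers(0, 6, size=len(y))  # hypothetical clinical-site labels

# Random split: patients from the same site can appear in both train and test.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
auc_random = roc_auc_score(
    y_te, LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
)

# Site-level split: entire sites are held out, approximating deployment at a new site.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=site))
model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
auc_site = roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1])

print(f"Random-split AUC:     {auc_random:.3f}")
print(f"Site-level-split AUC: {auc_site:.3f}")
```

With real multi-site clinical data, the site-level estimate is typically the more honest indicator of how the model would behave at a previously unseen clinic.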
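In the same hedged spirit, the third point can be made concrete by reporting discrimination and calibration side by side; again, the data and model below are synthetic placeholders, not the authors' pipeline.

```python
# Minimal sketch, assuming synthetic data: report AUC together with calibration.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]

# Discrimination: can the model rank cases above non-cases?
print(f"AUC:   {roc_auc_score(y_te, prob):.3f}")
# Calibration: do predicted risks match observed event frequencies?
print(f"Brier: {brier_score_loss(y_te, prob):.3f}")
frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted risk {p:.2f} -> observed rate {f:.2f}")
```

A predicted risk of 0.20 should correspond to roughly 20% observed events; large gaps in such a table indicate miscalibration even when the AUC looks impressive.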
Model probability estimates must be evaluated during development and reexamined under external validation. Without these safeguards, robust models risk becoming non-actionable tools that consume resources without improving outcomes, or eroding trust when they fail to deliver promised benefits.5

These concerns are not merely technical. Psychiatric prediction is a double-edged tool: when optimistic internal performance is mistaken for clinical reliability, false positives may lead to unnecessary monitoring and intervention, while false negatives may delay care. At a systems level, non-generalizable models can misdirect resources and distort service planning. Without calibration, uncertainty estimates, and clear limits of applicability, decision support risks reinforcing automation bias and amplifying inequities. Poor pipeline discipline can therefore turn a promising translational tool into a non-actionable result, or into a source of inequity when errors concentrate in subgroups that differ from the training distribution.

We therefore propose that PCN and similar journals adopt psychiatry-focused minimum reporting standards for machine learning (Table 1) that align evidentiary strength with the level of clinical claims made. In a field characterized by diagnostic heterogeneity, probabilistic outcomes, and ethical sensitivity to error distribution, generic AI guidance is insufficient. By enforcing proportional validation, transparent reporting, and explicit limits of applicability, journals can help safeguard the boundary between exploratory modeling and clinical decision support. These standards do not restrain innovation; they legitimize it by ensuring that reported performance reflects durable signal rather than analytic artifact. The practical implication is unequivocal: psychiatric machine learning will be judged not by statistical impressiveness, but by its reliability, fairness, and measurable benefit in real-world care.6

This Letter to the Editor does not contain any studies involving human participants or animals performed by the authors and does not involve human participants, patient data, or experimental procedures. Ethical approval and informed consent were therefore not required. The authors report no conflicts of interest. Data sharing is not applicable to this article, as no datasets were generated or analysed during the current study.
Related works
Amazon's Mechanical Turk
2011 · 10,015 citations
The Transtheoretical Model of Health Behavior Change
1997 · 7,640 citations
COVID-19 and mental health: A review of the existing literature
2020 · 3,698 citations
Cognitive Therapy and the Emotional Disorders
1977 · 2,931 citations
Mental health problems and social media exposure during COVID-19 outbreak
2020 · 2,781 citations