This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Clinical Large Language Model Evaluation by Expert Review (CLEVER): Framework Development and Validation

2025 · 4 citations · JMIR AI · Open Access

Abstract

Background: The proliferation of both general purpose and health care-specific large language models (LLMs) has intensified the challenge of effectively evaluating and comparing them. Data contamination plagues the validity of public benchmarks, self-preference distorts LLM-as-a-judge approaches, and there is a gap between the tasks used to test models and those used in clinical practice.

Objective: In response, we propose CLEVER (Clinical Large Language Model Evaluation-Expert Review), a methodology for blind, randomized, preference-based evaluation by practicing medical doctors on specific tasks.

Methods: We demonstrate the methodology by comparing GPT-4o (OpenAI) against 2 health care-specific LLMs, with 8 billion and 70 billion parameters, over 3 tasks: clinical text summarization, clinical information extraction, and question answering on biomedical research.

Results: Medical doctors prefer the medical model-small LLM trained by John Snow Labs over GPT-4o 45% to 92% more often on the dimensions of factuality, clinical relevance, and conciseness.

Conclusions: The models show comparable performance on open-ended medical question answering, suggesting that health care-specific LLMs can outperform much larger general purpose LLMs in tasks that require understanding of clinical context. We test the validity of CLEVER evaluations by conducting interannotator agreement, interclass correlation, and washout period analysis.

Topics

Topic Modeling · Text Readability and Simplification · Artificial Intelligence in Healthcare and Education
