OpenAlex · Updated hourly · Last updated: 13.05.2026, 13:22

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Towards evaluating and building versatile large language models for medicine

2025 · 29 citations · npj Digital Medicine · Open Access

Citations: 29 · Authors: 8 · Year: 2025

Abstract

In this study, we present MedS-Bench, a comprehensive benchmark for evaluating large language models (LLMs) in clinical contexts, spanning 11 high-level clinical tasks. We evaluated nine leading LLMs, including MEDITRON, Llama 3, Mistral, GPT-4, and Claude-3.5, and found that most models struggle with these complex tasks. To address these limitations, we developed MedS-Ins, a large-scale instruction-tuning dataset for medicine. MedS-Ins comprises 58 medically oriented language corpora, totaling 5M instances with 19K instructions across 122 tasks. To demonstrate the dataset's utility, we conducted a proof-of-concept experiment by instruction-tuning a lightweight, open-source medical language model. The resulting model, MMedIns-Llama 3, significantly outperformed existing models on various clinical tasks. To promote further advancements, we have made MedS-Ins fully accessible and invite the research community to contribute to its expansion. Additionally, we have launched a dynamic leaderboard for MedS-Bench to track the development progress of medical LLMs.

Similar works