OpenAlex · Updated hourly · Last updated: 17.03.2026, 18:40

This is an overview page with metadata about this scholarly work. The full article is available from the publisher.

¿Es bueno confiar en recomendaciones de la inteligencia artificial basadas en guías clínicas? [Is it good to trust artificial intelligence recommendations based on clinical guidelines?]

2025 · 0 citations · Biomedica (INSTITUTO NACIONAL DE SALUD) · Open Access

Citations: 0 · Authors: 10 · Year: 2025

Abstract

Introduction. Medical staff often face difficulties in consulting and applying clinical guidelines in practice. Large language models, especially when combined with retrieval-augmented generation, may help overcome these challenges by producing context-specific outputs with improved adherence to medical guidelines.

Objectives. To assess the performance of commercial large language models in answering maternal health questions within retrieval-augmented generation systems, using both human and automated evaluation metrics.

Materials and methods. A controlled experiment was designed to obtain accurate, consistent answers from a retrieval-augmented generation system based on Colombian maternal care guidelines. A physician formulated ten questions and defined the ground-truth answers. Various large language models were tested with a standardized prompt and evaluated through binary answer–concept ranking and retrieval-augmented generation assessment metrics, judged by two independent large language models.

Results. Generative pre-trained transformer 3.5 (GPT-3.5) achieved the highest physician-assessed accuracy (0.90). Claude 3.5 obtained the top faithfulness score (0.78) under GPT-4o evaluation, while Mistral ranked highest (0.84) under Claude 3.5 evaluation. Regarding answer relevance, GPT-3.5 scored highest under both judges (0.94 and 0.86).

Conclusions. Integrating retrieval-augmented generation into obstetric care has the potential to enhance evidence-based practices and improve patient outcomes. However, rigorous validation of accuracy and context-specific reliability is essential before clinical deployment. The findings of this study indicate that large-scale models (e.g., GPT-3.5, Claude, Llama 70B) consistently outperform lighter models such as Llama 8B.
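As a reading aid for the accuracy figure, the following is a minimal sketch of how a binary, physician-assessed accuracy could be computed. It assumes accuracy is simply the fraction of the ten questions judged correct, which the abstract does not state explicitly. The function name `binary_accuracy` and the example judgments are hypothetical and are not data from the study.

```python
def binary_accuracy(judgments: list[bool]) -> float:
    """Return the fraction of answers judged correct.

    Assumption: each answer receives a single correct/incorrect
    judgment, and accuracy is the share of correct judgments.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    return sum(judgments) / len(judgments)


# Hypothetical judgments for ten questions (illustrative only, not study data):
# nine answers judged correct and one judged incorrect.
example_judgments = [True] * 9 + [False]
print(binary_accuracy(example_judgments))  # 0.9
```

Under this assumption, a score of 0.90 on ten questions would correspond to nine answers judged correct. With so few questions, each answer shifts the score by 0.10, so small differences between models should be read cautiously.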
