This is an overview page with metadata for this scientific work. The full article is available from the publisher.
From Dr. Google to Dr. ChatGPT: Evaluating the Performance of Large Language Models as a Tool for Primary Care Consultations (Preprint)
Citations: 0
Authors: 10
Year: 2026
Abstract
<sec> <title>BACKGROUND</title> Online health information seeking (OHIS) plays a vital role in individuals’ self-management of health. Such activities include understanding the symptomatology of a potential illness, improving health-related habits, assessing possible risks, and determining whether to seek medical care. Since the release of the first ChatGPT model in 2022, large language models (LLMs) have evolved significantly, and an increasing number of users now turn to these generative information systems for inquiries as sensitive and consequential as those related to health. </sec> <sec> <title>OBJECTIVE</title> This study presents a detailed analysis of the performance of widely used contemporary large language models in addressing common health-related queries posed by online users. The primary objective is to identify the main strengths and weaknesses of generative AI systems when responding to information needs as critical as those arising in the health domain. </sec> <sec> <title>METHODS</title> The study was structured in a question–answer format, in which each question corresponded to a user query and each answer represented the output generated by a model in response. The set of queries was derived from the most frequently searched terms on a major web search engine, reflecting real users’ health-related information needs. The study employed a human evaluation framework involving two distinct panels of clinical experts from different specialties. The first panel selected the queries deemed most relevant and clinically significant. The selected queries were then submitted to different LLMs, after which the second panel of experts evaluated the generated responses. The evaluation criteria encompassed three dimensions: adherence to medical consensus; presence or absence of inappropriate or incorrect information; and the potential to cause harm to users. ChatGPT-4, Llama 3, and MedLlama 3 were selected as three representative systems for the experiments.
</sec> <sec> <title>RESULTS</title> The findings indicate that the models performed reasonably well across the three evaluated dimensions. Based on aggregated statistics of the three models analyzed, 80.4% of the responses adhered to medical consensus, 85.0% provided clinically accurate information, and 100.0% posed no potential harm to users. GPT-4 and MedLlama 3 demonstrated superior performance compared to the base Llama 3 model, primarily due to Llama 3’s higher proportion of clinically incorrect responses and tendency to generate ambiguous answers in which medical consensus was not clearly reflected. Despite these relatively strong performance metrics, the healthcare domain requires particularly high standards; therefore, effectiveness levels of 80–90% remain insufficient for deployment in clinical environments. </sec> <sec> <title>CONCLUSIONS</title> This study reinforces the potential of LLMs as tools for online health information seeking among non-expert users. However, the performance limitations identified underscore the need for further studies to monitor the future development of these models. Moreover, the use of generative AI systems by individuals without medical expertise should remain limited to supportive or preliminary information-gathering purposes and should never replace consultation with a healthcare professional. </sec>
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,339 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,211 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,614 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,478 citations