OpenAlex · Updated hourly · Last updated: 24.03.2026, 13:54

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Are Today’s Large Language Models Ready to Take Over SNOMED Coding: A pilot study. (Preprint)

2025 · 0 citations · Open Access
Open full text at the publisher

Citations: 0
Authors: 5
Year: 2025

Abstract

High-quality semantic coding to standardized terminologies such as SNOMED CT is essential for interoperability and for the secondary use of health data in research and artificial intelligence applications. This task is typically performed manually by trained coders, but recent advances in large language models (LLMs) have raised the question of whether these models can approximate expert-level performance. This study explores the capabilities of LLMs in semantic mapping, using the German Corona Consensus Dataset (GECCO), a harmonized dataset used in COVID-19 research, as a benchmark. We selected 22 medical elements from GECCO, each previously coded to SNOMED CT by two experts. This high-confidence dataset served as the gold standard. Three prompting strategies were tested — few-shot examples, chain-of-thought prompting, and reasoning-based prompts — applied to LLMs frequently used in medical applications, including Claude 3.5 Haiku, ChatGPT-4o, Perplexity (default configuration), Gemini Flash 2.5, and DeepSeek V3. In addition, we evaluated two open-weight models, LLaMA 2 7B Chat and Mixtral 7B Instruct, which we ran locally using the Ollama framework. The models were instructed to provide fully specified names (FSNs) and SNOMED CT IDs (SCTIDs). Model outputs were compared to the GECCO reference using the ISO/TS 21564 Equivalence Assessment Score. Statistical comparisons were performed using the Mann–Whitney U test. None of the LLMs matched the performance of the human-coded gold standard. The best result, achieved by ChatGPT-4o with a reasoning-based prompt, yielded a mean ISO score of 2.95. However, only 31% of outputs matched both the FSN and the SCTID correctly. All models generated hallucinated results. The locally run models failed to produce a single fully correct coding.
In conclusion, while LLMs show potential for supporting medical coding, their current limitations — particularly hallucination and lack of precision — render them unsuitable for unsupervised use. High-quality coding still requires expert validation. Domain-specific fine-tuning and rigorous benchmarking will be crucial for future improvements.
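The evaluation described above — ordinal equivalence scores per coded element, compared between models with the Mann–Whitney U test — can be sketched in plain Python. The score lists and the 1-to-5 grading convention below are illustrative assumptions for demonstration, not data from the study, and the rank-sum implementation is a minimal stand-in for a statistics library.

```python
def ranks(values):
    """Assign 1-based ranks to values, averaging ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def mann_whitney_u(a, b):
    """Return the Mann-Whitney U statistic (smaller of U1, U2)."""
    combined = list(a) + list(b)
    r = ranks(combined)
    rank_sum_a = sum(r[:len(a)])
    u1 = rank_sum_a - len(a) * (len(a) + 1) / 2
    u2 = len(a) * len(b) - u1
    return min(u1, u2)

# Illustrative equivalence scores per element (assumed scale: 1 = equivalent,
# 5 = no match). These are NOT the study's actual results.
cloud_model_scores = [1, 1, 2, 3, 1, 2, 4, 1]
local_model_scores = [3, 4, 2, 5, 4, 3, 5, 4]

mean_cloud = sum(cloud_model_scores) / len(cloud_model_scores)
u = mann_whitney_u(cloud_model_scores, local_model_scores)
print(f"mean equivalence score (cloud model): {mean_cloud:.2f}")
print(f"Mann-Whitney U: {u}")
```

To obtain a p-value rather than only the U statistic, one would in practice reach for `scipy.stats.mannwhitneyu`, which also handles tie corrections and exact versus normal-approximation methods.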

Related works

Authors

Topics

Artificial Intelligence in Healthcare and Education · Biomedical Text Mining and Ontologies · Topic Modeling