Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language
0
Zitationen
2
Autoren
2026
Jahr
Abstract
<title>Abstract</title> In this work, we present an annotation framework that demonstrates how a multilingual LLM pretrained on a large corpus can be used as a teacher model to distill the expert knowledge needed for tagging medical texts in Polish. This work is part of a larger project called ADMEDVOICE, within which we collected an extensive corpus of medical texts representing five clinical categories - Radiology, Oncology, Cardiology, Hypertension, and Pathology. Using this data, we had to develop a multi-class classifier, but the fundamental problem turned out to be the lack of resources for annotating an adequate number of texts. Therefore, in our solution, we used the multilingual Llama3.1 model to annotate an extensive corpus of medical texts in Polish. Using our limited annotation resources, we verified only a portion of these labels, creating a test set from them. The data annotated in this way were then used for training and validation of 3 different types of classifiers based on the BERT architecture - the distilled DistilBERT model, BioBERT fine-tuned on medical data, and HerBERT fine-tuned on the Polish language corpus. Among the models we trained, the DistilBERT model achieved the best results, reaching an F1 score > 0.80 for each clinical category and an F1 score > 0.93 for 3 of them. In this way, we obtained a series of highly effective classifiers that represent an alternative to large language models, due to their nearly 500 times smaller size, 300 times lower GPU VRAM consumption, and several hundred times faster inference.
Ähnliche Arbeiten
Research electronic data capture (REDCap)—A metadata-driven methodology and workflow process for providing translational research informatics support
2008 · 49.838 Zit.
Gene Ontology: tool for the unification of biology
2000 · 43.864 Zit.
STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets
2018 · 18.780 Zit.
A translation approach to portable ontology specifications
1993 · 12.445 Zit.
Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research
2005 · 11.966 Zit.