This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Word Embedding for the French Natural Language in Health Care: Comparative Study
Citations: 25
Authors: 8
Year: 2019
Abstract
BACKGROUND: Word embedding technologies, a set of language modeling and feature learning techniques in natural language processing (NLP), are now used in a wide range of applications. However, no formal evaluation or comparison has been made of the ability of the 3 most widely used unsupervised implementations (Word2Vec, GloVe, and FastText) to preserve the semantic similarities between words when trained on the same dataset.

OBJECTIVE: The aim of this study was to compare embedding methods trained on a corpus of French health-related documents produced in a professional context. The best method will then help us develop a new semantic annotator.

METHODS: Unsupervised embedding models were trained on 641,279 documents originating from the Rouen University Hospital. These data are unstructured and cover a wide range of documents produced in a clinical setting (discharge summaries, procedure reports, and prescriptions). In total, 4 rated evaluation tasks were defined (cosine similarity, odd one, analogy-based operations, and human formal evaluation) and applied to each model, along with embedding visualization.

RESULTS: Word2Vec had the highest score on 3 out of 4 rated tasks (analogy-based operations, odd one similarity, and human validation), particularly with the skip-gram architecture.

CONCLUSIONS: Although this implementation performed best at preserving semantic properties, each model has its own strengths and weaknesses, such as the training time, which is very short for GloVe, or the preservation of morphological similarity observed with FastText. Models and test sets produced by this study will be the first to be publicly available through a graphical interface to help advance French biomedical research.
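To make three of the rated tasks named in the abstract concrete (cosine similarity, odd one, and analogy-based operations), here is a minimal Python sketch. It is not the study's evaluation code: the abstract does not describe the scoring protocol, so the function names (`cosine`, `odd_one_out`, `analogy`), the 3-dimensional toy vectors, and the French example words are all assumptions made for illustration. Real vectors would come from a trained Word2Vec, GloVe, or FastText model.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. They are NOT
# taken from the study; a real test would load vectors from a trained model.
VECTORS = {
    "coeur": [0.9, 0.1, 0.0],
    "cardiaque": [0.8, 0.2, 0.1],
    "poumon": [0.1, 0.9, 0.0],
    "pulmonaire": [0.0, 1.0, 0.1],
    "fracture": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def odd_one_out(words, vectors):
    """Return the word with the lowest mean similarity to the other words."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        total = sum(cosine(vectors[word], vectors[o]) for o in others)
        return total / len(others)

    return min(words, key=mean_similarity)


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using the vector offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, as is common in analogy tests.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


if __name__ == "__main__":
    print(cosine(VECTORS["coeur"], VECTORS["cardiaque"]))
    print(odd_one_out(["coeur", "cardiaque", "fracture"], VECTORS))
    print(analogy("coeur", "cardiaque", "poumon", VECTORS))
```

With these toy vectors, the odd-one task singles out the word least similar on average to the rest of its group, and the analogy task picks the vocabulary word closest to the offset vector. A model comparison along these lines would run the same word lists against each trained model and count the correct answers.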
Authors
Institutions
- Laboratoire d'Informatique, du Traitement de l'Information et des Systèmes (FR)
- Université de Rouen Normandie (FR)
- Institut National des Sciences Appliquées Rouen Normandie (FR)
- Centre Hospitalier Universitaire de Rouen (FR)
- Département d'Informatique (FR)
- Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé
- Sorbonne Université (FR)