OpenAlex

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

BioBERTurk: Exploring Turkish Biomedical Language Model Development Strategies in Low Resource Setting

2022 · 4 citations · Research Square · Open Access
Open full text at the publisher

Citations: 4 · Authors: 4 · Year: 2022

Abstract

Pretrained language models elevated with in-domain corpora show impressive results in biomedical and clinical NLP tasks in English. However, there is minimal work in low-resource languages. Although some pioneering works show promising results, many scenarios still need to be explored to engineer effective pretrained language models in biomedicine for low-resource settings. This work introduces the BioBERTurk family, four pretrained models in Turkish for biomedicine. To evaluate the models, we also introduce a labeled dataset for classifying radiology reports of head CT exams. Two different parts of the reports, impressions and findings, are evaluated separately to observe model performance on longer and less informative text. We compare the models with the Turkish BERT (BERTurk) pretrained on general-domain text, multilingual BERT, and an LSTM+attention-based baseline model. The first model, initialized from BERTurk and then further pretrained on a biomedical corpus, performs statistically better than BERTurk, multilingual BERT, and the baseline on both datasets. The second model continues to pretrain the BERTurk model using only radiology Ph.D. theses to test the effect of task-related text. This model slightly outperforms all models on the impressions dataset, showing that using only radiology-related data for continual pretraining can be effective. The third model continues pretraining with radiology theses added to the biomedical corpus but does not show a statistically meaningful difference. The final model combines the radiology and biomedical corpora with the corpus of BERTurk and pretrains a BERT model from scratch. This is the worst-performing model of the BioBERTurk family, even worse than BERTurk and multilingual BERT.
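The strongest strategy described above, continuing masked-language-model pretraining from the general-domain BERTurk checkpoint on in-domain text, can be sketched in a few lines. The following is a minimal illustration using Hugging Face Transformers, not the authors' code: the corpus file name and all hyperparameters are placeholder assumptions, while `dbmdz/bert-base-turkish-cased` is the publicly released BERTurk checkpoint.

```python
# Minimal sketch of continual pretraining from BERTurk on an in-domain corpus.
# "biomedical_corpus.txt" is a hypothetical placeholder (one document per line);
# the hyperparameters are illustrative, not the paper's settings.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "dbmdz/bert-base-turkish-cased"  # public BERTurk checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Load the hypothetical in-domain corpus as a plain-text dataset.
dataset = load_dataset("text", data_files={"train": "biomedical_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style masking: 15% of tokens masked for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="bioberturk-continual",
    per_device_train_batch_size=8,
    num_train_epochs=1,      # illustrative; real continual pretraining runs much longer
    learning_rate=5e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```

Downstream evaluation on the head-CT report dataset would then be standard sequence-classification fine-tuning of the resulting checkpoint, run separately on the impressions and findings sections.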

Topics

Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques