This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing
Citations: 12
Authors: 43
Year: 2022
Abstract
Training and evaluating language models increasingly requires the construction of meta-datasets: diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a diversity of novel pretraining tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBIO, a community library of 126+ biomedical NLP datasets, currently covering 12 task categories and 10+ languages. BigBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero-shot language model evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BigBIO is an ongoing community effort and is available at https://github.com/bigscience-workshop/biomedical
Authors
- Jason Fries
- Leon Weber
- Natasha Seelam
- Gabriel Altay
- Debajyoti Datta
- Samuele Garda
- Myungsun Kang
- Ruisi Su
- Wojciech Kusa
- Samuel Cahyawijaya
- Fabio Barth
- Simon Ott
- Matthias Samwald
- Stephen Bach
- Stella Biderman
- Mario Sänger
- Bo Wang
- Alison Callahan
- Daniel León Periñán
- Théo Gigant
- Patrick Haller
- Jenny Chim
- Jose David Posada
- John Michael Giorgi
- Karthik Rangasai Sivaraman
- Marc Pàmies
- Marianna Nezhurina
- Robert Martin
- Michael Cullan
- Moritz Freidank
- Nathan Dahlberg
- Shubhanshu Mishra
- Shamik Bose
- Nicholas Michio Broad
- Yanis Labrak
- Shlok S Deshmukh
- Sid Kiblawi
- Ayush Singh
- Minh Chien Vu
- Trishala Neeraj
- Jonas Golde
- A. Villanova del Moral
- Benjamin Beilharz
Institutions
- Stanford University (US)
- Max Delbrück Center (DE)
- Humboldt-Universität zu Berlin (DE)
- Tempus Labs (US)
- University of Virginia (US)
- Immuneering (US)
- Brown University (US)
- Hong Kong University of Science and Technology (HK)
- University of Vienna (AT)
- Booz Allen Hamilton (US)
- Laboratoire Informatique d'Avignon (FR)