Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers

2025·0 Zitationen·ArXiv.orgOpen Access

Volltext beim Verlag öffnen

Zitationen

Autoren

2025

Jahr

Abstract

While large language models (LLMs) show transformative potential in healthcare, their development remains focused on high-resource languages. This creates a critical barrier for other languages, as simple translation fails to capture unique clinical and cultural nuances, such as endemic diseases. To address this, we introduce MedPT, the first large-scale, real-world corpus of patient-doctor interactions for the Brazilian Portuguese medical domain. Comprising 384,095 authentic question-answer pairs and covering over 3,200 distinct health-related conditions, the dataset was refined through a rigorous multi-stage curation protocol that employed a hybrid quantitative-qualitative analysis to filter noise and contextually enrich thousands of ambiguous queries, resulting in a corpus of approximately 57 million tokens. We further utilize of LLM-driven annotation to classify queries into seven semantic types to capture user intent. To validate MedPT's utility, we benchmark it in a medical specialty classification task: fine-tuning a 1.7B parameter model achieves an outstanding 94\% F1-score on a 20-class setup. Furthermore, our qualitative error analysis shows misclassifications are not random but reflect genuine clinical ambiguities (e.g., between comorbid conditions), proving the dataset's deep semantic richness. We publicly release MedPT on Hugging Face to support the development of more equitable, accurate, and culturally-aware medical technologies for the Portuguese-speaking world.

Autoren

Themen

Machine Learning in HealthcareTopic ModelingArtificial Intelligence in Healthcare and Education

Volltext beim Verlag öffnen

MedPT: A Massive Medical Question Answering Dataset for Brazilian-Portuguese Speakers

Abstract

Ähnliche Arbeiten

Autoren

Themen