OpenAlex · Updated hourly · Last updated: 16 May 2026, 21:08

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

TCGA-Reports: A machine-readable pathology report resource for benchmarking text-based AI models

2024 · 32 citations · Patterns · Open Access
Authors: 2

Abstract

In cancer research, pathology report text is a largely untapped data source. Pathology reports are routinely generated, more nuanced than structured data, and contain added insight from pathologists. However, there are no publicly available datasets for benchmarking report-based models. Two recent advances suggest the urgent need for a benchmark dataset. First, improved optical character recognition (OCR) techniques will make it possible to access older pathology reports in an automated way, increasing the data available for analysis. Second, recent improvements in natural language processing (NLP) techniques using artificial intelligence (AI) allow more accurate prediction of clinical targets from text. We apply state-of-the-art OCR and customized post-processing to report PDFs from The Cancer Genome Atlas, generating a machine-readable corpus of 9,523 reports. Finally, we perform a proof-of-principle cancer-type classification across 32 tissues, achieving 0.992 average AU-ROC. This dataset will be useful to researchers across specialties, including research clinicians, clinical trial investigators, and clinical NLP researchers.
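The abstract does not say how the "average AU-ROC" across the 32 cancer types is computed. One common reading is a macro-averaged one-vs-rest AU-ROC: compute a binary AU-ROC for each class against all others, then take the unweighted mean. The sketch below illustrates that reading only; the function names, the toy labels, and the scores are hypothetical and are not taken from the paper or its data.

```python
from typing import Dict, List, Sequence


def binary_auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AU-ROC as the probability that a random positive example
    scores higher than a random negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_ovr_auroc(y_true: Sequence[str],
                    y_score: Sequence[Dict[str, float]]) -> float:
    """Unweighted mean of per-class one-vs-rest AU-ROC values.
    Assumption: this is one plausible meaning of 'average AU-ROC'."""
    classes = sorted(set(y_true))
    per_class: List[float] = []
    for c in classes:
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c] for row in y_score]
        per_class.append(binary_auroc(labels, scores))
    return sum(per_class) / len(per_class)


# Hypothetical toy data: three cancer-type labels, six reports,
# one predicted score per class for each report.
toy_true = ["BRCA", "BRCA", "LUAD", "LUAD", "KIRC", "KIRC"]
toy_score = [
    {"BRCA": 0.80, "LUAD": 0.10, "KIRC": 0.10},
    {"BRCA": 0.25, "LUAD": 0.65, "KIRC": 0.10},  # a misranked BRCA report
    {"BRCA": 0.20, "LUAD": 0.70, "KIRC": 0.10},
    {"BRCA": 0.30, "LUAD": 0.60, "KIRC": 0.10},
    {"BRCA": 0.10, "LUAD": 0.20, "KIRC": 0.70},
    {"BRCA": 0.10, "LUAD": 0.10, "KIRC": 0.80},
]
toy_macro = macro_ovr_auroc(toy_true, toy_score)
```

A macro average weights every cancer type equally regardless of how many reports it has; a support-weighted average would be a different, equally plausible choice.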

Topics

Machine Learning in Healthcare · Radiomics and Machine Learning in Medical Imaging · AI in cancer detection