This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
MedResearchBench: A Multi-Domain Benchmark for Evaluating AI Research Agents on Clinical Medical Research
Citations: 0
Authors: 2
Year: 2026
Abstract
The rapid advancement of AI research automation systems such as AI Scientist, data-to-paper, and Agent Laboratory has demonstrated the potential for autonomous scientific discovery. However, existing benchmarks for evaluating these systems focus predominantly on the fundamental sciences (machine learning, physics, chemistry), overlooking the unique challenges of clinical medical research: complex survey designs, inferential statistics with confounding control, adherence to reporting standards (STROBE, CONSORT), and the requirement for clinically actionable interpretation. We present MedResearchBench, the first benchmark specifically designed to evaluate AI systems on clinical medical research tasks. MedResearchBench comprises 16 tasks spanning 7 clinical domains (cardiovascular, oncology, mental health, metabolic, respiratory, neurology, infectious disease), built on publicly available datasets (the National Health and Nutrition Examination Survey [NHANES] and the Surveillance, Epidemiology, and End Results [SEER] program), with ground truth drawn from 16 high-quality published papers (impact factor range: 2.3–51.0). Each task is evaluated along 6 medicine-specific dimensions: statistical methodology, results accuracy, visualization quality, clinical interpretation, confounding sensitivity, and reporting compliance. We describe the benchmark design rationale, task construction methodology, paper selection criteria with anti-paper-mill filtering, and a detailed analysis of task characteristics, including methodological diversity, evaluation dimension coverage, and difficulty stratification. To demonstrate benchmark executability, we evaluate an agentic data2paper pipeline on 3 pilot tasks spanning all three difficulty tiers, achieving scores of 72/100 (Tier 1, Cardio 000), 69/100 (Tier 2, Mental 000), and 75/100 (Tier 3, Metabolic 002), for a mean score of 72/100 (B level). Survey-weighted methodology was correctly implemented across all tasks; the primary limitations were covariate incompleteness and reference-group misspecification. MedResearchBench addresses a critical gap in AI research evaluation and provides a standardized, community-extensible platform for assessing whether AI systems can conduct clinically sound, publication-quality medical research. All task materials are publicly available at https://github.com/TerryFYL/MedResearchBench.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,578 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,470 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,984 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,814 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations