This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Evaluating large language models for natural-language-to-code generation on aggregate Czech public health data analysis
Citations: 0
Authors: 7
Year: 2025
Abstract
Abstract Large language models (LLMs) are increasingly explored as tools for healthcare research and data analysis. However, their applicability to structured public health datasets, especially in non-English contexts, remains underexamined. We systematically evaluated 11 state-of-the-art LLMs on their ability to generate executable Python code for analytical queries over Czech public health datasets, focusing on incidence and prevalence data provided by the National Health Information Portal (known as NZIP). A set of representative analytical queries were designed, covering filtering, aggregation, weighted averages, and identification of primary diagnoses. Each model was prompted in Czech and assessed on code executability, correctness of results, and ability to adapt to local terminology. In the majority of cases, the models generated syntactically valid code within one minute, but performance varied. For the main objective of replicating “ground truth” queries as per dataset documentation, ChatGPT-4o achieved the highest accuracy, followed closely by GPT-4.1 mini. Claude and Gemini models frequently failed to apply critical filtering instructions, while Deepseek-R1, though accurate, defaulted to English output. Some models produced code that executed successfully but returned incorrect results, underscoring the need for systematic validation. Overall, LLMs show strong potential as coding assistants in public health analytics, even in Czech-language settings. Their integration into hybrid human–AI workflows, combined with validation mechanisms and retrieval-augmented generation, may accelerate the creation of reliable analytical pipelines.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,393 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,259 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,688 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,502 citations