OpenAlex · Updated hourly · Last updated: 15.03.2026, 03:23

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

An Empirical Analysis of Fine-Tuning Large Language Models on Bioinformatics Literature: PRSGPT and BioStarsGPT

2025 · 0 citations · Open Access
Open full text at the publisher

0 citations · 2 authors · Year: 2025

Abstract

Large language models (LLMs) often lack specialized knowledge for complex bioinformatics applications. We present a reproducible pipeline for fine-tuning LLMs on specialized bioinformatics data, demonstrated through two use cases: PRSGPT, focused on polygenic risk score (PRS) tools, and BioStarsGPT, trained on community forum discussions. The nine-step pipeline integrates diverse data sources, structured preprocessing, prompt-based question-answer (QA) generation (via Google Gemini), natural language inference (NLI) for quality control, semantic deduplication, clustering-based data splitting, and parameter-efficient fine-tuning using LoRA. We fine-tuned three LLMs (LLaMA-3.2-3B, Qwen2.5-7B, Gemma) and benchmarked them on more than 14 lexical and semantic metrics. Qwen2.5-7B emerged as the best performer, with BLEU-4 and ROUGE-1 improvements of 82% and 70% for PRSGPT and 6% and 18% for BioStarsGPT, respectively. The open-source datasets produced include over 28,000 QA pairs for PRSGPT and 154,282 for BioStarsGPT. Human evaluation of PRSGPT yielded 61.9% accuracy on the PRS tools comparison task, comparable to Google Gemini (61.4%), but with richer methodological detail and accurate citations. BioStarsGPT demonstrated 59% conceptual accuracy across 142 curated bioinformatics questions. The pipeline enables scalable, domain-specific fine-tuning of LLMs and supports privacy-preserving, locally deployable bioinformatics assistants; we also explore their practical applications and address the challenges, limitations, and mitigation strategies associated with their development and use.

Key Points

- We present a comprehensive nine-step pipeline for fine-tuning large language models on bioinformatics data, demonstrated through PRSGPT (PRS tools) and BioStarsGPT (forum discussions).
- The pipeline generated over 28,000 QA pairs for PRSGPT from diverse sources (PDFs, GitHub repositories, research articles) and 154,282 QA pairs for BioStarsGPT from community forum discussions.
- Qwen2.5-7B emerged as the best-performing model, achieving BLEU-4 improvements of 82% for PRSGPT and 6% for BioStarsGPT, with ROUGE-1 improvements of 70% and 18%, respectively.
- The methodology integrates automated QA generation via Google Gemini, NLI for quality control, semantic deduplication, and parameter-efficient fine-tuning using LoRA, providing a scalable framework for domain-specific LLM adaptation.
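The abstract names semantic deduplication and clustering-based data splitting as pipeline steps but this page carries no implementation detail. A minimal sketch of the general idea, assuming precomputed sentence embeddings for each QA pair (function names and the similarity threshold are illustrative, not from the paper): near-duplicates are greedily grouped by cosine similarity, and whole groups are then assigned to train or test so paraphrased duplicates never leak across the split.

```python
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def dedup_and_cluster(embeddings, threshold=0.9):
    """Greedy semantic grouping: each item joins the first cluster whose
    representative embedding is within `threshold` cosine similarity;
    otherwise it starts a new cluster."""
    reps, labels = [], []
    for emb in embeddings:
        for ci, rep in enumerate(reps):
            if cosine_sim(emb, rep) >= threshold:
                labels.append(ci)
                break
        else:
            labels.append(len(reps))
            reps.append(emb)
    return labels

def cluster_split(labels, test_frac=0.2, seed=0):
    """Assign whole clusters to train or test so near-duplicate
    QA pairs never straddle the split."""
    rng = np.random.default_rng(seed)
    clusters = sorted(set(labels))
    rng.shuffle(clusters)
    n_test = max(1, int(len(clusters) * test_frac))
    test_clusters = set(clusters[:n_test])
    return ["test" if c in test_clusters else "train" for c in labels]
```

Splitting by cluster rather than by individual example is what makes the reported held-out metrics meaningful: a paraphrase of a training question cannot appear in the test set.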
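Parameter-efficient fine-tuning with LoRA, as named in the pipeline, freezes the pretrained weight matrix and learns only a low-rank correction. A minimal numpy sketch of the underlying math (not the authors' code; `alpha` and the shapes are illustrative): the effective weight is W + (alpha / r) · B·A, where A and B have rank r, so only r·(d_in + d_out) parameters are trained instead of d_in·d_out.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Forward pass through a LoRA-adapted linear layer.

    W : frozen pretrained weight, shape (d_out, d_in)
    A : trainable down-projection, shape (r, d_in)
    B : trainable up-projection, shape (d_out, r), zero-initialised
        so training starts from the unmodified base model.
    """
    r = A.shape[0]
    return x @ (W + (alpha / r) * B @ A).T
```

With B initialised to zeros, the adapted layer is exactly the base model at step 0; fine-tuning then moves only A and B, which is what keeps adaptation cheap on the 3B–7B models benchmarked here.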

Topics

Artificial Intelligence in Healthcare and Education · Topic Modeling · Computational and Text Analysis Methods