This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

LLM-based medical dialogue dataset generation with automated instructions

2026 · 0 citations · Scientific Reports · Open Access


Abstract

Constructing medical dialogue datasets poses significant challenges owing to legal and privacy concerns. With the advancement of large language models (LLMs), LLM-based automated instruction generation has emerged as a promising approach for dataset construction. However, existing methods often overlook the integration of domain knowledge, such as the standards and regulations stipulated in official documents, which reduces the value of the generated instructions and corpus. In this work, we propose a new LLM-based automated instruction generation framework to build a medical dialogue dataset compliant with the guidelines of the Medical Chinese Test (MCT). The framework involves the construction of a hand-crafted instruction set, corpus refinement, instruction sampling using maximum marginal relevance, and the K-means algorithm. Because the framework incorporates domain-specific knowledge and adopts an instruction sampling strategy, the generated instructions and corpus largely meet the MCT standards. We tested the framework with ChatGPT (gpt-3.5-turbo) and the medical LLM Zuoyi and found that, compared with real-world medical dialogue datasets, the generated dataset MCT-Chat, consisting of 20k examples, demonstrates excellent performance in terms of both objective and subjective indicators.
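As a rough illustration of the maximum marginal relevance (MMR) step named in the abstract, the following is a minimal sketch of generic MMR selection over embedding vectors. It is not the authors' implementation: the function names `cosine` and `mmr_select`, the trade-off parameter `lam`, and the toy vectors are all assumptions made for illustration, and the abstract does not say how MMR is combined with K-means.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k, lam=0.5):
    """Return indices of up to k candidates chosen greedily by MMR.

    Hypothetical helper: `query` and `candidates` stand in for embeddings
    of a seed instruction and of generated instructions.
    lam=1.0 ranks by relevance only; lower values favor diversity.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (assumed): the second vector nearly duplicates the first.
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
diverse = mmr_select(query, candidates, k=2, lam=0.3)
relevant = mmr_select(query, candidates, k=2, lam=1.0)
```

With these toy vectors, a low `lam` skips the near-duplicate second vector in favor of the dissimilar third one, while `lam=1.0` falls back to plain relevance ranking.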

Institutions

Huaqiao University (CN)

Topics

Topic Modeling · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education
