This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Evaluation of Large Language Models for Surgical Billing in Hallux Valgus Osteotomy
Citations: 0
Authors: 7
Year: 2026
Abstract
Category: Midfoot/Forefoot; Bunion; Diabetes

Keywords: Hallux Valgus, Bunion, Lapidus

Introduction/Purpose: The process of generating Current Procedural Terminology (CPT) codes for surgical billing is lengthy, suboptimal, and in flux with the advent of new procedures. Artificial intelligence (AI) automation could increase the accuracy of foot and ankle surgical billing, decrease turnaround time for reimbursements, lessen charting, and reduce billing overhead costs. Large language models (LLMs) interpret human language and show potential for producing CPT codes from operative notes. However, their ability to generate CPT codes from hallux valgus osteotomy operative notes remains unexplored. This study aims to evaluate the effectiveness of four publicly accessible LLMs in determining appropriate CPT codes from minimally invasive surgery (MIS), lapidus, and open hallux valgus osteotomy operative notes.

Methods: Seventeen MIS, 17 lapidus, and 17 open hallux valgus osteotomy operative notes, with the corresponding CPT codes billed by the hospital, were collected from two surgeons between 2020 and 2025. One note from each procedure type was randomly selected as a training set; the remaining 48 notes formed the testing set. Each LLM (ChatGPT 5.0 [ChatGPT], Copilot GPT5 [Copilot], Gemini 2.5 Flash [Gemini], and Claude Sonnet 4 [Claude]) was evaluated in three trials: prompted, classify, and learn. In the prompted trial, each LLM was asked to generate CPT codes for the testing set. In the learn trial, each LLM was presented with the training set before generating CPT codes for the testing set. In the classify trial, each LLM generated CPT codes for the testing set based on a PDF of the Foot and Ankle Procedure Code Reference. LLM-generated CPT codes were compared against those billed by the hospital to calculate exact match and F1.

Results: The best-performing model for exact match was Claude with learning, which predicted an exact match 6.33% of the time on average.
Regarding F1 scores, all LLMs performed similarly in the classify trial. However, Copilot achieved significantly higher F1 in the prompted trial for MIS and lapidus notes than Gemini, Claude, and ChatGPT (Figure 1, all p < 0.05). Copilot and Claude also achieved significantly higher F1 than Gemini in the learn trials for lapidus surgery (Figure 1, p < 0.05). The greatest difference in F1 between trial types was for MIS surgical notes, where the learn trials yielded a significant increase in F1 over the classify and prompted trials across ChatGPT, Copilot, Gemini, and Claude (Figure 1, all p < 0.05).

Conclusion: LLMs demonstrated varying capabilities in generating CPT codes from hallux valgus operative notes. Providing training sets to LLMs and using consistent language in operative notes may improve LLM CPT code generation. Although current F1 scores were low across all LLMs, prompting with GPT5 may outperform other models, showing potential for future generations of LLMs to assist in foot and ankle billing workflows. Reducing administrative burden and streamlining the billing process could free more time for patient care and increase the affordability of surgery. This study provides a benchmark for the use of AI in surgical billing for hallux valgus surgery.
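The abstract does not publish the authors' scoring code, but the two metrics it reports can be computed straightforwardly from the predicted and billed CPT code sets of each note. The sketch below is an illustrative assumption of how per-note exact match and F1 might be calculated; the function and variable names are hypothetical, not the authors' actual pipeline.

```python
def score_cpt_codes(predicted, billed):
    """Score one operative note's LLM-generated CPT codes against the
    hospital-billed codes (illustrative sketch, not the study's code).

    predicted, billed: iterables of CPT code strings.
    Returns (exact_match, f1): exact match requires the predicted code
    set to equal the billed set; F1 is the harmonic mean of per-note
    precision and recall over the code sets.
    """
    pred, gold = set(predicted), set(billed)
    exact = pred == gold
    tp = len(pred & gold)  # codes both predicted and billed
    if tp == 0 or not pred or not gold:
        return exact, 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return exact, f1

# Example with hypothetical codes: two of three billed codes recovered,
# plus one spurious prediction -> precision 2/3, recall 2/3, F1 = 2/3.
exact, f1 = score_cpt_codes({"28296", "28740", "99999"},
                            {"28296", "28740", "28297"})
```

Averaging these per-note scores over the 48 testing notes would yield figures comparable to the reported exact-match rate and F1 values.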