OpenAlex · Updated hourly · Last updated: 31.03.2026, 03:43

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Cost-Efficient Prompt Routing in Large Language Model Inference Using BERT-Based Difficulty Prediction

2025 · 0 citations
Open full text at publisher

0 citations · 6 authors · Year: 2025

Abstract

Large language models (LLMs) such as GPT-4, Claude 3.5, and Gemini 2.5 are now widely used in natural language processing tasks. However, their high inference cost and computing requirements make them hard to scale, especially in resource-constrained settings such as education, customer support, and interactive AI tools. This work proposes a cost-efficient inference setup that uses semantic embeddings from a BERT-based model, a LightGBM classifier to estimate prompt difficulty, and dynamic routing logic to pick the most appropriate LLM based on a balance between cost and accuracy. The system was tested on 990 prompts taken from OpenBookQA (Easy), GSM8K (Medium), and MMLU (Hard). It achieved a weighted F1 score of 0.89 while cutting average inference cost by over 95%, from $12.08 to $0.34 per 1,000 prompts, compared to always using high-end models. Ablation studies showed that both the semantic embeddings and the routing logic were crucial to maintaining strong performance at low cost, demonstrating that prompt-aware routing can be applied to real-world LLM inference systems and scale well. This work also evaluates the routing framework end-to-end on a mixed workload of conversational, summarisation, and code-generation tasks to measure its impact on latency and throughput. Across 1,200 real-world prompts, the system achieves a 40% reduction in average inference time and routes over 70% of queries to lower-capacity models without degrading overall response quality, maintaining an average human-judged acceptability score within 2% of always using the top-tier model. Finally, we release our implementation as a set of Dockerised microservices, complete with pre-trained difficulty predictors and REST endpoints, to simplify integration into existing LLM pipelines.
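The routing scheme the abstract describes can be sketched as follows. This is a minimal illustration only, not the paper's released implementation: the real system classifies BERT semantic embeddings with LightGBM, while here the difficulty predictor is stubbed with a hypothetical keyword heuristic, and the model names and per-tier costs are made-up placeholders.

```python
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_1k_prompts: float  # USD; illustrative placeholder values

# Hypothetical tier table: cheapest model judged sufficient per difficulty.
TIERS = {
    "easy":   ModelTier("small-model",  0.10),
    "medium": ModelTier("mid-model",    1.50),
    "hard":   ModelTier("large-model", 12.08),
}

def predict_difficulty(prompt: str) -> str:
    """Stand-in for the paper's BERT + LightGBM difficulty predictor.

    Purely illustrative heuristic: flags reasoning keywords as hard
    and numeric content as medium. The actual system predicts from
    semantic embeddings of the prompt, not surface features.
    """
    p = prompt.lower()
    if any(tok in p for tok in ("solve", "prove", "derive")):
        return "hard"
    if any(ch.isdigit() for ch in p):
        return "medium"
    return "easy"

def route(prompt: str) -> ModelTier:
    """Route each prompt to the cheapest tier matching its difficulty."""
    return TIERS[predict_difficulty(prompt)]

if __name__ == "__main__":
    for p in ("What colour is the sky?",
              "If 3 pens cost 12 dollars, what does one cost?",
              "Prove that sqrt(2) is irrational."):
        tier = route(p)
        print(f"{p!r} -> {tier.name} (${tier.cost_per_1k_prompts}/1k)")
```

The cost savings reported in the abstract come from exactly this shape of decision: most prompts fall into the cheap tiers, so the expensive model is invoked only when the predictor deems it necessary.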

Topics

Topic Modeling · Artificial Intelligence in Healthcare and Education · Software System Performance and Reliability