This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Comparative Performance Analysis of Large Language Models for Structured Data Processing: An Evaluation Framework Applied to Bibliometric Analysis
Citations: 0
Authors: 6
Year: 2026
Abstract
The proliferation of Large Language Models (LLMs) has transformed natural language processing (NLP) applications across diverse domains. This paper presents a comprehensive comparative analysis of three state-of-the-art language models (GPT-4o, Claude-3, and Julius AI), evaluating their performance on systematic NLP tasks using standardized datasets and evaluation frameworks. We introduce a reusable evaluation methodology incorporating five distinct prompt engineering techniques (Prefix, Cloze, Anticipatory, Heuristic, and Chain of Thought) applied to three categories of linguistic challenges: data extraction, aggregation, and contextual reasoning. Using a bibliometric analysis use case as our evaluation domain, we demonstrate the framework's application to structured data processing tasks common in academic research, business intelligence, and data analytics. Our experimental design used a curated Scopus bibliographic dataset of 3212 academic publications to ensure reproducible and objective comparisons. The results demonstrated significant performance variations across models and tasks: GPT-4o achieved 89.3% average accuracy, Julius AI 85.7%, and Claude-3 72.1%. Claude-3 also showed notably high prompt sensitivity (consistency score: 74.3%, compared with 91.2% for GPT-4o and 86.7% for Julius AI). The study revealed critical insights into prompt sensitivity, contextual understanding limitations, and the effectiveness of different prompting strategies for specific task categories. Statistical analysis using repeated measures ANOVA and pairwise t-tests with Bonferroni correction confirmed significant differences between models (F(2, 132) = 142.3, p < 0.001), with effect sizes ranging from 0.51 to 1.33. Response time analysis showed task-dependent latency patterns: for data extraction tasks, Claude-3 was fastest at an average of 1.9 s, followed by GPT-4o at 2.1 s and Julius AI at 2.8 s; for contextual reasoning tasks, latencies rose to 3.8 s for Claude-3, 4.5 s for GPT-4o, and 5.8 s for Julius AI. Overall averages were 2.8 s for Claude-3, 3.2 s for GPT-4o, and 4.1 s for Julius AI. While the specific performance metrics reflect current model versions (GPT-4o: gpt-4o-2024-05-13; Claude-3 Opus: 20240229; Julius AI: v2.1.4), the evaluation framework provides a reusable methodology for ongoing LLM assessment as new versions emerge. These findings offer practical guidance for researchers and practitioners in selecting appropriate LLMs for domain-specific applications and highlight areas requiring further development in language model capabilities. While demonstrated on bibliometric data, the evaluation framework generalizes to other structured data processing domains.
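As a minimal sketch of the kind of statistical comparison the abstract describes (a repeated measures ANOVA plus Bonferroni-corrected pairwise t-tests over per-task accuracy scores), the Python snippet below runs that analysis on fabricated placeholder scores. The score means, task count, and random seed are illustrative assumptions, not the paper's data or released code.

# Illustrative sketch: comparing per-task accuracy of three LLMs with a
# repeated measures ANOVA and Bonferroni-corrected pairwise t-tests.
# The scores below are fabricated placeholders, not the paper's data.
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(42)
models = {"GPT-4o": 0.893, "Julius AI": 0.857, "Claude-3": 0.721}
n_tasks = 45  # hypothetical number of evaluation tasks per model

# Long-format table: one accuracy score per (task, model) pair.
rows = [
    {"task": t, "model": name, "accuracy": rng.normal(mean, 0.05)}
    for name, mean in models.items()
    for t in range(n_tasks)
]
df = pd.DataFrame(rows)

# Repeated measures ANOVA: tasks act as subjects, model is the within factor.
anova = AnovaRM(df, depvar="accuracy", subject="task", within=["model"]).fit()
print(anova.anova_table)

# Pairwise paired t-tests with Bonferroni correction (3 comparisons here).
pairs = list(combinations(models, 2))
alpha = 0.05 / len(pairs)
for a, b in pairs:
    x = df.loc[df.model == a, "accuracy"].to_numpy()
    y = df.loc[df.model == b, "accuracy"].to_numpy()
    t, p = stats.ttest_rel(x, y)
    d = (x - y).mean() / (x - y).std(ddof=1)  # Cohen's d for paired samples
    print(f"{a} vs {b}: t={t:.2f}, p={p:.4g}, d={d:.2f}, "
          f"significant at Bonferroni alpha={alpha:.4f}: {p < alpha}")

Here the tasks play the role of repeated-measures subjects, which matches a design in which every model answers the same fixed set of prompts.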
Similar Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,260 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,116 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,493 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,438 citations