OpenAlex · Updated hourly · Last updated: 21.03.2026, 11:37

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Comparative Performance Analysis of Large Language Models for Structured Data Processing: An Evaluation Framework Applied to Bibliometric Analysis

2026 · 0 citations · Applied Sciences · Open Access

0 citations · 6 authors · Year: 2026

Abstract

The proliferation of Large Language Models (LLMs) has transformed natural language processing (NLP) applications across diverse domains. This paper presents a comprehensive comparative analysis of three state-of-the-art language models (GPT-4o, Claude-3, and Julius AI), evaluating their performance on systematic NLP tasks using standardized datasets and evaluation frameworks. We introduce a reusable evaluation methodology incorporating five distinct prompt engineering techniques (Prefix, Cloze, Anticipatory, Heuristic, and Chain of Thought) applied to three categories of linguistic challenges: data extraction, aggregation, and contextual reasoning. Using bibliometric analysis as our evaluation domain, we demonstrate the framework's application to structured data processing tasks common in academic research, business intelligence, and data analytics. Our experimental design used a curated Scopus bibliographic dataset of 3212 academic publications to ensure reproducible and objective comparisons. The results showed significant performance variations across models and tasks: GPT-4o achieved 89.3% average accuracy, Julius AI 85.7%, and Claude-3 72.1%, with Claude-3 also exhibiting notably high prompt sensitivity (consistency score: 74.3%, compared with 91.2% for GPT-4o and 86.7% for Julius AI). The study revealed critical insights into prompt sensitivity, contextual understanding limitations, and the effectiveness of different prompting strategies for specific task categories. Statistical analysis using repeated measures ANOVA and pairwise t-tests with Bonferroni correction confirmed significant differences between models (F(2, 132) = 142.3, p < 0.001), with effect sizes ranging from 0.51 to 1.33.
Response time analysis showed task-dependent latency patterns: for data extraction tasks, Claude-3 averaged 1.9 s (fastest), GPT-4o 2.1 s, and Julius AI 2.8 s; for contextual reasoning tasks, latencies rose to 3.8 s (Claude-3), 4.5 s (GPT-4o), and 5.8 s (Julius AI). Overall averages were 2.8 s for Claude-3, 3.2 s for GPT-4o, and 4.1 s for Julius AI. While the specific performance metrics reflect the model versions tested (GPT-4o: gpt-4o-2024-05-13; Claude-3 Opus: 20240229; Julius AI: v2.1.4), the evaluation framework provides a reusable methodology for ongoing LLM assessment as new versions emerge. These findings offer practical guidance for researchers and practitioners selecting LLMs for domain-specific applications and highlight areas requiring further development in language model capabilities. While demonstrated on bibliometric data, the evaluation framework generalizes to other structured data processing domains.
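The pairwise comparison procedure named in the abstract (paired t-tests between models with Bonferroni correction, plus effect sizes) can be sketched as follows. The per-task accuracy vectors below are synthetic placeholders, not the paper's data, and the model labels are taken from the abstract only for illustration:

```python
# Sketch of pairwise model comparison with Bonferroni correction.
# The accuracy vectors are synthetic stand-ins, NOT the paper's results.
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = {  # hypothetical per-task accuracy scores (one value per task)
    "GPT-4o": rng.normal(0.893, 0.03, 45),
    "Julius AI": rng.normal(0.857, 0.03, 45),
    "Claude-3": rng.normal(0.721, 0.05, 45),
}

pairs = list(combinations(scores, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted significance threshold

for a, b in pairs:
    t, p = stats.ttest_rel(scores[a], scores[b])  # paired t-test across tasks
    diff = scores[a] - scores[b]
    d = diff.mean() / diff.std(ddof=1)  # Cohen's d for paired samples
    verdict = "significant" if p < alpha else "not significant"
    print(f"{a} vs {b}: t={t:.2f}, p={p:.2e}, d={d:.2f} ({verdict})")
```

This mirrors the reported analysis only in structure; the paper's F-statistic and effect sizes come from its own repeated measures design, not from this toy data.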
