This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Advanced Computing Frameworks for Distributed Training, Deployment, and Monitoring of Artificial Intelligence and Machine Learning Models
Citations: 0
Authors: 1
Year: 2025
Abstract
This study addresses a persistent operational problem in distributed AI and ML: many organizations can scale model training, but still experience reliability and governance breakdowns because training, deployment, and monitoring are implemented as fragmented toolchains rather than an end-to-end lifecycle system. The purpose was to synthesize and compare advanced computing frameworks that support distributed training, deployment, and monitoring, and to quantify which framework patterns most strongly align with dependable production operations. A quantitative, cross-sectional, case-based design was applied via a structured literature review in which each eligible publication was treated as a case instance, spanning enterprise environments including cloud, on-premises clusters, and hybrid deployments. The sample comprised 32 case instances (n = 32). Key variables included framework category prevalence and Lifecycle Integration Score (LIS), dominant architectural patterns and evidence scores, effectiveness outcomes (scalability, deployment reliability, monitoring, traceability), monitoring-maturity indicators, and unresolved-gap severity. The analysis plan combined deductive and inductive coding with frequency, mean, and cross-tabulation summaries (reported as means, SDs, and shares), supported by spreadsheet tools and SPSS. Headline findings show that training-centric frameworks were most prevalent (11/32, 34.4%) but had lower integration (LIS M = 2.9), while end-to-end lifecycle platforms (8/32, 25.0%) achieved the highest integration (LIS M = 4.3) with 87.5% of cases scoring ≥4. The most common architecture was data-parallel training with collective all-reduce (62.5%), followed by orchestration-first deployment (56.3%) and observability-by-design (50.0%).
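The data-parallel all-reduce pattern that the abstract reports as most common can be illustrated with a minimal in-process sketch: each worker computes gradients on its own data shard, a collective step averages the gradients, and every replica applies the same update. This is illustrative only; the function names and the toy least-squares model are assumptions, and real systems perform the collective step with NCCL or MPI across machines rather than a Python loop.

```python
# Minimal sketch of data-parallel training with collective all-reduce,
# simulating workers in a single process. Illustrative, not the study's code.
from typing import List, Tuple

def local_gradient(weights: List[float], shard: List[Tuple[float, float]]) -> List[float]:
    """Mean-squared-error gradient for the toy model y = w0 * x on one shard."""
    grad = [0.0] * len(weights)
    for x, y in shard:
        err = weights[0] * x - y
        grad[0] += 2 * err * x / len(shard)
    return grad

def all_reduce_mean(grads: List[List[float]]) -> List[float]:
    """Collective step: element-wise mean of every worker's gradient."""
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]

def train_step(weights: List[float], shards, lr: float = 0.1) -> List[float]:
    grads = [local_gradient(weights, s) for s in shards]  # runs in parallel in reality
    avg = all_reduce_mean(grads)                          # synchronization point
    return [w - lr * g for w, g in zip(weights, avg)]

# Two workers, each holding its own shard of data generated by y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
weights = [0.0]
for _ in range(200):
    weights = train_step(weights, shards)
print(round(weights[0], 2))  # converges toward 2.0
```

Because every replica sees the same averaged gradient, the replicas never diverge; that synchronization point is also the main scalability bottleneck the surveyed frameworks optimize.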
Cross-sectionally, training scalability effectiveness scored highest (M = 4.1; 65.6% ≥4), while deployment reliability control (M = 3.7) and monitoring effectiveness (M = 3.6) lagged, indicating that operational dependability remains constrained by lifecycle linkages. Monitoring maturity was strongest for service observability (p95/p99 latency and tracing, M = 3.9) but weaker for label-scarce performance degradation tracking (M = 3.4). The most severe gaps were interoperability fragmentation (M = 4.3; 75.0% ≥4) and unclear incident ownership (M = 4.2; 71.9% ≥4), implying that organizations should prioritize integrated platforms with traceability, governance linkage, and label-scarce monitoring proxies to translate scaling gains into stable operations.
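The service-observability signals the abstract scores highest, p95/p99 latency, can be computed over a sliding window of request timings. The sketch below is a hedged illustration under assumed names (`LatencyWindow`, `percentile` are not from the study); production monitoring stacks typically use histogram-based estimates rather than exact sorts.

```python
# Illustrative p95/p99 latency tracking over a bounded sliding window.
from collections import deque

class LatencyWindow:
    def __init__(self, size: int = 1000):
        self.samples = deque(maxlen=size)  # keeps only the most recent samples

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile over the current window."""
        ordered = sorted(self.samples)
        k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
        return ordered[k]

w = LatencyWindow()
for ms in range(1, 101):   # 100 requests taking 1 ms .. 100 ms
    w.record(float(ms))
print(w.percentile(95), w.percentile(99))  # 95.0 99.0
```

Tail percentiles like these need no ground-truth labels, which is why the abstract points to them as proxies where label-scarce model-performance tracking lags behind.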
Related Works
MapReduce
2008 · 18,445 citations
Suspending OpenMP Tasks on Asynchronous Events: Extending the Taskwait Construct
2023 · 12,930 citations
The NIST definition of cloud computing
2011 · 11,555 citations
Bitcoin: A Peer-to-Peer Electronic Cash System
2008 · 11,181 citations
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
2016 · 9,726 citations