OpenAlex · Updated hourly · Last updated: 13.03.2026, 06:27

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Interference-Aware Latency Prediction With Kernels For Deep Neural Network

2022 · 0 citations

0 citations · 4 authors · Year: 2022

Abstract

With the growing popularity of artificial intelligence applications, deep neural network (DNN) inference workloads are becoming more common on cloud servers. To improve GPU utilization, a GPU executes multiple workloads simultaneously, which inevitably leads to resource contention and increases inference latency. We propose a kernel-based latency prediction method that more accurately predicts latency when multiple workloads interfere with one another. The method uses the kernel parameters obtained by decomposing DNN inference to predict the latency of each kernel, and it estimates the impact of interference on each model from the amount of data exchanged between the L1 cache, the L2 cache, and GPU memory during each model's execution. We conduct experiments on popular models. The results show that, compared with the state-of-the-art multi-model coexistence prediction method, our method reduces the average error by 52% when predicting the latency of a single model, and by 62%, 51%, and 58% when predicting the co-location of two, three, and four models, respectively.
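The abstract describes summing per-kernel latency predictions and scaling them by a memory-traffic-based interference term. The sketch below illustrates that general idea only; the `Kernel` fields, the linear contention factor, and the `contention_coeff` parameter are all hypothetical placeholders, not the paper's actual formula or feature set.

```python
from dataclasses import dataclass

@dataclass
class Kernel:
    """One GPU kernel from a decomposed DNN inference pass (illustrative)."""
    name: str
    base_latency_ms: float   # predicted latency when the model runs alone
    mem_traffic_mb: float    # data moved between L1/L2 cache and GPU memory

def predict_colocated_latency(model, colocated, contention_coeff=0.002):
    """Toy model: inflate each kernel's standalone latency by a factor
    proportional to the total memory traffic of co-located models.
    The real method in the paper is learned from profiled kernel data;
    this linear form is only a stand-in to show the structure."""
    other_traffic = sum(k.mem_traffic_mb for m in colocated for k in m)
    interference = 1.0 + contention_coeff * other_traffic
    return sum(k.base_latency_ms * interference for k in model)

# Example usage with made-up numbers: latency alone vs. with one neighbor.
resnet = [Kernel("conv1", 1.0, 10.0), Kernel("fc", 0.5, 5.0)]
alone = predict_colocated_latency(resnet, [])
shared = predict_colocated_latency(resnet, [[Kernel("bert_mm", 2.0, 100.0)]])
```

Under this toy model, `alone` is simply the sum of the per-kernel base latencies, and `shared` grows with the neighbors' aggregate memory traffic, mirroring the contention effect the paper quantifies.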

Related works