This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Interference-Aware Latency Prediction With Kernels For Deep Neural Network
Citations: 0
Authors: 4
Year: 2022
Abstract
With the growing popularity of artificial intelligence applications, deep neural network (DNN) inference workloads are becoming increasingly common on cloud servers. To improve GPU utilization, a GPU executes multiple workloads simultaneously, which inevitably leads to resource contention and increased inference latency. We propose a kernel-based latency prediction method that more accurately predicts latency when multiple workloads interfere with one another. The method uses the kernel parameters decomposed during DNN inference to predict the latency of each kernel, and it estimates the impact of interference on each model from the amount of data exchanged among the L1 cache, the L2 cache, and GPU memory during each model's execution. We conduct experiments on popular models. The results show that, compared with the state-of-the-art multi-model coexistence prediction method, our method reduces the average error by 52% when predicting the latency of a single model, and by 62%, 51%, and 58% when two, three, and four models are co-located, respectively.
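The abstract's core idea can be illustrated with a minimal sketch: sum per-kernel latencies and inflate each one by a contention factor derived from the memory traffic of co-located workloads. All names, the linear contention model, and the sensitivity constant below are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of kernel-based latency prediction under interference.
# The linear contention model and all parameter values are assumptions for
# illustration only; the paper's actual model is not reproduced here.
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    solo_latency_ms: float  # latency when the kernel runs alone on the GPU
    mem_traffic_mb: float   # data exchanged across L1/L2/GPU memory


def contention_factor(co_located_traffic_mb: float,
                      sensitivity: float = 0.001) -> float:
    """Toy linear model: more co-located memory traffic -> larger slowdown."""
    return 1.0 + sensitivity * co_located_traffic_mb


def predict_latency(model: list[Kernel], co_located: list[Kernel]) -> float:
    """Predicted latency of `model` when `co_located` kernels share the GPU."""
    other_traffic = sum(k.mem_traffic_mb for k in co_located)
    return sum(k.solo_latency_ms * contention_factor(other_traffic)
               for k in model)


# Example: a two-kernel model sharing the GPU with one memory-heavy kernel.
resnet_like = [Kernel("conv1", 0.5, 120.0), Kernel("fc", 0.1, 30.0)]
neighbor = [Kernel("gemm", 0.8, 500.0)]
print(round(predict_latency(resnet_like, neighbor), 3))  # 0.9
```

With 500 MB of neighboring traffic and a sensitivity of 0.001, every kernel is slowed by a factor of 1.5, so the predicted latency rises from 0.6 ms (solo) to 0.9 ms. The paper's method replaces this toy factor with measurements of actual cache and memory data exchange per kernel.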
Related Works
Deep Residual Learning for Image Recognition
2016 · 216,020 citations
U-Net: Convolutional Networks for Biomedical Image Segmentation
2015 · 85,918 citations
ImageNet classification with deep convolutional neural networks
2017 · 75,547 citations
Very Deep Convolutional Networks for Large-Scale Image Recognition
2014 · 75,404 citations
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
2016 · 52,636 citations