Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

Hu, Qinghao; Sun, Peng; Yan, Shengen; Wen, Yonggang; Zhang, Tianwei

doi:10.1145/3458817.3476223

arXiv:2109.01313 (cs)

[Submitted on 3 Sep 2021 (v1), last revised 6 Sep 2021 (this version, v2)]

Abstract: Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimizing resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of job features and user behaviors. We present a comprehensive study of the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We draw several conclusions from the perspectives of clusters, jobs, and users that can inform cluster system design. Second, we introduce a general-purpose framework that manages resources based on historical data. As case studies, we design a Quasi-Shortest-Service-First scheduling service, which can reduce the cluster-wide average job completion time by up to 6.5x, and a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.

Comments:	This paper has been accepted by the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC21), Nov 14-19, 2021, St. Louis, USA
Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Cite as:	arXiv:2109.01313 [cs.DC]
	(or arXiv:2109.01313v2 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2109.01313
Related DOI:	https://doi.org/10.1145/3458817.3476223

Submission history

From: Qinghao Hu
[v1] Fri, 3 Sep 2021 05:02:52 UTC (620 KB)
[v2] Mon, 6 Sep 2021 01:26:38 UTC (311 KB)
