Kubernetes scheduling: Taxonomy, ongoing issues and challenges

C Carrión - ACM Computing Surveys, 2022 - dl.acm.org
Continuous integration enables the development of microservices-based applications using
container virtualization technology. Container orchestration systems such as Kubernetes …

Deep neural networks in the cloud: Review, applications, challenges and research directions

KY Chan, B Abu-Salih, R Qaddoura, AZ Ala'M… - Neurocomputing, 2023 - Elsevier
Deep neural networks (DNNs) are currently being deployed as machine learning technology
in a wide range of important real-world applications. DNNs consist of a huge number of …

NetLLM: Adapting large language models for networking

D Wu, X Wang, Y Qiao, Z Wang, J Jiang, S Cui… - Proceedings of the …, 2024 - dl.acm.org
Many networking tasks now employ deep learning (DL) to solve complex prediction and
optimization problems. However, current design philosophy of DL-based algorithms entails …

A survey of Kubernetes scheduling algorithms

K Senjab, S Abbas, N Ahmed, AR Khan - Journal of Cloud Computing, 2023 - Springer
As cloud services expand, the need to improve the performance of data center infrastructure
becomes more important. High-performance computing, advanced networking solutions …

Deep learning workload scheduling in GPU datacenters: A survey

Z Ye, W Gao, Q Hu, P Sun, X Wang, Y Luo… - ACM Computing …, 2024 - dl.acm.org
Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The
development of a DL model is a time-consuming and resource-intensive procedure. Hence …

Deep learning workload scheduling in GPU datacenters: Taxonomy, challenges and vision

W Gao, Q Hu, Z Ye, P Sun, X Wang, Y Luo… - arXiv preprint arXiv …, 2022 - arxiv.org
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL
model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU …

Preemptive all-reduce scheduling for expediting distributed DNN training

Y Bao, Y Peng, Y Chen, C Wu - IEEE INFOCOM 2020-IEEE …, 2020 - ieeexplore.ieee.org
Data-parallel training is widely used for scaling DNN training over large datasets, using the
parameter server or all-reduce architecture. Communication scheduling has been promising …

Cluster resource scheduling in cloud computing: literature review and research challenges

W Khallouli, J Huang - The Journal of Supercomputing, 2022 - Springer
Scheduling plays a pivotal role in cloud computing systems. Designing an efficient
scheduler is a challenging task. The challenge comes from several aspects, including the …

AI-based resource management in beyond 5G cloud native environment

A Boudi, M Bagaa, P Pöyhönen, T Taleb… - IEEE Network, 2021 - ieeexplore.ieee.org
5G system and beyond targets a large number of emerging applications and services that
will create extra overhead on network traffic. These industrial verticals have aggressive …

Lucid: A non-intrusive, scalable and interpretable scheduler for deep learning training jobs

Q Hu, M Zhang, P Sun, Y Wen, T Zhang - Proceedings of the 28th ACM …, 2023 - dl.acm.org
While recent deep learning workload schedulers exhibit excellent performance, it is arduous
to deploy them in practice due to some substantial defects, including inflexible intrusive …