Hanayo: Harnessing wave-like pipeline parallelism for enhanced large model training efficiency

Z Liu, S Cheng, H Zhou, Y You - … of the International Conference for High …, 2023 - dl.acm.org
Large-scale language models have become increasingly challenging and expensive to
train. Among various methods addressing this issue, Pipeline Parallelism has been widely …
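
The snippet only names pipeline parallelism, so here is a back-of-the-envelope sketch of why schedules such as Hanayo's wave-like one matter: it estimates the idle "bubble" fraction of a plain synchronous (GPipe-style) pipeline. This is an illustrative model under assumed uniform stage times, not the paper's scheduler, and all names are my own.

```python
# Minimal sketch (not the Hanayo scheduler): bubble overhead of a
# synchronous GPipe-style pipeline with p stages and m microbatches,
# assuming every stage takes the same time per microbatch.

def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of device time spent idle (the 'pipeline bubble')."""
    # Each device idles for (p - 1) fill/drain slots while useful work
    # occupies m slots, giving (p - 1) / (m + p - 1) idle time.
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

if __name__ == "__main__":
    for m in (4, 8, 32, 128):
        print(f"stages=8, microbatches={m}: "
              f"bubble={pipeline_bubble_fraction(8, m):.2%}")
```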

BPipe: Memory-balanced pipeline parallelism for training large language models

T Kim, H Kim, GI Yu, BG Chun - International Conference on …, 2023 - proceedings.mlr.press
Pipeline parallelism is a key technique for training large language models within GPU
clusters. However, it often leads to a memory imbalance problem, where certain GPUs face …
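
A minimal sketch of the imbalance BPipe targets, under the common assumption that a 1F1B schedule keeps roughly (p - i) microbatches of activations alive on stage i of p stages; the per-microbatch activation size below is made up for illustration.

```python
# Rough illustration (not the paper's memory model): under a 1F1B pipeline
# schedule, earlier stages hold more in-flight activations than later ones,
# which is the imbalance BPipe rebalances.

def activations_in_flight(num_stages: int) -> list[int]:
    """Approximate number of live activation microbatches per stage."""
    return [num_stages - i for i in range(num_stages)]

if __name__ == "__main__":
    per_microbatch_gb = 2.0   # assumed activation footprint per microbatch per stage
    for stage, n in enumerate(activations_in_flight(8)):
        print(f"stage {stage}: ~{n} microbatches in flight, "
              f"~{n * per_microbatch_gb:.0f} GB of activations")
```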

Baechi: fast device placement of machine learning graphs

B Jeon, L Cai, P Srivastava, J Jiang, X Ke… - Proceedings of the 11th …, 2020 - dl.acm.org
Machine Learning graphs (or models) can be challenging or impossible to train when either
devices have limited memory, or the models are large. Splitting the model graph across …
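
For intuition only, below is a toy memory-constrained placement heuristic in the spirit of the problem Baechi solves: assigning ops of a topologically ordered graph to devices without exceeding memory. The paper's actual algorithms (memory-constrained variants of classic scheduling heuristics) are more involved; every name here is an assumption.

```python
# Illustrative greedy device placement sketch (not Baechi's algorithms):
# walk the ops in topological order and put each on the device with the
# most remaining memory that can still hold it.

def greedy_placement(op_mem, device_mem):
    """op_mem: per-op memory in topological order; device_mem: capacity per device."""
    remaining = list(device_mem)
    placement = []
    for i, need in enumerate(op_mem):
        candidates = [d for d, free in enumerate(remaining) if free >= need]
        if not candidates:
            raise MemoryError(f"op {i} ({need} MB) fits on no device")
        best = max(candidates, key=lambda d: remaining[d])
        remaining[best] -= need
        placement.append(best)
    return placement

if __name__ == "__main__":
    print(greedy_placement([4, 8, 6, 2, 10, 3], device_mem=[16, 16]))
```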

Merak: An efficient distributed DNN training framework with automated 3D parallelism for giant foundation models

Z Lai, S Li, X Tang, K Ge, W Liu, Y Duan… - … on Parallel and …, 2023 - ieeexplore.ieee.org
Foundation models are in the process of becoming the dominant deep learning technology.
Pretraining a foundation model is always time-consuming due to the large scale of both the …
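
As a hedged sketch of what "3D parallelism" refers to, the snippet below maps global ranks onto (data, pipeline, tensor) coordinates; the layout, ordering, and names are assumptions for illustration, not Merak's implementation.

```python
# Sketch of a 3D-parallel rank layout: each worker participates in one
# data-parallel, one pipeline-parallel, and one tensor-parallel group.
import itertools

def rank_grid(dp: int, pp: int, tp: int):
    """Return {global_rank: (dp_idx, pp_idx, tp_idx)} for dp*pp*tp ranks."""
    mapping = {}
    for rank, (d, p, t) in enumerate(itertools.product(range(dp), range(pp), range(tp))):
        mapping[rank] = (d, p, t)
    return mapping

if __name__ == "__main__":
    for rank, (d, p, t) in rank_grid(dp=2, pp=2, tp=2).items():
        print(f"rank {rank}: data={d} pipeline={p} tensor={t}")
```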

MODeL: memory optimizations for deep learning

B Steiner, M Elhoushi, J Kahn… - … on Machine Learning, 2023 - proceedings.mlr.press
The size of deep neural networks has grown exponentially in recent years. Unfortunately,
hardware devices have not kept pace with the rapidly increasing memory requirements. To …
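
As a toy illustration of the quantity such memory optimizers reason about, the sketch below computes the peak memory implied by tensor lifetimes under a fixed operator order. It is not the paper's formulation, which optimizes the order and placement themselves; it only shows what is being minimized, with invented numbers.

```python
# Peak memory implied by tensor lifetimes under a given operator order
# (illustrative only). Each lifetime is (alloc_step, free_step, size).

def peak_memory(lifetimes):
    events = []
    for alloc, free, size in lifetimes:
        events.append((alloc, size))      # allocation
        events.append((free, -size))      # release
    current = peak = 0
    # At equal steps, count allocations before releases (conservative peak).
    for _, delta in sorted(events, key=lambda e: (e[0], -e[1])):
        current += delta
        peak = max(peak, current)
    return peak

if __name__ == "__main__":
    # three activations with overlapping lifetimes (steps, size in MB)
    print(peak_memory([(0, 5, 100), (1, 3, 200), (2, 6, 50)]), "MB peak")
```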

A Comparative Analysis of Distributed Training Strategies for GPT-2

I Patwardhan, S Gandhi, O Khare, A Joshi… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement in Large Language Models has been met with significant
challenges in their training processes, primarily due to their considerable computational and …

Characterizing multi-instance GPU for machine learning workloads

B Li, V Gadepally, S Samsi… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
As machine learning (ML) becomes more and more popular, datacenter operators use
hardware accelerators such as GPUs to tackle the high computation demand of ML …

Unicron: Economizing self-healing LLM training at scale

T He, X Li, Z Wang, K Qian, J Xu, W Yu… - arXiv preprint arXiv …, 2023 - arxiv.org
Training large-scale language models is increasingly critical in various domains, but it is
hindered by frequent failures, leading to significant time and economic costs. Current failure …
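
A generic checkpoint-and-resume loop, included only to ground what "self-healing" training means in its simplest form; the file naming and the fake `train_step` are assumptions, and Unicron's actual failure detection and recovery protocol is far more elaborate.

```python
# Simplest possible recovery pattern (not Unicron): checkpoint periodically,
# and on failure reload the latest checkpoint and keep going.
import glob, pickle, random

def train_step(state):
    """Fake training step that occasionally fails, to exercise recovery."""
    if random.random() < 0.05:
        raise RuntimeError("simulated worker failure")
    state["step"] += 1
    return state

def load_state():
    ckpts = sorted(glob.glob("ckpt_*.pkl"))
    if not ckpts:
        return {"step": 0}
    with open(ckpts[-1], "rb") as f:
        return pickle.load(f)

def run(total_steps=100, ckpt_every=10):
    state = load_state()
    while state["step"] < total_steps:
        try:
            state = train_step(state)
        except RuntimeError:
            state = load_state()          # roll back to the last good checkpoint
            continue
        if state["step"] % ckpt_every == 0:
            with open(f"ckpt_{state['step']:06d}.pkl", "wb") as f:
                pickle.dump(state, f)
    print("finished at step", state["step"])

if __name__ == "__main__":
    run()
```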

Automated tensor model parallelism with overlapped communication for efficient foundation model training

S Li, Z Lai, Y Hao, W Liu, K Ge, X Deng, D Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Deep learning is experiencing a rise in foundation models that are expected to lead in
various fields. The massive number of parameters necessitates the use of tensor model …
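
A single-process NumPy sketch of the basic tensor-model-parallel split (a column-parallel matmul followed by a gather of the partial outputs). The paper's contributions, automatic partitioning and overlapping the resulting communication with compute, are not reproduced here; the function and names are assumptions.

```python
# Column-parallel matmul split across simulated 'devices' (illustrative,
# single process): each shard computes part of the output columns, and the
# final concatenation stands in for the all-gather communication step.
import numpy as np

def column_parallel_matmul(x, w, num_shards):
    shards = np.array_split(w, num_shards, axis=1)   # each device holds a weight shard
    partials = [x @ w_i for w_i in shards]           # local compute per device
    return np.concatenate(partials, axis=1)          # communication: all-gather

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 8))
    w = rng.standard_normal((8, 16))
    assert np.allclose(column_parallel_matmul(x, w, num_shards=4), x @ w)
    print("sharded result matches the unsharded matmul")
```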

Comparative analysis of AWS model deployment services

R Bagai - arXiv preprint arXiv:2405.08175, 2024 - arxiv.org
Amazon Web Services (AWS) offers three important Model Deployment Services for model
developers: SageMaker, Lambda, and Elastic Container Service (ECS). These services …