SpAtten: Efficient sparse attention architecture with cascade token and head pruning
The attention mechanism is becoming increasingly popular in Natural Language Processing
(NLP) applications, showing performance superior to convolutional and recurrent …
PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation
Large-scale Pretrained Language Models (PLMs) have become the new paradigm for
Natural Language Processing (NLP). PLMs with hundreds of billions of parameters, such as …
Capuchin: Tensor-based gpu memory management for deep learning
In recent years, deep learning has achieved unprecedented success in various domains; the key to this success is larger and deeper deep neural networks (DNNs) that have achieved very …
SiP-ML: high-bandwidth optical network interconnects for machine learning training
This paper proposes optical network interconnects as a key enabler for building high-
bandwidth ML training clusters with strong scaling properties. Our design, called SiP-ML …
Towards efficient post-training quantization of pre-trained language models
Network quantization has gained increasing attention with the rapid growth of large pre-
trained language models (PLMs). However, most existing quantization methods for PLMs …
Graph processing and machine learning architectures with emerging memory technologies: a survey
X Qian - Science China Information Sciences, 2021 - Springer
This paper surveys domain-specific architectures (DSAs) built from two emerging memory
technologies. Hybrid memory cube (HMC) and high bandwidth memory (HBM) can reduce …
Reconfigurability, why it matters in AI tasks processing: A survey of reconfigurable AI chips
Nowadays, artificial intelligence (AI) technologies, especially deep neural networks (DNNs),
play a vital role in solving many problems in both academia and industry. In order to …
Metis: Fast Automatic Distributed Training on Heterogeneous GPUs
As deep learning model sizes expand and new GPUs are released every year, the need for
distributed training on heterogeneous GPUs grows in order to fully harness under-utilized low-end …
Sextans: A streaming accelerator for general-purpose sparse-matrix dense-matrix multiplication
Sparse-Matrix Dense-Matrix multiplication (SpMM) is the key operator for a wide range of
applications including scientific computing, graph processing, and deep learning …
Prague: High-performance heterogeneity-aware asynchronous decentralized training
Distributed deep learning training usually adopts All-Reduce as the synchronization
mechanism for data parallel algorithms due to its high performance in homogeneous …