The future of computing beyond Moore's Law

J Shalf - Philosophical Transactions of the Royal Society …, 2020 - royalsocietypublishing.org
Moore's Law is a techno-economic model that has enabled the information technology
industry to double the performance and functionality of digital electronics roughly every 2 …

Efficient hardware architectures for accelerating deep neural networks: Survey

P Dhilleswararao, S Boppu, MS Manikandan… - IEEE …, 2022 - ieeexplore.ieee.org
In the modern-day era of technology, a paradigm shift has been witnessed in the areas
involving applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep …

Mamba: Linear-time sequence modeling with selective state spaces

A Gu, T Dao - arXiv preprint arXiv:2312.00752, 2023 - minjiazhang.github.io
Foundation models, now powering most of the exciting applications in deep learning, are
almost universally based on the Transformer architecture and its core attention module …
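The "selective state spaces" in the title refer to a recurrence whose parameters depend on the current input, which lets the model process a sequence in linear time instead of attention's quadratic time. The sketch below is a minimal single-channel illustration of that idea in NumPy; the names `W_B`, `W_C`, and `w_delta` and the scalar-input setup are assumptions for the example, not Mamba's actual parameterization or its hardware-aware parallel scan.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, w_delta):
    """Minimal selective state-space recurrence for one input channel: the state
    update costs O(1) per token (linear time overall), and B, C and the step
    size delta are computed from the input, which is the 'selective' part."""
    seq, d_state = x.shape[0], A.shape[0]
    h = np.zeros(d_state)
    y = np.zeros(seq)
    for t in range(seq):
        delta = np.log1p(np.exp(w_delta * x[t]))  # softplus keeps the step size positive
        B = W_B * x[t]                            # input-dependent input projection
        C = W_C * x[t]                            # input-dependent output projection
        A_bar = np.exp(delta * A)                 # discretize the diagonal state matrix
        h = A_bar * h + delta * B * x[t]          # linear recurrence, no attention matrix
        y[t] = C @ h
    return y

rng = np.random.default_rng(0)
seq, d_state = 128, 16
x = rng.standard_normal(seq)
A = -np.abs(rng.standard_normal(d_state))         # negative entries keep the state stable
W_B, W_C = rng.standard_normal(d_state), rng.standard_normal(d_state)
print(selective_ssm(x, A, W_B, W_C, w_delta=0.5).shape)  # (128,)
```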

GQA: Training generalized multi-query transformer models from multi-head checkpoints

J Ainslie, J Lee-Thorp, M De Jong… - arXiv preprint arXiv …, 2023 - arxiv.org
Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up
decoder inference. However, MQA can lead to quality degradation, and moreover it may not …
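As the snippet notes, MQA shares a single key-value head across all query heads; grouped-query attention (GQA) interpolates between that and full multi-head attention by sharing each key-value head across a group of query heads, which shrinks the KV cache during decoding. The NumPy sketch below illustrates only the head-sharing pattern under assumed names and shapes; it omits masking, caching, and the output projection.

```python
import numpy as np

def grouped_query_attention(x, Wq, Wk, Wv, num_q_heads, num_kv_heads):
    """Toy grouped-query attention: num_kv_heads key/value heads are shared
    across groups of query heads. num_kv_heads == num_q_heads gives ordinary
    multi-head attention; num_kv_heads == 1 gives multi-query attention."""
    seq, d_model = x.shape
    d_head = d_model // num_q_heads
    group_size = num_q_heads // num_kv_heads

    # Queries get num_q_heads heads; keys/values get only num_kv_heads heads.
    q = (x @ Wq).reshape(seq, num_q_heads, d_head)
    k = (x @ Wk).reshape(seq, num_kv_heads, d_head)
    v = (x @ Wv).reshape(seq, num_kv_heads, d_head)

    outputs = []
    for h in range(num_q_heads):
        kv = h // group_size                            # query head h reads KV head kv
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        outputs.append(weights @ v[:, kv])
    return np.concatenate(outputs, axis=-1)             # (seq, d_model)

# Example: 8 query heads sharing 2 KV heads, so the KV projections are 4x smaller.
rng = np.random.default_rng(0)
seq, d_model, n_q, n_kv = 16, 64, 8, 2
d_head = d_model // n_q
x = rng.standard_normal((seq, d_model))
Wq = 0.1 * rng.standard_normal((d_model, d_model))
Wk = 0.1 * rng.standard_normal((d_model, n_kv * d_head))
Wv = 0.1 * rng.standard_normal((d_model, n_kv * d_head))
print(grouped_query_attention(x, Wq, Wk, Wv, n_q, n_kv).shape)  # (16, 64)
```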

A survey on model compression for large language models

X Zhu, J Li, Y Liu, C Ma, W Wang - Transactions of the Association for …, 2024 - direct.mit.edu
Large Language Models (LLMs) have transformed natural language processing
tasks successfully. Yet, their large size and high computational needs pose challenges for …

TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings

N Jouppi, G Kurian, S Li, P Ma, R Nagarajan… - Proceedings of the 50th …, 2023 - dl.acm.org
In response to innovations in machine learning (ML) models, production workloads changed
radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its …

Efficiently scaling transformer inference

R Pope, S Douglas, A Chowdhery… - Proceedings of …, 2023 - proceedings.mlsys.org
We study the problem of efficient generative inference for Transformer models, in one of its
most challenging settings: large deep models, with tight latency targets and long sequence …

MobileNetV4: Universal models for the mobile ecosystem

D Qin, C Leichner, M Delakis, M Fornoni, S Luo… - … on Computer Vision, 2024 - Springer
We present the latest generation of MobileNets: MobileNetV4 (MNv4). They feature
universally-efficient architecture designs for mobile devices. We introduce the Universal …

The case for 4-bit precision: k-bit inference scaling laws

T Dettmers, L Zettlemoyer - International Conference on …, 2023 - proceedings.mlr.press
Quantization methods reduce the number of bits required to represent each parameter in a
model, trading accuracy for smaller memory footprints and inference latencies. However, the …
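As a rough illustration of the trade-off the snippet describes, the sketch below performs symmetric block-wise round-to-nearest k-bit quantization in NumPy: each block stores low-bit integers plus one scale, so memory drops from 32 (or 16) bits per weight to roughly k bits. It is a generic illustration under assumed block size and rounding scheme, not the specific quantization methods or data types evaluated in the paper.

```python
import numpy as np

def quantize_kbit(w, k=4, block_size=64):
    """Symmetric round-to-nearest k-bit quantization over contiguous blocks."""
    flat = w.reshape(-1, block_size)
    qmax = 2 ** (k - 1) - 1                    # e.g. 7 levels each side for 4-bit signed
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(flat / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    """Recover an approximate float tensor from the k-bit codes and block scales."""
    return (q * scale).reshape(shape).astype(np.float32)

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)
q, scale = quantize_kbit(w, k=4)
w_hat = dequantize(q, scale, w.shape)
print("mean abs quantization error:", np.abs(w - w_hat).mean())
```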

FlashAttention: Fast and memory-efficient exact attention with IO-awareness

T Dao, D Fu, S Ermon, A Rudra… - Advances in neural …, 2022 - proceedings.neurips.cc
Transformers are slow and memory-hungry on long sequences, since the time and memory
complexity of self-attention are quadratic in sequence length. Approximate attention …
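The quadratic memory cost comes from materializing the full n-by-n attention matrix. The NumPy sketch below shows how processing key/value tiles with a running (online) softmax avoids storing that matrix while still computing exact attention; it is only an illustration of the tiling idea, with made-up tile sizes, not the paper's fused, IO-aware GPU kernel.

```python
import numpy as np

def tiled_attention(q, k, v, block=64):
    """Exact attention computed over key/value tiles with a running (online)
    softmax, so the full n x n score matrix is never materialized."""
    n, d = q.shape
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)                 # running softmax maximum per query
    row_sum = np.zeros(n)                         # running softmax normalizer per query
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)                 # scores for this tile only: (n, block)
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)    # rescale previously accumulated results
        p = np.exp(s - new_max[:, None])
        out = out * correction[:, None] + p @ vb
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

# Agrees with the naive O(n^2)-memory computation up to floating-point error.
rng = np.random.default_rng(0)
n, d = 256, 32
q, k, v = rng.standard_normal((3, n, d))
scores = q @ k.T / np.sqrt(d)
probs = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (probs / probs.sum(axis=1, keepdims=True)) @ v
print(np.allclose(tiled_attention(q, k, v), reference))  # True
```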