The future of computing beyond Moore's Law

J Shalf - Philosophical Transactions of the Royal Society …, 2020 - royalsocietypublishing.org
Moore's Law is a techno-economic model that has enabled the information technology
industry to double the performance and functionality of digital electronics roughly every 2 …

Efficient hardware architectures for accelerating deep neural networks: Survey

P Dhilleswararao, S Boppu, MS Manikandan… - IEEE …, 2022 - ieeexplore.ieee.org
In the modern-day era of technology, a paradigm shift has been witnessed in the areas
involving applications of Artificial Intelligence (AI), Machine Learning (ML), and Deep …

Mamba: Linear-time sequence modeling with selective state spaces

A Gu, T Dao - arXiv preprint arXiv:2312.00752, 2023 - minjiazhang.github.io
Foundation models, now powering most of the exciting applications in deep learning, are
almost universally based on the Transformer architecture and its core attention module …
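The "selective state spaces" in the title refer to a recurrence whose parameters depend on the current input, which lets the model process a sequence in linear time instead of attention's quadratic time. The sketch below is a minimal single-channel illustration of that idea in NumPy; the names `W_B`, `W_C`, and `w_delta` and the scalar-input setup are assumptions for the example, not Mamba's actual parameterization or its hardware-aware parallel scan.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, w_delta):
    """Minimal selective state-space recurrence for one input channel: the state
    update costs O(1) per token (linear time overall), and B, C and the step
    size delta are computed from the input, which is the 'selective' part."""
    seq, d_state = x.shape[0], A.shape[0]
    h = np.zeros(d_state)
    y = np.zeros(seq)
    for t in range(seq):
        delta = np.log1p(np.exp(w_delta * x[t]))  # softplus keeps the step size positive
        B = W_B * x[t]                            # input-dependent input projection
        C = W_C * x[t]                            # input-dependent output projection
        A_bar = np.exp(delta * A)                 # discretize the diagonal state matrix
        h = A_bar * h + delta * B * x[t]          # linear recurrence, no attention matrix
        y[t] = C @ h
    return y

rng = np.random.default_rng(0)
seq, d_state = 128, 16
x = rng.standard_normal(seq)
A = -np.abs(rng.standard_normal(d_state))         # negative entries keep the state stable
W_B, W_C = rng.standard_normal(d_state), rng.standard_normal(d_state)
print(selective_ssm(x, A, W_B, W_C, w_delta=0.5).shape)  # (128,)
```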

GQA: Training generalized multi-query transformer models from multi-head checkpoints

J Ainslie, J Lee-Thorp, M De Jong… - arXiv preprint arXiv …, 2023 - arxiv.org
Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up
decoder inference. However, MQA can lead to quality degradation, and moreover it may not …
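As the snippet notes, MQA shares a single key-value head across all query heads; grouped-query attention (GQA) interpolates between that and full multi-head attention by sharing each key-value head across a group of query heads, which shrinks the KV cache during decoding. The NumPy sketch below illustrates only the head-sharing pattern under assumed names and shapes; it omits masking, caching, and the output projection.

```python
import numpy as np

def grouped_query_attention(x, Wq, Wk, Wv, num_q_heads, num_kv_heads):
    """Toy grouped-query attention: num_kv_heads key/value heads are shared
    across groups of query heads. num_kv_heads == num_q_heads gives ordinary
    multi-head attention; num_kv_heads == 1 gives multi-query attention."""
    seq, d_model = x.shape
    d_head = d_model // num_q_heads
    group_size = num_q_heads // num_kv_heads

    # Queries get num_q_heads heads; keys/values get only num_kv_heads heads.
    q = (x @ Wq).reshape(seq, num_q_heads, d_head)
    k = (x @ Wk).reshape(seq, num_kv_heads, d_head)
    v = (x @ Wv).reshape(seq, num_kv_heads, d_head)

    outputs = []
    for h in range(num_q_heads):
        kv = h // group_size                            # query head h reads KV head kv
        scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        outputs.append(weights @ v[:, kv])
    return np.concatenate(outputs, axis=-1)             # (seq, d_model)

# Example: 8 query heads sharing 2 KV heads, so the KV projections are 4x smaller.
rng = np.random.default_rng(0)
seq, d_model, n_q, n_kv = 16, 64, 8, 2
d_head = d_model // n_q
x = rng.standard_normal((seq, d_model))
Wq = 0.1 * rng.standard_normal((d_model, d_model))
Wk = 0.1 * rng.standard_normal((d_model, n_kv * d_head))
Wv = 0.1 * rng.standard_normal((d_model, n_kv * d_head))
print(grouped_query_attention(x, Wq, Wk, Wv, n_q, n_kv).shape)  # (16, 64)
```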

A survey on model compression for large language models

X Zhu, J Li, Y Liu, C Ma, W Wang - Transactions of the Association for …, 2024 - direct.mit.edu
Large Language Models (LLMs) have transformed natural language processing
tasks successfully. Yet, their large size and high computational needs pose challenges for …

TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings

N Jouppi, G Kurian, S Li, P Ma, R Nagarajan… - Proceedings of the 50th …, 2023 - dl.acm.org
In response to innovations in machine learning (ML) models, production workloads changed
radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its …

Efficiently scaling transformer inference

R Pope, S Douglas, A Chowdhery… - Proceedings of …, 2023 - proceedings.mlsys.org
We study the problem of efficient generative inference for Transformer models, in one of its
most challenging settings: large deep models, with tight latency targets and long sequence …

MobileNetV4: Universal models for the mobile ecosystem

D Qin, C Leichner, M Delakis, M Fornoni, S Luo… - … on Computer Vision, 2024 - Springer
We present the latest generation of MobileNets: MobileNetV4 (MNv4). They feature
universally-efficient architecture designs for mobile devices. We introduce the Universal …

The case for 4-bit precision: k-bit inference scaling laws

T Dettmers, L Zettlemoyer - International Conference on …, 2023 - proceedings.mlr.press
Quantization methods reduce the number of bits required to represent each parameter in a
model, trading accuracy for smaller memory footprints and inference latencies. However, the …
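As a rough illustration of the trade-off the snippet describes, the sketch below performs symmetric block-wise round-to-nearest k-bit quantization in NumPy: each block stores low-bit integers plus one scale, so memory drops from 32 (or 16) bits per weight to roughly k bits. It is a generic illustration under assumed block size and rounding scheme, not the specific quantization methods or data types evaluated in the paper.

```python
import numpy as np

def quantize_kbit(w, k=4, block_size=64):
    """Symmetric round-to-nearest k-bit quantization over contiguous blocks."""
    flat = w.reshape(-1, block_size)
    qmax = 2 ** (k - 1) - 1                    # e.g. 7 levels each side for 4-bit signed
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # avoid divide-by-zero on all-zero blocks
    q = np.clip(np.round(flat / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    """Recover an approximate float tensor from the k-bit codes and block scales."""
    return (q * scale).reshape(shape).astype(np.float32)

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128)).astype(np.float32)
q, scale = quantize_kbit(w, k=4)
w_hat = dequantize(q, scale, w.shape)
print("mean abs quantization error:", np.abs(w - w_hat).mean())
```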

FlashAttention: Fast and memory-efficient exact attention with IO-awareness

T Dao, D Fu, S Ermon, A Rudra… - Advances in neural …, 2022 - proceedings.neurips.cc
Transformers are slow and memory-hungry on long sequences, since the time and memory
complexity of self-attention are quadratic in sequence length. Approximate attention …
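The quadratic memory cost comes from materializing the full n-by-n attention matrix. The NumPy sketch below shows how processing key/value tiles with a running (online) softmax avoids storing that matrix while still computing exact attention; it is only an illustration of the tiling idea, with made-up tile sizes, not the paper's fused, IO-aware GPU kernel.

```python
import numpy as np

def tiled_attention(q, k, v, block=64):
    """Exact attention computed over key/value tiles with a running (online)
    softmax, so the full n x n score matrix is never materialized."""
    n, d = q.shape
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)                 # running softmax maximum per query
    row_sum = np.zeros(n)                         # running softmax normalizer per query
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)                 # scores for this tile only: (n, block)
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)    # rescale previously accumulated results
        p = np.exp(s - new_max[:, None])
        out = out * correction[:, None] + p @ vb
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

# Agrees with the naive O(n^2)-memory computation up to floating-point error.
rng = np.random.default_rng(0)
n, d = 256, 32
q, k, v = rng.standard_normal((3, n, d))
scores = q @ k.T / np.sqrt(d)
probs = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (probs / probs.sum(axis=1, keepdims=True)) @ v
print(np.allclose(tiled_attention(q, k, v), reference))  # True
```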