Understanding and Minimising Outlier Features in Transformer Training

B He, L Noci, D Paliotta, I Schlag… - Advances in Neural …, 2025 - proceedings.neurips.cc
Outlier Features (OFs) are neurons whose activation magnitudes significantly
exceed the average over a neural network's (NN) width. They are well known to emerge …
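
As a loose illustration of this definition (the snippet does not give the paper's exact criterion, so the 5× threshold below is a hypothetical choice), OFs can be flagged by comparing each neuron's typical activation magnitude against the average over the layer's width:

```python
import torch

def outlier_features(acts: torch.Tensor, ratio: float = 5.0) -> torch.Tensor:
    """Flag Outlier Features in a (num_tokens, width) activation matrix.

    `ratio` is a hypothetical threshold, not the paper's metric.
    """
    per_neuron = acts.pow(2).mean(dim=0).sqrt()  # RMS magnitude per neuron
    width_avg = per_neuron.mean()                # average over the width
    return per_neuron > ratio * width_avg        # True where an OF sits

acts = torch.randn(1024, 768)
acts[:, 7] *= 20.0                               # plant one outlier neuron
print(outlier_features(acts).nonzero())          # -> tensor([[7]])
```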

Understanding and minimising outlier features in neural network training

B He, L Noci, D Paliotta, I Schlag… - arXiv preprint arXiv …, 2024 - arxiv.org
Outlier Features (OFs) are neurons whose activation magnitudes significantly exceed the
average over a neural network's (NN) width. They are well known to emerge during standard …

Towards a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation

B Liu, MA Ojewale, Y Ding, M Canini - … of the 15th ACM SIGOPS Asia …, 2024 - dl.acm.org
We propose NeuronaBox, a flexible, user-friendly, and high-fidelity approach to emulate
DNN training workloads. We argue that to accurately observe performance, it is possible to …

Entropy-Guided Attention for Private LLMs

NK Jha, B Reagen - arXiv preprint arXiv:2501.03489, 2025 - arxiv.org
The pervasiveness of proprietary language models has raised critical privacy concerns,
necessitating advancements in private inference (PI), where computations are performed …

AERO: Softmax-Only LLMs for Efficient Private Inference

NK Jha, B Reagen - arXiv preprint arXiv:2410.13060, 2024 - arxiv.org
The pervasiveness of proprietary language models has raised privacy concerns for users'
sensitive data, emphasizing the need for private inference (PI), where inference is performed …

ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models

NK Jha, B Reagen - arXiv preprint arXiv:2410.09637, 2024 - arxiv.org
LayerNorm is a critical component in modern large language models (LLMs) for stabilizing
training and ensuring smooth optimization. However, it introduces significant challenges in …
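
For reference, LayerNorm normalizes each token's features to zero mean and unit variance over the hidden dimension before a learned affine map; the mean, variance, square root, and division involved are the nonlinear operations that make it awkward in settings like private inference (a generic sketch, not code from the paper):

```python
import torch

def layer_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    # Normalize over the hidden (last) dimension, then scale and shift.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

x = torch.randn(2, 4, 8)  # (batch, seq, hidden)
out = layer_norm(x, torch.ones(8), torch.zeros(8))
assert torch.allclose(out, torch.nn.functional.layer_norm(x, (8,)), atol=1e-5)
```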

Testing knowledge distillation theories with dataset size

G Lanzillotta, F Sarnthein, G Kur… - … 2024 Workshop on …, 2024 - openreview.net
The concept of knowledge distillation (KD) describes the training of a student model with a
teacher model and is a widespread technique in deep learning. However, it is still not clear …
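
The student-teacher training referred to here is most often implemented with Hinton-style soft targets; a minimal sketch of that generic loss (not the workshop paper's specific experimental setup):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    # KL divergence between temperature-softened teacher and student
    # distributions; the T^2 factor keeps gradient scale comparable
    # across temperatures.
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

loss = distillation_loss(torch.randn(32, 10), torch.randn(32, 10))
```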

Compositional visual reasoning and generalization with neural networks

A Stanić - 2024 - folia.unifr.ch
Deep neural networks (NNs) have recently revolutionized the field of Artificial Intelligence, making
great progress in computer vision, natural language processing, complex game play …