Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models

B Gao, MW Spratling - arXiv preprint arXiv:2501.13428, 2025 - arxiv.org
Large language models have achieved remarkable success in recent years, largely owing to
the self-attention mechanism. However, traditional Softmax attention …
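The abstract is truncated here, but the title suggests swapping the softmax in attention for a softplus-based weighting. A minimal NumPy sketch of that general idea follows; the function names, the row normalization, and the 1/sqrt(d) scaling are assumptions for illustration, not the paper's exact re-weighting scheme:

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def softplus_attention(q, k, v):
    # Hypothetical sketch: score with scaled dot products,
    # apply softplus instead of exp, then normalize each row
    # so the attention weights sum to 1.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    w = softplus(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v
```

Unlike exp, softplus grows only linearly for large scores, which is one plausible reason such a substitution could behave differently on long sequences.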