On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent

B Li, W Huang, A Han, Z Zhou, T Suzuki, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
The Adam optimizer is widely used for transformer optimization in practice, which makes
understanding the underlying optimization mechanisms an important problem. However …

On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent

B Li, W Huang, A Han, Z Zhou, T Suzuki, J Zhu… - … Conference on Learning … - openreview.net