An empirical study of training end-to-end vision-and-language transformers

ZY Dou, Y Xu, Z Gan, J Wang, S Wang… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision-and-language (VL) pre-training has proven to be highly effective on various
VL downstream tasks. While recent work has shown that fully transformer-based VL models …

Learning deep transformer models for machine translation

Q Wang, B Li, T Xiao, J Zhu, C Li, DF Wong… - arXiv preprint arXiv …, 2019 - arxiv.org
Transformer is the state-of-the-art model in recent machine translation evaluations. Two
strands of research are promising to improve models of this kind: the first uses wide …

Improving image captioning by leveraging intra- and inter-layer global representation in transformer network

J Ji, Y Luo, X Sun, F Chen, G Luo, Y Wu… - Proceedings of the AAAI …, 2021 - ojs.aaai.org
Transformer-based architectures have shown great success in image captioning, where
object regions are encoded and then attended into the vectorial representations to guide the …
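
The snippet only hints at how the global representation is built, so the following is a minimal NumPy sketch of the general idea suggested by the title: pool the region features inside each encoder layer (intra-layer global) and then fuse the per-layer pooled vectors across layers (inter-layer global). Mean pooling and the softmax mixing weights are stand-in assumptions, not the paper's actual aggregation module.

```python
import numpy as np

def intra_layer_global(region_feats):
    """Intra-layer global vector: pool the object-region features of one encoder layer."""
    return region_feats.mean(axis=0)                         # (d,)

def inter_layer_global(per_layer_globals, mix_logits):
    """Inter-layer global vector: softmax-weighted fusion of the per-layer global vectors."""
    w = np.exp(mix_logits) / np.exp(mix_logits).sum()        # weights over layers
    return (w[:, None] * per_layer_globals).sum(axis=0)      # (d,)

# toy usage: 3 encoder layers, 10 object regions, 16-dim features
layer_outputs = [np.random.randn(10, 16) for _ in range(3)]
intra = np.stack([intra_layer_global(h) for h in layer_outputs])   # (3, 16)
global_vec = inter_layer_global(intra, np.zeros(3))                # (16,)
print(global_vec.shape)
```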

Bridgetower: Building bridges between encoders in vision-language representation learning

X Xu, C Wu, S Rosenman, V Lal, W Che… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Vision-Language (VL) models with the Two-Tower architecture have dominated visual-
language representation learning in recent years. Current VL models either use lightweight …

Modeling localness for self-attention networks

B Yang, Z Tu, DF Wong, F Meng, LS Chao… - arXiv preprint arXiv …, 2018 - arxiv.org
Self-attention networks have proven to be of profound value for their strength in capturing
global dependencies. In this work, we propose to model localness for self-attention …
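
The localness idea can be illustrated by adding a Gaussian bias, centred on each query position, to the attention logits before the softmax so that nearby keys are favoured. The NumPy sketch below uses a fixed centre and a hand-set width `sigma` as simplifying assumptions; in the paper these quantities are predicted by the model rather than fixed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_self_attention(q, k, v, sigma=2.0):
    """Scaled dot-product attention with a Gaussian locality bias.

    q, k, v: (seq_len, d) arrays. The bias penalises key positions far from
    the query position, so each token attends mostly to its neighbourhood
    while still seeing the whole sequence.
    """
    seq_len, d = q.shape
    logits = q @ k.T / np.sqrt(d)                       # (seq_len, seq_len)
    pos = np.arange(seq_len)
    dist = pos[None, :] - pos[:, None]                  # key index minus query index
    bias = -(dist ** 2) / (2.0 * sigma ** 2)            # 0 at the centre, negative elsewhere
    return softmax(logits + bias, axis=-1) @ v

# toy usage
x = np.random.randn(6, 8)
print(local_self_attention(x, x, x, sigma=1.5).shape)   # (6, 8)
```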

Rethinking skip connection with layer normalization in transformers and resnets

F Liu, X Ren, Z Zhang, X Sun, Y Zou - arXiv preprint arXiv:2105.07205, 2021 - arxiv.org
Skip connection is a widely-used technique to improve the performance and the
convergence of deep neural networks, which is believed to relieve the difficulty in …
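
Since the entry above is about how skip connections interact with layer normalization, a small sketch of the two standard residual orderings this line of work studies may help. It is a generic NumPy illustration of Post-LN versus Pre-LN blocks, not the specific variant proposed in the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # original Transformer ordering: normalise after the residual addition
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # alternative ordering: normalise the sublayer input, keep the skip path as identity
    return x + sublayer(layer_norm(x))

# toy sublayer and usage
w = np.random.randn(8, 8) * 0.1
ffn = lambda h: np.maximum(h @ w, 0.0)        # a tiny ReLU feed-forward sublayer
x = np.random.randn(4, 8)
print(post_ln_block(x, ffn).shape, pre_ln_block(x, ffn).shape)
```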

Multi-head attention with disagreement regularization

J Li, Z Tu, B Yang, MR Lyu, T Zhang - arXiv preprint arXiv:1810.10183, 2018 - arxiv.org
Multi-head attention is appealing for the ability to jointly attend to information from different
representation subspaces at different positions. In this work, we introduce a disagreement …
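
As a rough illustration of the disagreement idea, the sketch below computes the average pairwise cosine similarity between per-head outputs; adding this term (with a small weight) to the training loss would push heads apart. The paper defines disagreement over subspaces, attended positions, and outputs; this simplified cosine form on outputs only is an assumption for illustration.

```python
import numpy as np

def head_disagreement(head_outputs):
    """Average pairwise cosine similarity between per-head outputs.

    head_outputs: (num_heads, seq_len, d_head). Adding this value (times a
    small weight) to the training loss penalises heads that produce similar
    outputs, encouraging them to attend to different information.
    """
    h = head_outputs.reshape(head_outputs.shape[0], -1)
    h = h / (np.linalg.norm(h, axis=-1, keepdims=True) + 1e-8)
    sim = h @ h.T                                # (num_heads, num_heads)
    n = h.shape[0]
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))

# toy usage: 4 heads, 6 tokens, 16 dims per head
print(head_disagreement(np.random.randn(4, 6, 16)))
```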

On the diversity of multi-head attention

J Li, X Wang, Z Tu, MR Lyu - Neurocomputing, 2021 - Elsevier
Multi-head attention is appealing for the ability to jointly attend to information from different
representation subspaces at different positions. In this work, we propose two approaches to …

Convolutional self-attention networks

B Yang, L Wang, D Wong, LS Chao, Z Tu - arXiv preprint arXiv …, 2019 - arxiv.org
Self-attention networks (SANs) have drawn increasing interest due to their high
parallelization in computation and flexibility in modeling dependencies. SANs can be further …
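
One concrete way to give self-attention a convolution-like receptive field, as the title suggests, is to mask out keys beyond a fixed distance from each query. The NumPy sketch below does exactly that; the hard window mask is a simplification standing in for the paper's convolutional attention variants.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def windowed_self_attention(q, k, v, window=2):
    """Self-attention where each query only sees keys within +/- `window`
    positions, analogous to a 1D convolution's receptive field."""
    seq_len, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    pos = np.arange(seq_len)
    mask = np.abs(pos[None, :] - pos[:, None]) > window
    logits = np.where(mask, -1e9, logits)        # block attention outside the window
    return softmax(logits, axis=-1) @ v

# toy usage
x = np.random.randn(7, 8)
print(windowed_self_attention(x, x, x, window=1).shape)   # (7, 8)
```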

Context-aware self-attention networks

B Yang, J Li, DF Wong, LS Chao, X Wang… - Proceedings of the AAAI …, 2019 - ojs.aaai.org
The self-attention model has shown its flexibility in parallel computation and its effectiveness in
modeling both long- and short-term dependencies. However, it calculates the dependencies …
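
The snippet breaks off at "it calculates the dependencies …", but the underlying point is that vanilla self-attention scores each query-key pair in isolation. Below is a minimal, assumption-laden NumPy sketch of one way to inject context: mix a global context vector (here just the mean of the layer input) into the queries and keys before scoring. The fixed mixing weight `lam` stands in for the paper's learned gating.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_aware_attention(x, wq, wk, wv, lam=0.5):
    """Self-attention whose queries and keys are mixed with a global context
    vector (here simply the mean of the layer input); `lam` is a fixed mixing
    weight standing in for a learned gate."""
    c = x.mean(axis=0, keepdims=True)            # (1, d) global context
    q = (x + lam * c) @ wq
    k = (x + lam * c) @ wk
    v = x @ wv
    logits = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(logits, axis=-1) @ v

# toy usage
d = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(5, d))
proj = lambda: rng.normal(size=(d, d)) * 0.1
print(context_aware_attention(x, proj(), proj(), proj()).shape)   # (5, 8)
```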