A comprehensive survey on applications of transformers for deep learning tasks

S Islam, H Elmekki, A Elsebai, J Bentahar… - Expert Systems with …, 2024 - Elsevier
Abstract: Transformers are Deep Neural Networks (DNNs) that utilize a self-attention
mechanism to capture contextual relationships within sequential data. Unlike traditional …
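
The self-attention mechanism referred to here is, at its core, scaled dot-product attention: each token's output is a softmax-weighted combination of every token's value vector. Below is a minimal NumPy sketch of that computation; the single-head setup, toy dimensions, and random weights are illustrative assumptions, not the configuration of any model covered in the survey.

```python
# Minimal sketch of scaled dot-product self-attention (single head, NumPy).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # pairwise similarities, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                         # each token: weighted mix of all tokens

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # 4 tokens, d_model = 8 (toy values)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (4, 8)
```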

A survey on video diffusion models

Z Xing, Q Feng, H Chen, Q Dai, H Hu, H Xu… - ACM Computing …, 2024 - dl.acm.org
The recent wave of AI-generated content (AIGC) has witnessed substantial success in
computer vision, with the diffusion model playing a crucial role in this achievement. Due to …

Videomae v2: Scaling video masked autoencoders with dual masking

L Wang, B Huang, Z Zhao, Z Tong… - Proceedings of the …, 2023 - openaccess.thecvf.com
Scale is the primary factor in building a powerful foundation model that generalizes well
to a variety of downstream tasks. However, it is still challenging to train video …

Scaling vision transformers to 22 billion parameters

M Dehghani, J Djolonga, B Mustafa… - International …, 2023 - proceedings.mlr.press
The scaling of Transformers has driven breakthrough capabilities for language models. At
present, the largest large language models (LLMs) contain upwards of 100B parameters …

Videomamba: State space model for efficient video understanding

K Li, X Li, Y Wang, Y He, Y Wang, L Wang… - European Conference on …, 2024 - Springer
Addressing the dual challenges of local redundancy and global dependencies in video
understanding, this work innovatively adapts Mamba to the video domain. The proposed …
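
Mamba belongs to the family of state space models, which process a token sequence with a linear recurrence rather than pairwise attention, so cost grows linearly with sequence length. The sketch below shows only that generic recurrence applied to a flattened spatiotemporal token sequence; the toy dimensions and fixed (non-selective) parameters are assumptions, and VideoMamba's input-dependent gating and bidirectional scans are omitted.

```python
# Hedged sketch of a plain linear state-space recurrence over video tokens:
# h_t = A h_{t-1} + B x_t,  y_t = C h_t, scanned along the token sequence.
import numpy as np

def ssm_scan(x, A, B, C):
    """x: (T, d_in) tokens; A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                      # sequential scan: cost linear in sequence length
        h = A @ h + B @ x_t            # update hidden state from previous state and input
        ys.append(C @ h)               # per-token readout
    return np.stack(ys)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16 * 14 * 14, 32))   # e.g. 16 frames of 14x14 patches, flattened
A = 0.9 * np.eye(64)                           # stable toy state transition
B = rng.normal(size=(64, 32)) * 0.1
C = rng.normal(size=(32, 64)) * 0.1
print(ssm_scan(tokens, A, B, C).shape)         # (3136, 32)
```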

Retinexformer: One-stage retinex-based transformer for low-light image enhancement

Y Cai, H Bian, J Lin, H Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
When enhancing low-light images, many deep learning algorithms are based on the Retinex
theory. However, the Retinex model does not consider the corruptions hidden in the dark or …
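
The Retinex theory mentioned here models an observed image as the element-wise product of reflectance and illumination, I = R ⊙ L, so enhancement amounts to estimating a smooth illumination map and brightening it. The following is a rough sketch of that idea only, assuming a Gaussian-blur illumination estimate, a gamma adjustment, and SciPy availability; it is not Retinexformer's one-stage transformer pipeline.

```python
# Rough sketch of Retinex-style enhancement: I = R * L, recover R = I / L,
# then relight with a brightened illumination map. Parameters are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter

def retinex_enhance(img, sigma=15.0, gamma=0.5, eps=1e-4):
    """img: float array in [0, 1], shape (H, W) or (H, W, 3)."""
    lum = img.mean(axis=-1, keepdims=True) if img.ndim == 3 else img
    L = gaussian_filter(lum, sigma=sigma) + eps    # crude smooth illumination estimate
    R = img / L                                    # reflectance under the Retinex model
    L_adj = L ** gamma                             # brighten the dark illumination
    return np.clip(R * L_adj, 0.0, 1.0)            # recompose the enhanced image

dark = np.clip(np.random.default_rng(0).random((64, 64, 3)) * 0.2, 0, 1)
print(retinex_enhance(dark).mean() > dark.mean())  # True: output is brighter
```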

Videocomposer: Compositional video synthesis with motion controllability

X Wang, H Yuan, S Zhang, D Chen… - Advances in …, 2023 - proceedings.neurips.cc
The pursuit of controllability as a higher standard of visual content creation has yielded
remarkable progress in customizable image synthesis. However, achieving controllable …

Modelscope text-to-video technical report

J Wang, H Yuan, D Chen, Y Zhang, X Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a
text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio …

Moviechat: From dense token to sparse memory for long video understanding

E Song, W Chai, G Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recently, integrating video foundation models and large language models to build a video
understanding system can overcome the limitations of specific pre-defined vision tasks. Yet …

Hiera: A hierarchical vision transformer without the bells-and-whistles

C Ryali, YT Hu, D Bolya, C Wei, H Fan… - International …, 2023 - proceedings.mlr.press
Modern hierarchical vision transformers have added several vision-specific components in
the pursuit of supervised classification performance. While these components lead to …