Revisiting non-autoregressive transformers for efficient image synthesis

Z Ni, Y Wang, R Zhou, J Guo, J Hu… - Proceedings of the …, 2024 - openaccess.thecvf.com
The field of image synthesis is currently flourishing due to the advancements in diffusion
models. While diffusion models have been successful, their computational intensity has …

Id-animator: Zero-shot identity-preserving human video generation

X He, Q Liu, S Qian, X Wang, T Hu, K Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Generating high-fidelity human video with specified identities has attracted significant
attention in the content generation community. However, existing techniques struggle to …

Adanat: Exploring adaptive policy for token-based image generation

Z Ni, Y Wang, R Zhou, R Lu, J Guo, J Hu, Z Liu… - … on Computer Vision, 2024 - Springer
Recent studies have demonstrated the effectiveness of token-based methods for visual
content generation. As a representative work, non-autoregressive Transformers (NATs) are …

Visual cot: Unleashing chain-of-thought reasoning in multi-modal language models

H Shao, S Qian, H Xiao, G Song, Z Zong… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents Visual CoT, a novel pipeline that leverages the reasoning capabilities of
multi-modal large language models (MLLMs) by incorporating visual Chain-of-Thought …

Extreme image compression using fine-tuned vqgans

Q Mao, T Yang, Y Zhang, Z Wang… - 2024 Data …, 2024 - ieeexplore.ieee.org
Recent advances in generative compression methods have demonstrated remarkable
progress in enhancing the perceptual quality of compressed data, especially in scenarios …

PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding

X Nie, H Jin, Y Yan, X Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Predictive learning models, which aim to predict future frames based on past observations,
are crucial to constructing world models. These models need to maintain low-level …

Text-Animator: Controllable Visual Text Video Generation

L Liu, Q Liu, S Qian, Y Zhou, W Zhou, H Li, L Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
Video generation is a challenging yet pivotal task in various industries, such as gaming, e-
commerce, and advertising. One significant unresolved aspect within T2V is the effective …

Enat: Rethinking spatial-temporal interactions in token-based image synthesis

Z Ni, Y Wang, R Zhou, Y Han, J Guo, Z Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, token-based generation methods have demonstrated their effectiveness in image synthesis.
As a representative example, non-autoregressive Transformers (NATs) can generate decent …