SRFormer: Permuted self-attention for single image super-resolution

Y Zhou, Z Li, CL Guo, S Bai… - Proceedings of the …, 2023 - openaccess.thecvf.com
Previous works have shown that increasing the window size for Transformer-based image
super-resolution models (e.g., SwinIR) can significantly improve the model performance but …

Relational context learning for human-object interaction detection

S Kim, D Jung, M Cho - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Recent state-of-the-art methods for HOI detection typically build on transformer architectures
with two decoder branches, one for human-object pair detection and the other for interaction …

With a little help from your own past: Prototypical memory networks for image captioning

M Barraco, S Sarto, M Cornia… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image captioning, like many tasks involving vision and language, currently relies on
Transformer-based architectures for extracting the semantics in an image and translating it …

TeMO: Towards text-driven 3D stylization for multi-object meshes

X Zhang, BW Yin, Y Chen, Z Lin, Y Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent progress in the text-driven 3D stylization of a single object has been considerably
promoted by CLIP-based methods. However, the stylization of multi-object 3D scenes is still …

DFormer: Rethinking RGB-D representation learning for semantic segmentation

B Yin, X Zhang, Z Li, L Liu, MM Cheng… - arXiv preprint arXiv …, 2023 - arxiv.org
We present DFormer, a novel RGB-D pretraining framework to learn transferable
representations for RGB-D segmentation tasks. DFormer has two new key innovations: 1) …

Referring camouflaged object detection

X Zhang, B Yin, Z Lin, Q Hou, DP Fan… - IEEE Transactions on …, 2025 - ieeexplore.ieee.org
We consider the problem of referring camouflaged object detection (Ref-COD), a new task
that aims to segment specified camouflaged objects based on a small set of referring images …

Cross on cross attention: Deep fusion transformer for image captioning

J Zhang, Y Xie, W Ding, Z Wang - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Numerous studies have shown that in-depth mining of correlations between multi-modal
features can help improve the accuracy of cross-modal data analysis tasks. However, the …

ControlMLLM: Training-free visual prompt learning for multimodal large language models

M Wu, X Cai, J Ji, J Li, O Huang, G Luo, H Fei… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we propose a training-free method to inject visual referring into Multimodal
Large Language Models (MLLMs) through learnable visual token optimization. We observe …

Evaluating and analyzing relationship hallucinations in large vision-language models

M Wu, J Ji, O Huang, J Li, Y Wu, X Sun, R Ji - arXiv preprint arXiv …, 2024 - arxiv.org
The issue of hallucinations is a prevalent concern in existing Large Vision-Language
Models (LVLMs). Previous efforts have primarily focused on investigating object …

Uncertainty-aware image captioning

Z Fei, M Fan, L Zhu, J Huang, X Wei… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
It is well believed that the higher the uncertainty in a word of the caption, the more
inter-correlated context information is required to determine it. However, current image captioning …