SRFormer: Permuted self-attention for single image super-resolution
Previous works have shown that increasing the window size for Transformer-based image
super-resolution models (e.g., SwinIR) can significantly improve the model performance but …
Relational context learning for human-object interaction detection
Recent state-of-the-art methods for HOI detection typically build on transformer architectures
with two decoder branches, one for human-object pair detection and the other for interaction …
With a little help from your own past: Prototypical memory networks for image captioning
Image captioning, like many tasks involving vision and language, currently relies on
Transformer-based architectures for extracting the semantics in an image and translating it …
TeMO: Towards text-driven 3D stylization for multi-object meshes
Recent progress in the text-driven 3D stylization of a single object has been considerably
promoted by CLIP-based methods. However, the stylization of multi-object 3D scenes is still …
DFormer: Rethinking RGBD representation learning for semantic segmentation
We present DFormer, a novel RGB-D pretraining framework to learn transferable
representations for RGB-D segmentation tasks. DFormer has two new key innovations: 1) …
Referring camouflaged object detection
We consider the problem of referring camouflaged object detection (Ref-COD), a new task
that aims to segment specified camouflaged objects based on a small set of referring images …
Cross on cross attention: Deep fusion transformer for image captioning
Numerous studies have shown that in-depth mining of correlations between multi-modal
features can help improve the accuracy of cross-modal data analysis tasks. However, the …
ControlMLLM: Training-free visual prompt learning for multimodal large language models
In this work, we propose a training-free method to inject visual referring into Multimodal
Large Language Models (MLLMs) through learnable visual token optimization. We observe …
Large Language Models (MLLMs) through learnable visual token optimization. We observe …
Evaluating and analyzing relationship hallucinations in large vision-language models
The issue of hallucinations is a prevalent concern in existing Large Vision-Language
Models (LVLMs). Previous efforts have primarily focused on investigating object …
Uncertainty-aware image captioning
It is widely believed that the higher the uncertainty of a word in a caption, the more inter-
correlated context information is required to determine it. However, current image captioning …