InternLM-XComposer2-4KHD: A pioneering large vision-language model handling resolutions from 336 pixels to 4K HD

X Dong, P Zhang, Y Zang, Y Cao… - Advances in …, 2025 - proceedings.neurips.cc
Abstract The Large Vision-Language Model (LVLM) field has seen significant
advancements, yet its progression has been hindered by challenges in comprehending fine …

InternLM-XComposer-2.5: A versatile large vision-language model supporting long-contextual input and output

P Zhang, X Dong, Y Zang, Y Cao, R Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …

AIGIQA-20K: A large database for AI-generated image quality assessment

C Li, T Kou, Y Gao, Y Cao, W Sun… - Proceedings of the …, 2024 - openaccess.thecvf.com
With the rapid advancements in AI-Generated Content (AIGC), AI-Generated Images (AIGIs)
have been widely applied in entertainment, education, and social media. However, due to the …

NaturalBench: Evaluating vision-language models on natural adversarial samples

B Li, Z Lin, W Peng, JD Nyandwi… - Advances in …, 2025 - proceedings.neurips.cc
Vision-language models (VLMs) have made significant progress in recent visual-question-answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …

What If We Recaption Billions of Web Images with LLaMA-3?

X Li, H Tu, M Hui, Z Wang, B Zhao, J Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that
semantically aligning and enriching textual descriptions of these pairs can significantly …

MotionClone: Training-free motion cloning for controllable video generation

P Ling, J Bu, P Zhang, X Dong, Y Zang, T Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Motion-based controllable video generation offers the potential for creating captivating
visual content. Existing methods typically necessitate model training to encode particular …

LoTLIP: Improving language-image pre-training for long text understanding

W Wu, K Zheng, S Ma, F Lu, Y Guo… - Advances in …, 2025 - proceedings.neurips.cc
In this work, we empirically confirm that the key reason for this issue is that the
training images are usually paired with short captions, leaving certain tokens easily …

PyramidDrop: Accelerating your large vision-language models via pyramid visual redundancy reduction

L Xing, Q Huang, X Dong, J Lu, P Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
In large vision-language models (LVLMs), images serve as inputs that carry a wealth of
information. As the idiom "A picture is worth a thousand words" implies, representing a …

E5-V: Universal embeddings with multimodal large language models

T Jiang, M Song, Z Zhang, H Huang, W Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) have shown promising advancements in
general visual and language understanding. However, the representation of multimodal …

FineCLIPER: Multi-modal fine-grained CLIP for dynamic facial expression recognition with adapters

H Chen, H Huang, J Dong, M Zheng… - Proceedings of the 32nd …, 2024 - dl.acm.org
Dynamic Facial Expression Recognition (DFER) is crucial for understanding human
behavior. However, current methods exhibit limited performance mainly due to the …