Blink: Multimodal large language models can see but not perceive

X Fu, Y Hu, B Li, Y Feng, H Wang, X Lin, D Roth… - … on Computer Vision, 2024 - Springer
We introduce Blink, a new benchmark for multimodal large language models (LLMs) that focuses
on core visual perception abilities not found in other evaluations. Most of the Blink tasks can …

Evaluating text-to-visual generation with image-to-text generation

Z Lin, D Pathak, B Li, J Li, X Xia, G Neubig… - … on Computer Vision, 2024 - Springer
Despite significant progress in generative AI, comprehensive evaluation remains
challenging because of the lack of effective metrics and standardized benchmarks. For …

Explainable and interpretable multimodal large language models: A comprehensive survey

Y Dang, K Huang, J Huo, Y Yan, S Huang, D Liu… - arxiv preprint arxiv …, 2024 - arxiv.org
The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with
large language models (LLMs) and computer vision (CV) systems driving advancements in …

Seed-x: Multimodal models with unified multi-granularity comprehension and generation

Y Ge, S Zhao, J Zhu, Y Ge, K Yi, L Song, C Li… - arxiv preprint arxiv …, 2024 - arxiv.org
The rapid evolution of multimodal foundation models has demonstrated significant
progress in vision-language understanding and generation, e.g., our previous work SEED …

Task me anything

J Zhang, W Huang, Z Ma, O Michel, D He… - arxiv preprint arxiv …, 2024 - arxiv.org
Benchmarks for large multimodal language models (MLMs) now serve to simultaneously
assess the general capabilities of models instead of evaluating a specific capability. As a …

Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model

D Muhtar, Z Li, F Gu, X Zhang, P Xiao - European Conference on …, 2024 - Springer
The revolutionary capabilities of large language models (LLMs) have paved the way for
multimodal large language models (MLLMs) and fostered diverse applications across …

Kangaroo: A powerful video-language model supporting long-context video input

J Liu, Y Wang, H Ma, X Wu, X Ma, X Wei, J Jiao… - arxiv preprint arxiv …, 2024 - arxiv.org
Rapid advancements have been made in extending Large Language Models (LLMs) to
Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data …

Vl-trojan: Multimodal instruction backdoor attacks against autoregressive visual language models

J Liang, S Liang, A Liu, X Cao - International Journal of Computer Vision, 2025 - Springer
Abstract Autoregressive Visual Language Models (VLMs) demonstrate remarkable few-shot
learning capabilities within a multimodal context. Recently, multimodal instruction tuning has …

Scifibench: Benchmarking large multimodal models for scientific figure interpretation

J Roberts, K Han, N Houlsby… - Advances in Neural …, 2025 - proceedings.neurips.cc
Large multimodal models (LMMs) have proven flexible and generalisable across many tasks
and fields. Although they have strong potential to aid scientific research, their capabilities in …

Vhelm: A holistic evaluation of vision language models

T Lee, H Tu, CH Wong, W Zheng… - Advances in …, 2025 - proceedings.neurips.cc
Current benchmarks for assessing vision-language models (VLMs) often focus on their
perception or problem-solving capabilities and neglect other critical aspects such as …