E5-V: Universal embeddings with multimodal large language models

T Jiang, M Song, Z Zhang, H Huang, W Deng… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) have shown promising advancements in
general visual and language understanding. However, the representation of multimodal …

Learning commonality, divergence and variety for unsupervised visible-infrared person re-identification

J Shi, X Yin, Y Zhang, Z Zhang, Y Xie, Y Qu - arXiv preprint arXiv …, 2024 - arxiv.org
Unsupervised visible-infrared person re-identification (USVI-ReID) aims to match specified
people in infrared images to visible images without annotations, and vice versa. USVI-ReID …

Progressive multimodal reasoning via active retrieval

G Dong, C Zhang, M Deng, Y Zhu, Z Dou… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large
language models (MLLMs), and finding effective ways to enhance their performance in such …

When Text Embedding Meets Large Language Model: A Comprehensive Survey

Z Nie, Z Feng, M Li, C Zhang, Y Zhang, D Long… - arXiv preprint arXiv …, 2024 - arxiv.org
Text embedding has become a foundational technology in natural language processing
(NLP) during the deep learning era, driving advancements across a wide array of …

InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning

C Xie, S Cai, W Wang, P Li, Z Sang, K Yang… - arXiv preprint arXiv …, 2025 - arxiv.org
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have
made significant advancements in reasoning capabilities. However, they still face …

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

J Zhou, Z Liu, Z Liu, S Xiao, Y Wang, B Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapidly growing demand for multimodal retrieval, progress in this field remains
severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a …

O1 Embedder: Let Retrievers Think Before Action

R Yan, Z Liu, D Lian - arXiv preprint arXiv:2502.07555, 2025 - arxiv.org
The growing power of large language models (LLMs) has revolutionized how people access
and utilize information. Notably, LLMs excel at performing fine-grained data …

Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval

Z Liu, Z Liang, J Zhou, Z Liu, D Lian - arXiv preprint arXiv:2502.11431, 2025 - arxiv.org
With the popularity of multimodal techniques, there is growing interest in acquiring useful
information in visual forms. In this work, we formally define an emerging IR paradigm …

GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

X Zhang, Y Zhang, W Xie, M Li, Z Dai, D Long… - arXiv preprint arXiv …, 2024 - arxiv.org
Universal Multimodal Retrieval (UMR) aims to enable search across various modalities
using a unified model, where queries and candidates can consist of pure text, images, or a …

Fine-grained Video-Text Retrieval: A New Benchmark and Method

Y Xu, X Li, Y Yang, R Huang, L Wang - arXiv preprint arXiv:2501.00513, 2024 - arxiv.org
The ability to perceive fine-grained spatial and temporal information is crucial for video-
language retrieval. However, existing video retrieval benchmarks, such as MSRVTT and …