InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

VideoAgent: Long-Form Video Understanding with Large Language Model as Agent

X Wang, Y Zhang, O Zohar, S Yeung-Levy - European Conference on …, 2024 - Springer
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …

Diffusion feedback helps CLIP see better

W Wang, Q Sun, F Zhang, Y Tang, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world
representations across domains and modalities, has become a foundation for a variety of …

[PDF] Baichuan-Omni technical report

Y Li, H Sun, M Lin, T Li, G Dong, T Zhang… - arXiv preprint arXiv …, 2024 - researchgate.net
The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical
role in practical applications, yet it lacks a high-performing open-source counterpart. In this …

EchoSight: Advancing Visual-Language Models with Wiki Knowledge

Y Yan, W Xie - arXiv preprint arXiv:2407.12735, 2024 - arxiv.org
Knowledge-based Visual Question Answering (KVQA) tasks require answering questions
about images using extensive background knowledge. Despite significant advancements …

OmniBind: Large-scale omni multimodal representation via binding spaces

Z Wang, Z Zhang, H Zhang, L Liu, R Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, human-computer interaction with various modalities has shown promising
applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint …

No filter: Cultural and socioeconomic diversity in contrastive vision-language models

A Pouget, L Beyer, E Bugliarello, X Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We study cultural and socioeconomic diversity in contrastive vision-language models
(VLMs). Using a broad range of benchmark datasets and evaluation metrics, we bring to …

Label Propagation for Zero-shot Classification with Vision-Language Models

V Stojnić, Y Kalantidis, G Tolias - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Vision-Language Models (VLMs) have demonstrated impressive performance on
zero-shot classification, i.e., classification when provided merely with a list of class names. In …

Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models

S Ruan, Y Dong, H Liu, Y Huang, H Su… - European Conference on …, 2024 - Springer
Abstract Vision-Language Pre-training (VLP) models like CLIP have achieved remarkable
success in computer vision and particularly demonstrated superior robustness to distribution …