InternVideo2: Scaling foundation models for multimodal video understanding
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …
VideoAgent: Long-Form Video Understanding with Large Language Model as Agent
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …
Diffusion feedback helps CLIP see better
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world
representations across domains and modalities, has become a foundation for a variety of …
Baichuan-Omni technical report
Y Li, H Sun, M Lin, T Li, G Dong, T Zhang… - arXiv preprint arXiv …, 2024
The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical
role in practical applications, yet it lacks a high-performing open-source counterpart. In this …
EchoSight: Advancing Visual-Language Models with Wiki Knowledge
Knowledge-based Visual Question Answering (KVQA) tasks require answering questions
about images using extensive background knowledge. Despite significant advancements …
OmniBind: Large-scale omni multimodal representation via binding spaces
Recently, human-computer interaction with various modalities has shown promising
applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint …
No filter: Cultural and socioeconomic diversity in contrastive vision-language models
We study cultural and socioeconomic diversity in contrastive vision-language models
(VLMs). Using a broad range of benchmark datasets and evaluation metrics, we bring to …
Label Propagation for Zero-shot Classification with Vision-Language Models
Vision-Language Models (VLMs) have demonstrated impressive performance on
zero-shot classification, i.e., classification when provided merely with a list of class names. In …
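For context, the zero-shot setup this abstract refers to scores an image against text embeddings of the class names alone. Below is a minimal sketch using a CLIP-style model via Hugging Face Transformers; the checkpoint, prompt template, and file name are illustrative assumptions and are not taken from the paper, which goes further by propagating labels beyond this per-image matching.

```python
# Minimal sketch of zero-shot classification from a list of class names,
# as described in the abstract. Checkpoint, prompt template, and image
# path are illustrative assumptions, not details from the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["dog", "cat", "car"]                    # only class names are given
prompts = [f"a photo of a {c}" for c in class_names]   # simple prompt template
image = Image.open("example.jpg")                       # any query image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores -> probabilities over the class names
probs = outputs.logits_per_image.softmax(dim=-1)
pred = class_names[probs.argmax(dim=-1).item()]
print(pred, probs.tolist())
```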
Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models
Vision-Language Pre-training (VLP) models like CLIP have achieved remarkable
success in computer vision and particularly demonstrated superior robustness to distribution …