InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

VideoAgent: Long-Form Video Understanding with Large Language Model as Agent

X Wang, Y Zhang, O Zohar, S Yeung-Levy - European Conference on …, 2024 - Springer
Long-form video understanding represents a significant challenge within computer vision,
demanding a model capable of reasoning over long multi-modal sequences. Motivated by …

Diffusion feedback helps CLIP see better

W Wang, Q Sun, F Zhang, Y Tang, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world
representations across domains and modalities, has become a foundation for a variety of …

[PDF] Baichuan-Omni technical report

Y Li, H Sun, M Lin, T Li, G Dong, T Zhang… - arXiv preprint arXiv …, 2024 - researchgate.net
The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical
role in practical applications, yet it lacks a high-performing open-source counterpart. In this …

EchoSight: Advancing Visual-Language Models with Wiki Knowledge

Y Yan, W Xie - arXiv preprint arXiv:2407.12735, 2024 - arxiv.org
Knowledge-based Visual Question Answering (KVQA) tasks require answering questions
about images using extensive background knowledge. Despite significant advancements …

OmniBind: Large-scale omni multimodal representation via binding spaces

Z Wang, Z Zhang, H Zhang, L Liu, R Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, human-computer interaction with various modalities has shown promising
applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint …

No filter: Cultural and socioeconomic diversity in contrastive vision-language models

A Pouget, L Beyer, E Bugliarello, X Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We study cultural and socioeconomic diversity in contrastive vision-language models
(VLMs). Using a broad range of benchmark datasets and evaluation metrics, we bring to …

Label Propagation for Zero-shot Classification with Vision-Language Models

V Stojnić, Y Kalantidis, G Tolias - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Vision-Language Models (VLMs) have demonstrated impressive performance on
zero-shot classification, i.e., classification when provided merely with a list of class names. In …

Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models

S Ruan, Y Dong, H Liu, Y Huang, H Su… - European Conference on …, 2024 - Springer
Abstract Vision-Language Pre-training (VLP) models like CLIP have achieved remarkable
success in computer vision and particularly demonstrated superior robustness to distribution …