Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Videomamba: State space model for efficient video understanding

K Li, X Li, Y Wang, Y He, Y Wang, L Wang… - European Conference on …, 2024 - Springer
Addressing the dual challenges of local redundancy and global dependencies in video
understanding, this work adapts the Mamba architecture to the video domain. The proposed …
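
For context, a minimal sketch of the kind of discretized state-space recurrence that Mamba-style models run over flattened video tokens; shapes, names, and the fixed (non-selective) parameters here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal diagonal state-space recurrence over a token sequence.

    x: (T, D) sequence of flattened video patch tokens
    A: (D, N) per-channel state decay (already discretized, |A| < 1)
    B: (D, N) input projection into the hidden state
    C: (D, N) readout from the hidden state
    Returns y: (T, D), one output per token.
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                 # hidden state per channel
    y = np.empty((T, D))
    for t in range(T):                   # linear-time scan, no attention matrix
        h = A * h + B * x[t][:, None]    # h_t = A * h_{t-1} + B x_t
        y[t] = (C * h).sum(axis=1)       # y_t = C . h_t
    return y

# Toy usage: 16 frames x 196 patches flattened into one token sequence
T, D, N = 16 * 196, 64, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((T, D))
A = np.full((D, N), 0.9)                 # decaying state keeps a long-range summary
B = rng.standard_normal((D, N)) * 0.1
C = rng.standard_normal((D, N)) * 0.1
print(ssm_scan(x, A, B, C).shape)        # (3136, 64)
```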

Vid2seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
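
To illustrate the single-stage formulation, a minimal sketch of serializing timed captions into one target sequence with discretized time tokens; the token format, bin count, and helper name are assumptions for illustration, not taken from the paper.

```python
def serialize_events(events, duration, num_time_tokens=100):
    """Turn timed captions into a single target string with time tokens.

    events:   list of (start_sec, end_sec, caption) tuples
    duration: video length in seconds, used to quantize times into bins
    Each timestamp becomes a special token like <time_37>, so one output
    sequence carries both localization and text for every event.
    """
    parts = []
    for start, end, caption in sorted(events):
        s = min(int(start / duration * num_time_tokens), num_time_tokens - 1)
        e = min(int(end / duration * num_time_tokens), num_time_tokens - 1)
        parts.append(f"<time_{s}> <time_{e}> {caption}")
    return " ".join(parts)

print(serialize_events([(2.0, 7.5, "a person opens the fridge"),
                        (8.0, 15.0, "they pour a glass of milk")], duration=30.0))
```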

mplug-2: A modularized multi-modal foundation model across text, image and video

H Xu, Q Ye, M Yan, Y Shi, J Ye, Y Xu… - International …, 2023 - proceedings.mlr.press
Recent years have witnessed a big convergence of language, vision, and multi-modal
pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized …

Panda-70m: Captioning 70m videos with multiple cross-modality teachers

TS Chen, A Siarohin, W Menapace… - Proceedings of the …, 2024 - openaccess.thecvf.com
The quality of the data and annotation upper-bounds the quality of a downstream model.
While there exist large text corpora and image-text pairs, high-quality video-text data is much …
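
The snippet points to pooling captions from several cross-modality teacher models and keeping the best one per clip. A minimal sketch of that selection step, assuming candidate captions and the clip are already embedded in a shared video-text space; the function and its inputs are hypothetical, not the paper's pipeline.

```python
import numpy as np

def pick_best_caption(video_emb, caption_embs):
    """Return the index of the candidate caption most similar to the clip.

    video_emb:    (D,) embedding of the video clip
    caption_embs: (K, D) embeddings of K teacher-generated captions
    Both are assumed to come from a shared video-text embedding space.
    """
    v = video_emb / np.linalg.norm(video_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    return int(np.argmax(c @ v))   # cosine similarity, highest wins

# Toy usage with random vectors standing in for real encoder outputs
rng = np.random.default_rng(0)
video_emb = rng.standard_normal(512)
caption_embs = rng.standard_normal((5, 512))   # 5 cross-modality teachers
print(pick_best_caption(video_emb, caption_embs))
```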

Learning open-vocabulary semantic segmentation models from natural language supervision

J Xu, J Hou, Y Zhang, R Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this paper, we consider the problem of open-vocabulary semantic segmentation (OVS),
which aims to segment objects of arbitrary classes instead of pre-defined, closed-set …
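
A common way to realize open-vocabulary segmentation is to compare per-pixel visual features against text embeddings of arbitrary class names; the sketch below shows that matching step under assumed shapes and helper names, not the paper's specific method.

```python
import numpy as np

def open_vocab_segment(pixel_feats, class_text_embs):
    """Assign each pixel to the class whose text embedding it matches best.

    pixel_feats:     (H, W, D) per-pixel features from a visual encoder
    class_text_embs: (C, D) embeddings of free-form class names
    Returns an (H, W) map of class indices.
    """
    H, W, D = pixel_feats.shape
    p = pixel_feats.reshape(-1, D)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    t = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    logits = p @ t.T                               # cosine similarity per pixel and class
    return logits.argmax(axis=1).reshape(H, W)

# Toy usage: the class list can be any strings a text encoder accepts
rng = np.random.default_rng(0)
pixel_feats = rng.standard_normal((32, 32, 256))
class_text_embs = rng.standard_normal((4, 256))   # e.g. "zebra", "grass", "sky", "road"
print(open_vocab_segment(pixel_feats, class_text_embs).shape)  # (32, 32)
```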

Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance; these methods pursue semantic interaction upon pre …
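
For reference, the CLIP-style contrastive objective the snippet refers to can be written as a symmetric cross-entropy over a batch similarity matrix; a minimal sketch of that baseline loss, not the paper's Banzhaf-interaction formulation.

```python
import numpy as np

def clip_style_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings.

    video_embs, text_embs: (B, D); row i of each is a matched pair.
    """
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature               # (B, B) similarity matrix

    def xent(l):                                   # cross-entropy with diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    return 0.5 * (xent(logits) + xent(logits.T))   # video->text and text->video

rng = np.random.default_rng(0)
print(clip_style_loss(rng.standard_normal((8, 512)), rng.standard_normal((8, 512))))
```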

Cap4video: What can auxiliary captions do for text-video retrieval?

W Wu, H Luo, B Fang, J Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Most existing text-video retrieval methods focus on cross-modal matching between the
visual content of videos and textual query sentences. However, in real-world scenarios …
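
The title suggests using auxiliary (generated) captions alongside the video itself at retrieval time. One simple way to use them is to mix query-video and query-caption similarities when ranking candidates, sketched below under that assumption; the weighting and helper names are illustrative, not the paper's exact fusion.

```python
import numpy as np

def rank_videos(query_emb, video_embs, caption_embs, alpha=0.5):
    """Rank videos for a text query, mixing video and auxiliary-caption scores.

    query_emb:    (D,) embedding of the query sentence
    video_embs:   (N, D) embeddings of the candidate videos
    caption_embs: (N, D) embeddings of one auxiliary caption per video
    alpha:        weight on the query-video score vs. the query-caption score
    Returns candidate indices sorted from best to worst match.
    """
    norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    q, v, c = norm(query_emb), norm(video_embs), norm(caption_embs)
    scores = alpha * (v @ q) + (1 - alpha) * (c @ q)
    return np.argsort(-scores)

rng = np.random.default_rng(0)
order = rank_videos(rng.standard_normal(512),
                    rng.standard_normal((100, 512)),
                    rng.standard_normal((100, 512)))
print(order[:5])   # indices of the top-5 candidates
```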

Valor: Vision-audio-language omni-perception pretraining model and dataset

S Chen, X He, L Guo, X Zhu, W Wang, J Tang… - arxiv preprint arxiv …, 2023 - arxiv.org
In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multi-modal understanding and generation. Different from widely-studied vision …

Hitea: Hierarchical temporal-aware video-language pre-training

Q Ye, G Xu, M Yan, H Xu, Q Qian… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video-language pre-training has advanced the performance of various downstream video-
language tasks. However, most previous methods directly inherit or adapt typical image …