Self-supervised learning for videos: A survey
The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …
Video description: A survey of methods, datasets, and evaluation metrics
Video description is the automatic generation of natural language sentences that describe
the contents of a given video. It has applications in human-robot interaction, helping the …
Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …
ULIP-2: Towards scalable multimodal pre-training for 3D understanding
Recent advancements in multimodal pre-training have shown promising efficacy in 3D
representation learning by aligning multimodal features across 3D shapes, their 2D …
Ferret: Refer and ground anything anywhere at any granularity
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of
understanding spatial referring of any shape or granularity within an image and accurately …
End-to-end dense video captioning with parallel decoding
Dense video captioning aims to generate multiple associated captions with their temporal
locations from the video. Previous methods follow a sophisticated "localize-then-describe" …
HERO: Hierarchical encoder for video+language omni-representation pre-training
We present HERO, a novel framework for large-scale video+language omni-representation
learning. HERO encodes multimodal inputs in a hierarchical structure, where local context of …
Unified vision-language pre-training for image captioning and VQA
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is
unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image …
ActBERT: Learning global-local video-text representations
In this paper, we introduce ActBERT for self-supervised learning of joint video-text
representations from unlabeled data. First, we leverage global action information to catalyze …
Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100
This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M …