- Academic Search

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - ar…

… an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and …

Cited by 584

[PDF] thecvf.com

Objaverse: A universe of annotated 3D objects

M Deitke, D Schwenk, J Salvador… - Proceedings of the …, 2023 - openaccess.thecvf.com

Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and
LAION have propelled recent dramatic progress in AI. Large neural models trained on such …

Cited by 752

InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Cited by 118

[PDF] thecvf.com

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …

Cited by 238

[PDF] neurips.cc

Any-to-any generation via composable diffusion

Z Tang, Z Yang, C Zhu, M Zeng… - Advances in Neural …, 2023 - proceedings.neurips.cc

We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image, video, or audio …

Cited by 147

[PDF] arxiv.org

GPT4RoI: Instruction tuning large language model on region-of-interest

S Zhang, P Sun, S Chen, M Xiao, W Shao… - arXiv preprint arXiv …, 2023 - arxiv.org

Instruction tuning large language model (LLM) on image-text pairs has achieved
unprecedented vision-language multimodal abilities. However, their vision-language …

Cited by 208

[PDF] neurips.cc

Flamingo: a visual language model for few-shot learning

JB Alayrac, J Donahue, P Luc… - Advances in neural …, 2022 - proceedings.neurips.cc

Building models that can be rapidly adapted to novel tasks using only a handful of annotated
examples is an open challenge for multimodal machine learning research. We introduce …

Cited by 3703

MERLOT Reserve: Neural script knowledge through vision and language and sound

MM-LLMs: Recent advances in multimodal large language models