Mm-llms: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - ar** an end-to-end chat-centric video
understanding system, coined as VideoChat. It integrates video foundation models and …

Objaverse: A universe of annotated 3d objects

M Deitke, D Schwenk, J Salvador… - Proceedings of the …, 2023 - openaccess.thecvf.com
Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and
LAION have propelled recent dramatic progress in AI. Large neural models trained on such …

Internvideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Vid2seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …

Any-to-any generation via composable diffusion

Z Tang, Z Yang, C Zhu, M Zeng… - Advances in Neural …, 2023 - proceedings.neurips.cc
Abstract We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image, video, or audio …

Gpt4roi: Instruction tuning large language model on region-of-interest

S Zhang, P Sun, S Chen, M **ao, W Shao… - arxiv preprint arxiv …, 2023 - arxiv.org
Instruction tuning large language model (LLM) on image-text pairs has achieved
unprecedented vision-language multimodal abilities. However, their vision-language …

Flamingo: a visual language model for few-shot learning

JB Alayrac, J Donahue, P Luc… - Advances in neural …, 2022 - proceedings.neurips.cc
Building models that can be rapidly adapted to novel tasks using only a handful of annotated
examples is an open challenge for multimodal machine learning research. We introduce …