Instruction tuning for large language models: A survey

S Zhang, L Dong, X Li, S Zhang, X Sun, S Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

VideoChat: Chat-centric video understanding

K Li, Y He, Y Wang, Y Li, W Wang, P Luo… - arXiv preprint arXiv …, 2023 - arxiv.org
… an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and …

Visual ChatGPT: Talking, drawing and editing with visual foundation models

C Wu, S Yin, W Qi, X Wang, Z Tang, N Duan - arXiv preprint arXiv …, 2023 - arxiv.org
ChatGPT is attracting cross-field interest as it provides a language interface with
remarkable conversational competency and reasoning capabilities across many domains …

Objaverse: A universe of annotated 3D objects

M Deitke, D Schwenk, J Salvador… - Proceedings of the …, 2023 - openaccess.thecvf.com
Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and
LAION have propelled recent dramatic progress in AI. Large neural models trained on such …

InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …

TimeChat: A time-sensitive multimodal large language model for long video understanding

S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …

Unified-IO 2: Scaling autoregressive multimodal models with vision, language, audio, and action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

InternVideo: General video foundation models via generative and discriminative learning

Y Wang, K Li, Y Li, Y He, B Huang, Z Zhao… - arXiv preprint arXiv …, 2022 - arxiv.org
Foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …