Instruction tuning for large language models: A survey

S Zhang, L Dong, X Li, S Zhang, X Sun, S Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

VideoChat: Chat-centric video understanding

K Li, Y He, Y Wang, Y Li, W Wang, P Luo… - arXiv preprint arXiv …, 2023 - arxiv.org
… an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and …

Visual ChatGPT: Talking, drawing and editing with visual foundation models

C Wu, S Yin, W Qi, X Wang, Z Tang, N Duan - arXiv preprint arXiv …, 2023 - arxiv.org
ChatGPT is attracting cross-field interest as it provides a language interface with
remarkable conversational competency and reasoning capabilities across many domains …

Objaverse: A universe of annotated 3D objects

M Deitke, D Schwenk, J Salvador… - Proceedings of the …, 2023 - openaccess.thecvf.com
Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and
LAION have propelled recent dramatic progress in AI. Large neural models trained on such …

InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …

TimeChat: A time-sensitive multimodal large language model for long video understanding

S Ren, L Yao, S Li, X Sun… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …

Unified-IO 2: Scaling autoregressive multimodal models with vision, language, audio, and action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

InternVideo: General video foundation models via generative and discriminative learning

Y Wang, K Li, Y Li, Y He, B Huang, Z Zhao… - arXiv preprint arXiv …, 2022 - arxiv.org
Foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …