How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

A survey of multimodal large language model from a data-centric perspective

T Bai, H Liang, B Wan, Y Xu, X Li, S Li, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …

Is Sora a world simulator? A comprehensive survey on general world models and beyond

Z Zhu, X Wang, W Zhao, C Min, N Deng, M Dou… - arXiv preprint arXiv …, 2024 - arxiv.org
General world models represent a crucial pathway toward achieving Artificial General
Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual …

Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling

Z Chen, W Wang, Y Cao, Y Liu, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series
that builds upon InternVL 2.0, maintaining its core model architecture while introducing …

LongVU: Spatiotemporal adaptive compression for long video-language understanding

X Shen, Y Xiong, C Zhao, L Wu, J Chen, C Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) have shown promising progress in
understanding and analyzing video content. However, processing long videos remains a …

ChronoMagic-Bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation

S Yuan, J Huang, Y Xu, Y Liu, S Zhang, Y Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to
evaluate the temporal and metamorphic capabilities of the T2V models (e.g., Sora and …

UniMD: Towards unifying moment retrieval and temporal action detection

Y Zeng, Y Zhong, C Feng, L Ma - European Conference on Computer …, 2024 - Springer
Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while
Moment Retrieval (MR) aims to identify the events described by open-ended natural …

Grounded-VideoLLM: Sharpening fine-grained temporal grounding in video large language models

H Wang, Z Xu, Y Cheng, S Diao, Y Zhou, Y Cao… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in
coarse-grained video understanding; however, they struggle with fine-grained temporal …

Apollo: An exploration of video understanding in large multimodal models

O Zohar, X Wang, Y Dubois, N Mehta, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the rapid integration of video perception capabilities into Large Multimodal Models
(LMMs), the underlying mechanisms driving their video understanding remain poorly …