Video understanding with large language models: A survey

Y Tang, J Bi, S Xu, L Song, S Liang, T Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …

A survey on multimodal benchmarks: In the era of large AI models

L Li, G Chen, H Shi, J Xiao, L Chen - arXiv preprint arXiv:2409.18142, 2024 - arxiv.org
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …

Progress-Aware Video Frame Captioning

Z Xue, J An, X Yang, K Grauman - arXiv preprint arXiv:2412.02071, 2024 - arxiv.org
While image captioning provides isolated descriptions for individual images, and video
captioning offers one single narrative for an entire video clip, our work explores an important …

Describe Now: User-Driven Audio Description for Blind and Low Vision Individuals

M Cheema, H Seifi, P Fazli - arXiv preprint arXiv:2411.11835, 2024 - arxiv.org
Audio descriptions (AD) make videos accessible for blind and low vision (BLV) users by
describing visual elements that cannot be understood from the main audio track. AD created …

SemiKong: Curating, Training, and Evaluating A Semiconductor Industry-Specific Large Language Model

C Nguyen, W Nguyen, A Suzuki, D Oku… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated the potential to address some issues
within the semiconductor industry. However, they are often general-purpose models that …

Coherent Physical Commonsense Reasoning in Foundational Language Models

S Storks - 2024 - deepblue.lib.umich.edu
Recent years in natural language processing (NLP) research have seen a paradigm shift
toward foundational language models (LMs), which are self-supervised, transformer-based …

User-Driven Automated Audio Description to Enhance Video Accessibility for Blind and Low Vision Users

MS Cheema - 2024 - search.proquest.com
Audio descriptions (AD) make videos accessible for blind and low vision (BLV) users by
describing visual elements that cannot be understood from the main audio track. AD created …

STATE-AWARE OBJECT UNDERSTANDING

NW Nguyen - 2024 - nguyennm1024.github.io
The advent of potent multimodal large language models alongside expansive datasets has
markedly advanced visual understanding tasks. While the bulk of research in this domain …