Video instruction tuning with synthetic data
The development of video large multimodal models (LMMs) has been hindered by the
difficulty of curating large amounts of high-quality raw data from the web. To address this, we …
LongVILA: Scaling long-context visual language models for long videos
Long-context capability is critical for multi-modal foundation models, especially for long
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …
LongVLM: Efficient long video understanding via large language models
Empowered by Large Language Models (LLMs), recent advancements in
Video-based LLMs (VideoLLMs) have driven progress in various video understanding tasks. These …
MLP can be a good transformer learner
Self-attention mechanism is the key of the Transformer but often criticized for its computation
demands. Previous token pruning works motivate their methods from the view of …
V2Xum-LLM: Cross-modal video summarization with temporal prompt instruction tuning
Video summarization aims to create short, accurate, and cohesive summaries of longer
videos. Despite the existence of various video summarization datasets, a notable limitation …
VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in
numerous applications. However, the emphasis on brief summary texts during pre-training …
Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies
Audio descriptions (ADs) function as acoustic commentaries designed to assist blind
persons and persons with visual impairments in accessing digital media content on …
What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations
Transforming recorded videos into concise and accurate textual summaries is a growing
challenge in multimodal learning. This paper introduces VISTA, a dataset specifically …
COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation
This paper introduces the COCONut-PanCap dataset, created to enhance panoptic
segmentation and grounded image captioning. Building upon the COCO dataset with …
Shotluck Holmes: A family of efficient small-scale large language vision models for video captioning and summarization
Video is an increasingly prominent and information-dense medium, yet it poses substantial
challenges for language models. A typical video consists of a sequence of shorter segments …