MA-LMM: Memory-augmented large multimodal model for long-term video understanding
With the success of large language models (LLMs), integrating the vision model into LLMs to
build vision-language foundation models has gained much more interest recently. However …
OmniTokenizer: A joint image-video tokenizer for visual generation
Tokenizer, serving as a translator to map the intricate visual data into a compact latent
space, lies at the core of visual generative models. Based on the finding that existing …
Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
Exploring pre-trained text-to-video diffusion models for referring video object segmentation
In this paper, we explore the visual representations produced from a pre-trained
text-to-video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent …
TrafficVLM: A controllable visual language model for traffic video captioning
Traffic video description and analysis have received much attention recently due to the
growing demand for efficient and reliable urban surveillance systems. Most existing methods …
AID: Adapting image2video diffusion models for instruction-guided video prediction
Text-guided video prediction (TVP) involves predicting the motion of future frames from the
initial frame according to an instruction, which has wide applications in virtual reality …
OmniTracker: Unifying Visual Object Tracking by Tracking-with-Detection
Visual Object Tracking (VOT) aims to estimate the positions of target objects in a video
sequence, which is an important vision task with various real-world applications. Depending …
Do language models understand time?
Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video summarization …
EIKA: Explicit & Implicit Knowledge-Augmented Network for entity-aware sports video captioning
Sports video captioning in real application scenarios requires both entities and specific
scenes. However, it is difficult to extract this fine-grained information solely from the video …
A simple yet effective knowledge guided method for entity-aware video captioning on a basketball benchmark
Despite the recent emergence of video captioning models, how to generate the text
description with specific entity names and fine-grained actions is far from being solved …