LLaMA-VID: An image is worth 2 tokens in large language models
In this work, we present a novel method to tackle the token generation challenge in Vision
Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current …
TimeChat: A time-sensitive multimodal large language model for long video understanding
This work proposes TimeChat, a time-sensitive multimodal large language model specifically
designed for long video understanding. Our model incorporates two key architectural …
MA-LMM: Memory-augmented large multimodal model for long-term video understanding
With the success of large language models (LLMs), integrating the vision model into LLMs to
build vision-language foundation models has gained much more interest recently. However …
Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis
In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs)
have emerged as a focal point in recent advancements. However, the predominant focus …
VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in Video-LLMs
In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-
LLMs) designed to enhance spatial-temporal modeling and audio understanding in video …
Large models for time series and spatio-temporal data: A survey and outlook
Temporal data, notably time series and spatio-temporal data, are prevalent in real-world
applications. They capture dynamic system measurements and are produced in vast …
Streaming dense video captioning
An ideal model for dense video captioning--predicting captions localized temporally in a
video--should be able to handle long input videos, predict rich, detailed textual descriptions …
InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …
PLLaVA: Parameter-free LLaVA extension from images to videos for video dense captioning
Vision-language pre-training has significantly elevated performance across a wide range of
image-language applications. Yet, the pre-training process for video-related tasks demands …
Streaming long video understanding with large language models
This paper presents VideoStreaming, an advanced vision-language large model (VLLM) for
video understanding, that capably understands arbitrary-length video with a constant …