A survey of multimodal large language model from a data-centric perspective
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …
Contrastive localized language-image pre-training
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training
vision encoders to generate image/text representations facilitating various applications …
Revisit large-scale image-caption data in pre-training multimodal foundation models
Recent advancements in multimodal models highlight the value of rewritten captions for
improving performance, yet key challenges remain. For example, while synthetic captions …
Scaling inference-time search with vision value model for improved visual comprehension
Despite significant advancements in vision-language models (VLMs), effective approaches
to enhance response quality by scaling inference-time computation are lacking. This …
Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training
In the rapidly evolving field of vision-language models (VLMs), contrastive language-image
pre-training (CLIP) has made significant strides, becoming a foundation for various downstream …
Detect, describe, discriminate: Moving beyond vqa for mllm evaluation
Visual Question Answering (VQA) with multiple choice questions enables a vision-centric
evaluation of Multimodal Large Language Models (MLLMs). Although it reliably checks the …
Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding
Recently, video-language understanding has achieved great success through large-scale
pre-training. However, data scarcity remains a prevailing challenge. This study quantitatively …
CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions
Previous works show that noisy, web-crawled image-text pairs may limit vision-language
pretraining like CLIP and propose learning with synthetic captions as a promising …
STIV: Scalable Text and Image Conditioned Video Generation
The field of video generation has made remarkable advancements, yet there remains a
pressing need for a clear, systematic recipe that can guide the development of robust and …
TIPS: Text-Image Pretraining with Spatial Awareness
While image-text representation learning has become very popular in recent years, existing
models tend to lack spatial awareness and have limited direct applicability for dense …