Transformers are SSMs: generalized models and efficient algorithms through structured state space duality
While Transformers have been the main architecture behind deep learning's success in
language modeling, state-space models (SSMs) such as Mamba have recently been shown …
Autoregressive model beats diffusion: Llama for scalable image generation
We introduce LlamaGen, a new family of image generation models that apply the original "next-token prediction" paradigm of large language models to the visual generation domain. It is an …
PaliGemma: a versatile 3B VLM for transfer
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …
VisionLLM v2: an end-to-end generalist multimodal large language model for hundreds of vision-language tasks
We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that
unifies visual perception, understanding, and generation within a single framework. Unlike …
Show-o: One single transformer to unify multimodal understanding and generation
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …
Emu3: Next-token prediction is all you need
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …
A survey of robot intelligence with large language models
Since the emergence of ChatGPT, research on large language models (LLMs) has actively
progressed across various fields. LLMs, pre-trained on vast text datasets, have exhibited …
LongVILA: scaling long-context visual language models for long videos
Long-context capability is critical for multi-modal foundation models, especially for long
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …
Vision language models are blind
Large language models (LLMs) with vision capabilities (e.g., GPT-4o, Gemini 1.5, and Claude
3) are powering countless image-text processing applications, enabling unprecedented …
Towards semantic equivalence of tokenization in multimodal LLM
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in
processing vision-language tasks. One crux of MLLMs lies in vision tokenization …