Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition

Z Xiong, Z Cai, J Cooper, A Ge, V Papageorgiou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable in-context learning (ICL)
capabilities. In this study, we explore a surprising phenomenon related to ICL: LLMs can …

I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

Z Mi, KC Wang, G Qian, H Ye, R Liu, S Tulyakov… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image
diffusion models with multimodal in-context understanding and reasoning capabilities by …

VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning

Y Zong, O Bohdal, T Hospedales - arXiv preprint arXiv:2403.13164, 2024 - arxiv.org
Large language models (LLMs) famously exhibit emergent in-context learning (ICL)--the
ability to rapidly adapt to new tasks using few-shot examples provided as a prompt, without …

LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation

D Shenaj, O Bohdal, M Ozay, P Zanuttigh… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in image generation models have enabled personalized image
creation with both user-defined subjects (content) and styles. Prior works achieved …

MemeSense: An Adaptive In-Context Framework for Social Commonsense Driven Meme Moderation

S Adak, S Banerjee, R Mandal, A Halder… - arXiv preprint arXiv …, 2025 - arxiv.org
Memes present unique moderation challenges due to their subtle, multimodal interplay of
images, text, and social context. Standard systems relying predominantly on explicit textual …