Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers

L Wang, X Chen, J Zhao, K He - Advances in Neural …, 2025 - proceedings.neurips.cc
One of the roadblocks for training generalist robotic models today is heterogeneity. Previous
robot learning methods often collect data to train with one specific embodiment for one task …

Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Z Gao, Z Chen, E Cui, Y Ren, W Wang, J Zhu, H Tian… - Visual Intelligence, 2024 - Springer
Multi-modal large language models (MLLMs) have demonstrated impressive performance in
vision-language tasks across a wide range of domains. However, the large model scale and …

Enhancing the reasoning ability of multimodal large language models via mixed preference optimization

W Wang, Z Chen, W Wang, Y Cao, Y Liu, Z Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing open-source multimodal large language models (MLLMs) generally follow a
training process involving pre-training and supervised fine-tuning. However, these models …

Aria: An open multimodal native mixture-of-experts model

D Li, Y Liu, H Wu, Y Wang, Z Shen, B Qu, X Niu… - arXiv preprint arXiv …, 2024 - arxiv.org
Information comes in diverse modalities. Multimodal native AI models are essential to
integrate real-world information and deliver comprehensive understanding. While …

Your mixture-of-experts LLM is secretly an embedding model for free

Z Li, T Zhou - arXiv preprint arXiv:2410.10814, 2024 - arxiv.org
While large language models (LLMs) excel on generation tasks, their decoder-only
architecture often limits their potential as embedding models if no further representation …

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

W Liang, L Yu, L Luo, S Iyer, N Dong, C Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The development of large language models (LLMs) has expanded to multi-modal systems
capable of processing text, images, and speech within a unified framework. Training these …

LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models

NV Nguyen, TT Doan, L Tran, V Nguyen… - arXiv preprint arXiv …, 2024 - arxiv.org
Mixture of Experts (MoE) plays an important role in the development of more efficient and
effective large language models (LLMs). Due to the enormous resource requirements …

A Survey of Embodied AI in Healthcare: Techniques, Applications, and Opportunities

Y Liu, X Cao, T Chen, Y Jiang, J You, M Wu… - arXiv preprint arXiv …, 2025 - arxiv.org
Healthcare systems worldwide face persistent challenges in efficiency, accessibility, and
personalization. Powered by modern AI technologies such as multimodal large language …

LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation

W Shi, X Han, C Zhou, W Liang, XV Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
We present LlamaFusion, a framework for empowering pretrained text-only large language
models (LLMs) with multimodal generative capabilities, enabling them to understand and …

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

H Diao, X Li, Y Cui, Y Wang, H Deng, T Pan… - arXiv preprint arXiv …, 2025 - arxiv.org
Existing encoder-free vision-language models (VLMs) are rapidly narrowing the
performance gap with their encoder-based counterparts, highlighting the promising potential …