Bridging the Data Provenance Gap Across Text, Speech and Video

S Longpre, N Singh, M Cherep, K Tiwary… - arXiv preprint arXiv …, 2024 - arxiv.org
Progress in AI is driven largely by the scale and quality of training data. Despite this, there is
a deficit of empirical analysis examining the attributes of well-established datasets beyond …

Large Multimodal Models for Low-Resource Languages: A Survey

M Lupascu, AC Rogoz, MS Stupariu… - arXiv preprint arXiv …, 2025 - arxiv.org
In this survey, we systematically analyze techniques used to adapt large multimodal models
(LMMs) for low-resource (LR) languages, examining approaches ranging from visual …

SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia

C Liu, W Zhang, J Ying, M Aljunied, AT Luu… - arXiv preprint arXiv …, 2025 - arxiv.org
This study introduces two novel benchmarks, SeaExam and SeaBench, designed to
evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) …