The Llama 3 herd of models

A Dubey, A Jauhri, A Pandey, A Kadian… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern artificial intelligence (AI) systems are powered by foundation models. This paper
presents a new set of foundation models, called Llama 3. It is a herd of language models …

Benchmark evaluations, applications, and challenges of large vision language models: A survey

Z Li, X Wu, H Du, H Nghiem, G Shi - arXiv preprint arXiv:2501.02189, 2025 - arxiv.org
Multimodal Vision Language Models (VLMs) have emerged as a transformative technology
at the intersection of computer vision and natural language processing, enabling machines …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

UniBench: Visual reasoning requires rethinking vision-language beyond scaling

H Al-Tahan, Q Garrido, R Balestriero… - arXiv preprint arXiv …, 2024 - arxiv.org
Significant research efforts have been made to scale and improve vision-language model
(VLM) training approaches. Yet, with an ever-growing number of benchmarks, researchers …

AI safety in generative AI large language models: A survey

J Chua, Y Li, S Yang, C Wang, L Yao - arXiv preprint arXiv:2407.18369, 2024 - arxiv.org
Large Language Models (LLMs) such as ChatGPT that exhibit generative AI capabilities are
facing accelerated adoption and innovation. The increased presence of Generative AI (GAI) …

Vision language model for interpretable and fine-grained detection of safety compliance in diverse workplaces

Z Chen, H Chen, M Imani, R Chen, F Imani - Expert Systems with …, 2025 - Elsevier
Workplace accidents due to personal protective equipment (PPE) non-compliance raise
serious safety concerns and lead to legal liabilities, financial penalties, and reputational …

Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?

A Wüst, T Tobiasch, L Helff, DS Dhami… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, newly developed Vision-Language Models (VLMs), such as OpenAI's GPT-4o,
have emerged, seemingly demonstrating advanced reasoning capabilities across text and …

FOCUS -- Federated Finetuning of Vision-Language Foundation Models with Optimal Client Layer Updating Strategy via Multi-objective Meta-Heuristics

P Saha, F Wagner, D Mishra, C Peng, A Thakur… - arXiv preprint arXiv …, 2024 - arxiv.org
Effective training of large Vision-Language Models (VLMs) on resource-constrained client
devices in Federated Learning (FL) requires the usage of parameter-efficient fine-tuning …

Evaluation and comparison of visual language models for transportation engineering problems

S Prajapati, T Singh, C Hegde… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent developments in vision language models (VLM) have shown great potential for
diverse applications related to image understanding. In this study, we have explored state-of …

OmnixR: Evaluating omni-modality language models on reasoning across modalities

L Chen, H Hu, M Zhang, Y Chen, Z Wang, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality
Language Models, such as GPT-4o and Gemini. Evaluating OLMs, which integrate multiple …