Google Scholar

The rise and potential of large language model based agents: A survey

Z Xi, W Chen, X Guo, W He, Y Ding, B Hong… - Science China …, 2025 - Springer

For a long time, researchers have sought artificial intelligence (AI) that matches or exceeds
human intelligence. AI agents, which are artificial entities capable of sensing the …

Cited by 732

[PDF] arxiv.org

A comprehensive overview of large language models

H Naveed, AU Khan, S Qiu, M Saqib, S Anwar… - arXiv preprint arXiv …, 2023 - arxiv.org

Large Language Models (LLMs) have recently demonstrated remarkable capabilities in
natural language processing tasks and beyond. This success of LLMs has led to a large …

Cited by 713

[PDF] arxiv.org

MMBench: Is your multi-modal model an all-around player?

Y Liu, H Duan, Y Zhang, B Li, S Zhang, W Zhao… - European conference on …, 2024 - Springer

Large vision-language models (VLMs) have recently achieved remarkable progress,
exhibiting impressive multimodal perception and reasoning abilities. However, effectively …

Cited by 726

[PDF] arxiv.org

LLaVA-OneVision: Easy visual task transfer

B Li, Y Zhang, D Guo, R Zhang, F Li, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org

We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …

Cited by 250

InternVideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Cited by 118

[PDF] arxiv.org

LLaMA-VID: An image is worth 2 tokens in large language models

Y Li, C Wang, J Jia - European Conference on Computer Vision, 2024 - Springer

In this work, we present a novel method to tackle the token generation challenge in Vision
Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current …

Cited by 205

[PDF] arxiv.org

Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs

S Tong, E Brown, P Wu, S Woo, M Middepogu… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-
centric approach. While stronger language models can enhance multimodal capabilities, the …

Cited by 172

[PDF] nature.com

Evolutionary optimization of model merging recipes

T Akiba, M Shing, Y Tang, Q Sun, D Ha - Nature Machine Intelligence, 2025 - nature.com

Large language models (LLMs) have become increasingly capable, but their development
often requires substantial computational resources. Although model merging has emerged …

Cited by 74

[PDF] acm.org

Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org

Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

Cited by 77

[PDF] arxiv.org

LLaVA-UHD: An LMM perceiving any aspect ratio and high-resolution images

Z Guo, R Xu, Y Yao, J Cui, Z Ni, C Ge, TS Chua… - … on Computer Vision, 2024 - Springer

Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding
the visual world. Conventional LMMs process images in fixed sizes and limited resolutions …

Cited by 93
