MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback

X Wang, Z Wang, J Liu, Y Chen, L Yuan… - arXiv preprint arXiv …, 2023 - arxiv.org
To solve complex tasks, large language models (LLMs) often require multiple rounds of
interactions with the user, sometimes assisted by external tools. However, current evaluation …

Agent-as-a-judge: Evaluate agents with agents

M Zhuge, C Zhao, D Ashley, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Contemporary evaluation techniques are inadequate for agentic systems. These
approaches either focus exclusively on final outcomes--ignoring the step-by-step nature of …

EHRAgent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records

W Shi, R Xu, Y Zhuang, Y Yu, J Zhang… - Proceedings of the …, 2024 - aclanthology.org
Clinicians often rely on data engineers to retrieve complex patient information from
electronic health record (EHR) systems, a process that is both inefficient and time …

Advancing LLM reasoning generalists with preference trees

L Yuan, G Cui, H Wang, N Ding, X Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art …

Generative AI Agents for Knowledge Work Augmentation in Finance

S Ganesh, L Ardon, D Borrajo, D Garg… - Annual Review of …, 2024 - annualreviews.org
The development of software agents that can autonomously take actions to achieve goals
has been a long-standing foundational objective in the field of AI. Recent advances in …

Learning to use tools via cooperative and interactive agents

Z Shi, S Gao, X Chen, Y Feng, L Yan, H Shi… - arXiv preprint arXiv …, 2024 - arxiv.org
Tool learning empowers large language models (LLMs) as agents to use external tools and
extend their utility. Existing methods employ one single LLM-based agent to iteratively select …

CodexGraph: Bridging large language models and code repositories via code graph databases

X Liu, B Lan, Z Hu, Y Liu, Z Zhang, F Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) excel in stand-alone code tasks like HumanEval and
MBPP, but struggle with handling entire code repositories. This challenge has prompted …

A Single Transformer for Scalable Vision-Language Modeling

Y Chen, X Wang, H Peng, H Ji - arXiv preprint arXiv:2407.06438, 2024 - arxiv.org
We present SOLO, a single transformer for Scalable visiOn-Language mOdeling. Current
large vision-language models (LVLMs) such as LLaVA mostly employ heterogeneous …

WaitGPT: Monitoring and steering conversational LLM agent in data analysis with on-the-fly code visualization

L Xie, C Zheng, H Xia, H Qu, C Zhu-Tian - Proceedings of the 37th …, 2024 - dl.acm.org
Large language models (LLMs) support data analysis through conversational user
interfaces, as exemplified in OpenAI's ChatGPT (formally known as Advanced Data Analysis …

ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities

J Lu, T Holleis, Y Zhang, B Aumayer, F Nan… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent large language models (LLMs) advancements sparked a growing research interest
in tool assisted LLMs solving real-world challenges, which calls for comprehensive …