MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback
To solve complex tasks, large language models (LLMs) often require multiple rounds of
interactions with the user, sometimes assisted by external tools. However, current evaluation …
Agent-as-a-judge: Evaluate agents with agents
Contemporary evaluation techniques are inadequate for agentic systems. These
approaches either focus exclusively on final outcomes--ignoring the step-by-step nature of …
EHRAgent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records
Clinicians often rely on data engineers to retrieve complex patient information from
electronic health record (EHR) systems, a process that is both inefficient and time …
Advancing LLM reasoning generalists with preference trees
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art …
Generative AI Agents for Knowledge Work Augmentation in Finance
The development of software agents that can autonomously take actions to achieve goals
has been a long-standing foundational objective in the field of AI. Recent advances in …
Learning to use tools via cooperative and interactive agents
Tool learning empowers large language models (LLMs) as agents to use external tools and
extend their utility. Existing methods employ one single LLM-based agent to iteratively select …
CodexGraph: Bridging large language models and code repositories via code graph databases
Large Language Models (LLMs) excel in stand-alone code tasks like HumanEval and
MBPP, but struggle with handling entire code repositories. This challenge has prompted …
A Single Transformer for Scalable Vision-Language Modeling
We present SOLO, a single transformer for Scalable visiOn-Language mOdeling. Current
large vision-language models (LVLMs) such as LLaVA mostly employ heterogeneous …
WaitGPT: Monitoring and steering conversational LLM agent in data analysis with on-the-fly code visualization
Large language models (LLMs) support data analysis through conversational user
interfaces, as exemplified in OpenAI's ChatGPT (formally known as Advanced Data Analysis …
ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities
Recent large language models (LLMs) advancements sparked a growing research interest
in tool assisted LLMs solving real-world challenges, which calls for comprehensive …