Chain of Thoughtlessness? An Analysis of CoT in Planning

K Stechly, K Valmeekam… - Advances in Neural …, 2025 - proceedings.neurips.cc
Large language model (LLM) performance on reasoning problems typically does not
generalize out of distribution. Previous work has claimed that this can be mitigated with …

Eureka: Evaluating and understanding large foundation models

V Balachandran, J Chen, N Joshi, B Nushi… - arXiv preprint arXiv …, 2024 - arxiv.org
Rigorous and reproducible evaluation is critical for assessing the state of the art and for
guiding scientific advances in Artificial Intelligence. Evaluation is challenging in practice due …

“I Want It That Way”: Enabling Interactive Decision Support Using Large Language Models and Constraint Programming

C Lawless, J Schoeffer, L Le, K Rowan, S Sen… - ACM Transactions on …, 2024 - dl.acm.org
A critical factor in the success of many decision support systems is the accurate modeling of
user preferences. Psychology research has demonstrated that users often develop their …

BENCHAGENTS: Automated Benchmark Creation with Agent Interaction

N Butt, V Chandrasekaran, N Joshi, B Nushi… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluations are limited by benchmark availability. As models evolve, there is a need to
create benchmarks that can measure progress on new generative capabilities. However …

From instructions to constraints: Language model alignment with automatic constraint verification

F Wang, C Shang, S Jain, S Wang, Q Ning… - arXiv preprint arXiv …, 2024 - arxiv.org
User alignment is crucial for adapting general-purpose language models (LMs) to
downstream tasks, but human annotations are often not available for all types of instructions …

Recursive Decomposition of Logical Thoughts: Framework for Superior Reasoning and Knowledge Propagation in Large Language Models

KU Qasim, J Zhang, T Alsahfi, AUR Butt - arXiv preprint arXiv:2501.02026, 2025 - arxiv.org
Enhancing the reasoning capabilities of Large Language Models remains a critical
challenge in artificial intelligence. We introduce RDoLT, Recursive Decomposition of Logical …

The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests

L Madmoni, A Zait, I Labzovsky, D Karmon - arXiv preprint arXiv …, 2024 - arxiv.org
Generative AI agents are often expected to respond to complex user requests that have No
One Right Answer (NORA), e.g., "design a vegetarian meal plan below 1800 calories". Such …

[HTML] Aligning to constraints for data-efficient language model customization

F Wang, C Shang, S Wang, S Jain, Q Ning, B Min… - 2025 - amazon.science
General-purpose language models (LMs) are aligned to diverse user intents, but fall short
when it comes to specific applications. While finetuning is the default method for customized …

[BOOK][B] Towards Trustworthy Machine Learning: An Integer Programming Approach

CA Lawless - 2024 - search.proquest.com
Despite the proliferation of machine learning (ML) in a multitude of applications, current
black-box models, such as deep learning, remain hard to understand, critique, and judge by …