Wikibench: Community-driven data curation for AI evaluation on Wikipedia
AI tools are increasingly deployed in community contexts. However, datasets used to
evaluate AI are typically created by developers and annotators outside a given community …
"Yeah, this graph doesn't show that": Analysis of Online Engagement with Misleading Data Visualizations
Attempting to make sense of a phenomenon or crisis, social media users often share data
visualizations and interpretations that can be erroneous or misleading. Prior work has …
What Constitutes a Faithful Summary? Preserving Author Perspectives in News Summarization
In this work, we take a first step towards designing summarization systems that are faithful to
the author's opinions and perspectives. Focusing on a case study of preserving political …
DISCERN: Designing Decision Support Interfaces to Investigate the Complexities of Workplace Social Decision-Making With Line Managers
Line managers form the first level of management in organizations, and must make complex
decisions, while maintaining relationships with those impacted by their decisions. Amidst …
WeAudit: Scaffolding User Auditors and AI Practitioners in Auditing Generative AI
There has been growing interest from both practitioners and researchers in engaging end
users in AI auditing, to draw upon users' unique knowledge and lived experiences …
A Framework for Evaluating LLMs Under Task Indeterminacy
Large language model (LLM) evaluations often assume there is a single correct response--a
gold label--for each item in the evaluation corpus. However, some tasks can be ambiguous …
PolicyCraft: Supporting Collaborative and Participatory Policy Design through Case-Grounded Deliberation
Community and organizational policies are typically designed in a top-down, centralized
fashion, with limited input from impacted stakeholders. This can result in policies that are …
Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
P Samadarshi, M Mustafa, A Kulkarni… - arXiv preprint arXiv …, 2024 - arxiv.org
The New York Times Connections game has emerged as a popular and challenging pursuit
for word puzzle enthusiasts. We collect 200 Connections games to evaluate the …
Paper Copilot: The Artificial Intelligence and Machine Learning Community Should Adopt a More Transparent and Regulated Peer Review Process
J Yang - arXiv preprint arXiv:2502.00874, 2025 - arxiv.org
The rapid growth of submissions to top-tier Artificial Intelligence (AI) and Machine Learning
(ML) conferences has prompted many venues to transition from closed to open review …
Automating Annotation Guideline Improvements using LLMs: A Case Study
Annotating texts can be a tedious task, especially when texts are noisy. At the root of the
issue, guidelines are not always optimized enough to be able to perform the required …