Wikibench: Community-driven data curation for AI evaluation on Wikipedia

TS Kuo, AL Halfaker, Z Cheng, J Kim, MH Wu… - Proceedings of the CHI …, 2024 - dl.acm.org
AI tools are increasingly deployed in community contexts. However, datasets used to
evaluate AI are typically created by developers and annotators outside a given community …

"Yeah, this graph doesn't show that": Analysis of Online Engagement with Misleading Data Visualizations

M Lisnic, A Lex, M Kogan - Proceedings of the CHI Conference on …, 2024 - dl.acm.org
Attempting to make sense of a phenomenon or crisis, social media users often share data
visualizations and interpretations that can be erroneous or misleading. Prior work has …

What Constitutes a Faithful Summary? Preserving Author Perspectives in News Summarization

Y Liu, S Feng, X Han, V Balachandran, CY Park… - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we take a first step towards designing summarization systems that are faithful to
the author's opinions and perspectives. Focusing on a case study of preserving political …

DISCERN: Designing Decision Support Interfaces to Investigate the Complexities of Workplace Social Decision-Making With Line Managers

P Khadpe, L Le, K Nowak, ST Iqbal, J Suh - Proceedings of the CHI …, 2024 - dl.acm.org
Line managers form the first level of management in organizations, and must make complex
decisions, while maintaining relationships with those impacted by their decisions. Amidst …

WeAudit: Scaffolding User Auditors and AI Practitioners in Auditing Generative AI

WH Deng, C Wang, HZ Han, JI Hong, K Holstein… - arXiv preprint arXiv …, 2025 - arxiv.org
There has been growing interest from both practitioners and researchers in engaging end
users in AI auditing, to draw upon users' unique knowledge and lived experiences …

A Framework for Evaluating LLMs Under Task Indeterminacy

L Guerdan, H Wallach, S Barocas… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language model (LLM) evaluations often assume there is a single correct response--a
gold label--for each item in the evaluation corpus. However, some tasks can be ambiguous …

PolicyCraft: Supporting Collaborative and Participatory Policy Design through Case-Grounded Deliberation

TS Kuo, QZ Chen, AX Zhang, J Hsieh, H Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Community and organizational policies are typically designed in a top-down, centralized
fashion, with limited input from impacted stakeholders. This can result in policies that are …

Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game

P Samadarshi, M Mustafa, A Kulkarni… - arXiv preprint arXiv …, 2024 - arxiv.org
The New York Times Connections game has emerged as a popular and challenging pursuit
for word puzzle enthusiasts. We collect 200 Connections games to evaluate the …

Paper Copilot: The Artificial Intelligence and Machine Learning Community Should Adopt a More Transparent and Regulated Peer Review Process

J Yang - arXiv preprint arXiv:2502.00874, 2025 - arxiv.org
The rapid growth of submissions to top-tier Artificial Intelligence (AI) and Machine Learning
(ML) conferences has prompted many venues to transition from closed to open review …

Automating Annotation Guideline Improvements using LLMs: A Case Study

A Bibal, N Gerlek, G Muric, E Boschee… - … of Context and …, 2025 - aclanthology.org
Annotating texts can be a tedious task, especially when texts are noisy. At the root of the
issue, guidelines are not always optimized enough to be able to perform the required …