RedHOT: A corpus of annotated medical questions, experiences, and claims on social media

S Wadhwa, V Khetan, S Amir, B Wallace - arXiv preprint arXiv:2210.06331, 2022 - arxiv.org
We present Reddit Health Online Talk (RedHOT), a corpus of 22,000 richly annotated social
media posts from Reddit spanning 24 health conditions. Annotations include demarcations …

CrowdSpeech and VoxDIY: Benchmark datasets for crowdsourced audio transcription

N Pavlichenko, I Stelmakh, D Ustalov - arXiv preprint arXiv:2107.01091, 2021 - arxiv.org
Domain-specific data is the crux of the successful transfer of machine learning systems from
benchmarks to real life. In simple problems such as image classification, crowdsourcing has …

SemEval-2023 task 8: Causal medical claim identification and related PIO frame extraction from social media posts

V Khetan, S Wadhwa, BC Wallace… - Proceedings of the 17th …, 2023 - aclanthology.org
Identification of medical claims from user-generated text data is an onerous but essential
step for various tasks including content moderation, and hypothesis generation. SemEval …

Resolving the human subjects status of machine learning's crowdworkers

D Kaushik, ZC Lipton, AJ London - arXiv preprint arXiv:2206.04039, 2022 - arxiv.org
In recent years, machine learning (ML) has relied heavily on crowdworkers both for building
datasets and for addressing research questions requiring human interaction or judgment …

Data labeling for machine learning engineers: project-based curriculum and data-centric competitions

A Zhdanovskaya, D Baidakova, D Ustalov - Proceedings of the AAAI …, 2023 - ojs.aaai.org
The process of training and evaluating machine learning (ML) models relies on high-quality
and timely annotated datasets. While a significant portion of academic and industrial …

Song Describer: a platform for collecting textual descriptions of music recordings

I Manco, B Weck, P Tovstogan, M Won… - ISMIR 2022 Hybrid …, 2022 - archives.ismir.net
We present Song Describer, an open-source data annotation platform for
crowdsourcing textual descriptions of music recordings. Through this tool, we propose to …

A Hard Nut to Crack: Idiom Detection with Conversational Large Language Models

FDL Fornaciari, B Altuna, I Gonzalez-Dios… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we explore idiomatic language processing with Large Language Models
(LLMs). We introduce the Idiomatic language Test Suite IdioTS, a new dataset of difficult …

REGROW: Reimagining global crowdsourcing for better human-AI collaboration

A Alorwu, S Savage, N van Berkel, D Ustalov… - CHI Conference on …, 2022 - dl.acm.org
Crowdworkers silently enable much of today's AI-based products, with several online
platforms offering a myriad of data labelling and content moderation tasks through …

Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop

E Artemova, A Tsvigun, D Schlechtweg… - arXiv preprint arXiv …, 2024 - arxiv.org
Training and deploying machine learning models relies on a large amount of human-
annotated data. As human labeling becomes increasingly expensive and time-consuming …

Robustifying NLP with Humans in the Loop

D Kaushik - 2022 - cs.cmu.edu
Despite machine learning (ML)'s many practical breakthroughs, formidable obstacles
obstruct its deployment in consequential applications. Modern ML models have repeatedly …