Crowdsourcing database systems: Overview and challenges
Many data management and analytics tasks, such as entity resolution, cannot be solely
addressed by automated processes. Crowdsourcing is an effective way to harness the …
Data management for machine learning: A survey
Machine learning (ML) has widespread applications and has revolutionized many
industries, but suffers from several challenges. First, sufficient high-quality training data is …
Selective data acquisition in the wild for model charging
The lack of sufficient labeled data is a key bottleneck for practitioners in many real-world
supervised machine learning (ML) tasks. In this paper, we study a new problem, namely …
Domain adaptation for deep entity resolution
Entity resolution (ER) is a core problem of data integration. The state-of-the-art (SOTA)
results on ER are achieved by deep learning (DL) based methods, trained with a lot of …
Feature augmentation with reinforcement learning
Sufficient good features are indispensable to train well-performed machine learning models.
However, it is common that good features are not always enough, where feature …
Two-sided online micro-task assignment in spatial crowdsourcing
With the rapid development of smartphones, spatial crowdsourcing platforms are getting
popular. A foundational research of spatial crowdsourcing is to allocate micro-tasks to …
Human-in-the-loop outlier detection
Outlier detection is critical to a large number of applications from finance fraud detection to
health care. Although numerous approaches have been proposed to automatically detect …
Goodcore: Data-effective and data-efficient machine learning through coreset selection over incomplete data
Given a dataset with incomplete data (e.g., missing values), training a machine learning
model over the incomplete data requires two steps. First, it requires a data-effective step that …
DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing
Analyzing unstructured data, such as complex documents, has been a persistent challenge
in data processing. Large Language Models (LLMs) have shown promise in this regard …
Fluid: A blockchain based framework for crowdsourcing
Recently, crowdsourcing has emerged as a new computing paradigm to solve problems that
need human intrinsic, such as image annotation. However, there are two limitations in …