On tuning parameters guiding similarity computations in a data deduplication pipeline for customers records: Experience from a R&D project

W Andrzejewski, B Bębel, P Boiński, R Wrembel - Information Systems, 2024 - Elsevier
Data stored in information systems are often erroneous. Duplicate data are one of the typical
error types. To discover and handle duplicates, the so-called deduplication methods are …

Data integration revitalized: From data warehouse through data lake to data mesh

R Wrembel - International Conference on Database and Expert …, 2023 - Springer
For years, data integration (DI) architectures evolved from those supporting virtual
integration, through physical integration, to those supporting both virtual and physical …

On Tuning the Sorted Neighborhood Method for Record Comparisons in a Data Deduplication Pipeline: Industrial Experience Report

P Boiński, W Andrzejewski, B Bębel… - … Conference on Database …, 2023 - Springer
Assuring high quality of data stored in information systems (ISs) is challenging and it is one
of the concerns of companies. Typically, data stored in ISs are not free from errors, which …

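Improving Data Deduplication through Text Similarity in Machine Learning: A Comprehensive Approach (Meningkatkan Deduplikasi Data melalui Kesamaan Teks dalam Pembelajaran Mesin: Pendekatan Komprehensif)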

A Handijono, Z Suhatman - AKADEMIK: Jurnal Mahasiswa Humanis, 2024 - ojs.pseb.or.id
The issue of dirty data, particularly duplicate data, is a common problem in data
management that can affect data quality, operational efficiency, and decision-making. This …

On Customer Data Deduplication-Research vs. Industrial Perspective: Lessons Learned from a R&D Project in the Financial Sector

W Andrzejewski, B Bębel, P Boiński… - European Conference on …, 2024 - Springer
In this tutorial we present the results of researching, designing, implementing, and deploying
data deduplication pipelines for customer records in a big financial institution. The tutorial is …

Statistical Modeling vs. Machine Learning for Deduplication of Customer Records (industrial paper)

W Andrzejewski, B Bębel, P Boiński, J Kowalewska… - 2024 - ceur-ws.org
Large companies typically face a problem of multiple database records describing the same
physical object (aka duplicates). There are multiple sources of duplicates, e.g., using multiple …

On Customer Data Deduplication-Research vs. Industrial Perspective: Lessons Learned from

R Wrembel - New Trends in Database and Information Systems - Springer
In this tutorial we present the results of researching, designing, implementing, and deploying
data deduplication pipelines for customer records in a big financial institution. The tutorial is …