Deep transfer learning & beyond: Transformer language models in information systems research

R Gruetzemacher, D Paradice - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
AI is widely thought to be poised to transform business, yet current perceptions of the scope
of this transformation may be myopic. Recent progress in natural language processing …

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

L Huang, W Yu, W Ma, W Zhong, Z Feng… - ACM Transactions on …, 2025 - dl.acm.org
The emergence of large language models (LLMs) has marked a significant breakthrough in
natural language processing (NLP), fueling a paradigm shift in information acquisition …

Deduplicating training data makes language models better

K Lee, D Ippolito, A Nystrom, C Zhang, D Eck… - arXiv preprint arXiv …, 2021 - arxiv.org
We find that existing language modeling datasets contain many near-duplicate examples
and long repetitive substrings. As a result, over 1% of the unprompted output of language …

DotHash: estimating set similarity metrics for link prediction and document deduplication

I Nunes, M Heddes, P Vergés, D Abraham… - Proceedings of the 29th …, 2023 - dl.acm.org
Metrics for set similarity are a core aspect of several data mining tasks. To remove duplicate
results in a Web search, for example, a common approach looks at the Jaccard index …

Noise-robust de-duplication at scale

E Silcock, L D'Amico-Wong, J Yang, M Dell - 2022 - nber.org
Identifying near duplicates within large, noisy text corpora has a myriad of applications that
range from de-duplicating training datasets, reducing privacy risk, and evaluating test set …

Connected Components for Scaling Partial-order Blocking to Billion Entities

T Backes, S Dietze - ACM Journal of Data and Information Quality, 2024 - dl.acm.org
In entity resolution, blocking pre-partitions data for further processing by more expensive
methods. Two entity mentions are in the same block if they share identical or related …

Privacy-preserving record linkage using local sensitive hash and private set intersection

A Adir, E Aharoni, N Drucker, E Kushnir… - … Conference on Applied …, 2022 - Springer
The amount of data stored in data repositories increases every year. This makes it
challenging to link records between different datasets across companies and even …

Proposed threshold-based and rule-based approaches to detecting duplicates in bibliographic database

MM Amin, D Stiawan, E Ermatita, R Budiarto - Bulletin of Electrical …, 2024 - beei.org
Bibliographic databases are used to measure the performance of researchers, universities
and research institutions. Thus, high data quality is required and data duplication is avoided …

Understanding the limitations of using large language models for text generation

D Ippolito - 2023 - search.proquest.com
State-of-the-art neural language models are capable of generating incredibly fluent English
text. This success provides opportunities for novel forms of interaction, where human writers …

Privacy-preserving Fuzzy Name Matching for Sharing Financial Intelligence

H Kasyap, UI Atmaca, C Maple, G Cormode… - arXiv preprint arXiv …, 2024 - arxiv.org
Financial institutions rely on data for many operations, including a need to drive efficiency,
enhance services and prevent financial crime. Data sharing across an organisation or …