- Academic Search

R Gruetzemacher, D Paradice - ACM Computing Surveys (CSUR), 2022 - dl.acm.org

AI is widely thought to be poised to transform business, yet current perceptions of the scope
of this transformation may be myopic. Recent progress in natural language processing …

Cited by 58 · All 7 versions

[PDF] arxiv.org

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

L Huang, W Yu, W Ma, W Zhong, Z Feng… - ACM Transactions on …, 2025 - dl.acm.org

The emergence of large language models (LLMs) has marked a significant breakthrough in
natural language processing (NLP), fueling a paradigm shift in information acquisition …

Cited by 951 · All 4 versions

[PDF] arxiv.org

Deduplicating training data makes language models better

K Lee, D Ippolito, A Nystrom, C Zhang, D Eck… - arxiv preprint arxiv …, 2021 - arxiv.org

We find that existing language modeling datasets contain many near-duplicate examples
and long repetitive substrings. As a result, over 1% of the unprompted output of language …

Cited by 610 · All 7 versions

[PDF] acm.org

Dothash: estimating set similarity metrics for link prediction and document deduplication

I Nunes, M Heddes, P Vergés, D Abraham… - Proceedings of the 29th …, 2023 - dl.acm.org

Metrics for set similarity are a core aspect of several data mining tasks. To remove duplicate
results in a Web search, for example, a common approach looks at the Jaccard index …

Cited by 9 · All 8 versions

[PDF] nber.org

Noise-robust de-duplication at scale

E Silcock, L D'Amico-Wong, J Yang, M Dell - 2022 - nber.org

Identifying near duplicates within large, noisy text corpora has a myriad of applications that
range from de-duplicating training datasets, reducing privacy risk, and evaluating test set …

Cited by 15 · All 13 versions

Connected Components for Scaling Partial-order Blocking to Billion Entities

T Backes, S Dietze - ACM Journal of Data and Information Quality, 2024 - dl.acm.org

In entity resolution, blocking pre-partitions data for further processing by more expensive
methods. Two entity mentions are in the same block if they share identical or related …

Cited by 1

[PDF] arxiv.org

Privacy-preserving record linkage using local sensitive hash and private set intersection

A Adir, E Aharoni, N Drucker, E Kushnir… - … Conference on Applied …, 2022 - Springer

The amount of data stored in data repositories increases every year. This makes it
challenging to link records between different datasets across companies and even …

Cited by 9 · All 8 versions

[PDF] beei.org

Proposed threshold-based and rule-based approaches to detecting duplicates in bibliographic database

MM Amin, D Stiawan, E Ermatita, R Budiarto - Bulletin of Electrical …, 2024 - beei.org

Bibliographic databases are used to measure the performance of researchers, universities
and research institutions. Thus, high data quality is required and data duplication is avoided …

Cited by 1 · All 4 versions

[PDF] upenn.edu

Understanding the limitations of using large language models for text generation

D Ippolito - 2023 - search.proquest.com

State-of-the-art neural language models are capable of generating incredibly fluent English
text. This success provides opportunities for novel forms of interaction, where human writers …

Cited by 4 · All 6 versions

[PDF] arxiv.org

Privacy-preserving Fuzzy Name Matching for Sharing Financial Intelligence

H Kasyap, UI Atmaca, C Maple, G Cormode… - arxiv preprint arxiv …, 2024 - arxiv.org

Financial institutions rely on data for many operations, including a need to drive efficiency,
enhance services and prevent financial crime. Data sharing across an organisation or …

Deduplication of scholarly documents using locality sensitive hashing and word embeddings

Deep transfer learning & beyond: Transformer language models in information systems research
