Comparison of text preprocessing methods

CP Chai - Natural Language Engineering, 2023 - cambridge.org
Text preprocessing is not only an essential step to prepare the corpus for modeling but also
a key area that directly affects the natural language processing (NLP) application results. For …

Outline, then details: Syntactically guided coarse-to-fine code generation

W Zheng, SP Sharan, AK Jaiswal… - International …, 2023 - proceedings.mlr.press
For a complicated algorithm, its implementation by a human programmer usually starts with
outlining a rough control flow followed by iterative enrichments, eventually yielding carefully …

Natural software revisited

M Rahman, D Palani, PC Rigby - 2019 IEEE/ACM 41st …, 2019 - ieeexplore.ieee.org
Recent works have concluded that software code is more repetitive and predictable, i.e., more
natural, than English texts. On re-examination, we find that much of the apparent" …

Labeling hacker exploits for proactive cyber threat intelligence: A deep transfer learning approach

B Ampel, S Samtani, H Zhu, S Ullman… - … on intelligence and …, 2020 - ieeexplore.ieee.org
With the rapid development of new technologies, vulnerabilities are at an all-time high.
Companies are investing in developing Cyber Threat Intelligence (CTI) to counteract these …

CodeBERT-nt: Code naturalness via CodeBERT

A Khanfir, M Jimenez, M Papadakis… - 2022 IEEE 22nd …, 2022 - ieeexplore.ieee.org
Much of recent software-engineering research has investigated the naturalness of code, the
fact that code, in small code snippets, is repetitive and can be predicted using statistical …

Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models

MT Alrefaie, NE Morsy, N Samir - arXiv preprint arXiv:2403.11130, 2024 - arxiv.org
This paper presents a comprehensive examination of the impact of tokenization strategies
and vocabulary sizes on the performance of Arabic language models in downstream natural …

Hoax detection system on Twitter using feed-forward and back-propagation neural networks classification method

CW Kencana, EB Setiawan, I Kurniawan - Jurnal RESTI (Rekayasa …, 2020 - jurnal.iaii.or.id
Social media is one of the ways to connect every individual in the world. It is also used by
irresponsible people to spread hoaxes. A hoax is false news that is made to appear true. It may …

Are mutants really natural? A study on how "naturalness" helps mutant selection

M Jimenez, TT Checkam, M Cordy… - Proceedings of the 12th …, 2018 - dl.acm.org
Background: Code is repetitive and predictable in a way that is similar to natural
language. This means that code is "natural" and this "naturalness" can be captured by …

Enabling the continuous analysis of security vulnerabilities with VulData7

M Jimenez, Y Le Traon, M Papadakis - 18th IEEE International Working …, 2018 - orbilu.uni.lu
Studies on security vulnerabilities require the analysis, investigation and comprehension of
real vulnerable code instances. However, collecting and experimenting with a sufficient …

Exploring the Landscape of Programming Language Identification with Machine Learning Approaches

A Verma, R Saha, G Kumar, A Brighente, M Conti… - IEEE …, 2025 - ieeexplore.ieee.org
The increasing complexity of modern software development necessitates tools and
methodologies for code analysis, maintenance, and migration in multi-language Integrated …