Parametric schema inference for massive JSON datasets

MA Baazizi, D Colazzo, G Ghelli, C Sartiani - The VLDB Journal, 2019 - Springer
In recent years, JSON established itself as a very popular data format for representing
massive data collections. JSON data collections are usually schemaless. While this ensures …

Parsing gigabytes of JSON per second

G Langdale, D Lemire - The VLDB Journal, 2019 - Springer
JavaScript Object Notation or JSON is a ubiquitous data exchange format on the
web. Ingesting JSON documents can become a performance bottleneck due to the sheer …

Filter before you parse: Faster analytics on raw data with Sparser

S Palkar, F Abuzaid, P Bailis, M Zaharia - Proceedings of the VLDB …, 2018 - dl.acm.org
Exploratory big data applications often run on raw unstructured or semi-structured data
formats, such as JSON files or text logs. These applications can spend 80-90% of their …

A case study of Processing-in-Memory in off-the-Shelf systems

J Nider, C Mustard, A Zoltan, J Ramsden, L Liu… - 2021 USENIX Annual …, 2021 - usenix.org
We evaluate a new processing-in-memory (PIM) architecture from UPMEM that was built
and deployed in an off-the-shelf server. Systems designed to perform computing in or near …

JSON tiles: Fast analytics on semi-structured data

D Durner, V Leis, T Neumann - … of the 2021 International Conference on …, 2021 - dl.acm.org
Developers often prefer flexibility over upfront schema design, making semi-structured data
formats such as JSON increasingly popular. Large amounts of JSON data are therefore …

Jumpgate: In-Network Processing as a Service for Data Analytics

C Mustard, F Ruffy, A Gakhokidze… - 11th USENIX Workshop …, 2019 - usenix.org
In-network processing, where data is processed by special-purpose devices as it passes
over the network, is showing great promise at improving application performance, in …

Using selective memoization to defeat regular expression denial of service (ReDoS)

JC Davis, F Servant, D Lee - 2021 IEEE symposium on security …, 2021 - ieeexplore.ieee.org
Regular expressions (regexes) are a denial of service vector in most mainstream
programming languages. Recent empirical work has demonstrated that up to 10% of …

Speculative distributed CSV data parsing for big data analytics

C Ge, Y Li, E Eilebrecht, B Chandramouli… - Proceedings of the …, 2019 - dl.acm.org
There has been a recent flurry of interest in providing query capability on raw data in today's
big data systems. These raw data must be parsed before processing or use in analytics …

Predicate pushdown for data science pipelines

C Yan, Y Lin, Y He - Proceedings of the ACM on Management of Data, 2023 - dl.acm.org
Predicate pushdown is a widely adopted query optimization. Existing systems and prior work
mostly use pattern-matching rules to decide when a predicate can be pushed through …

A survey of JSON-compatible binary serialization specifications

JC Viotti, M Kinderkhedia - arXiv preprint arXiv:2201.02089, 2022 - arxiv.org
In this paper, we present the recent advances that highlight the characteristics of JSON-
compatible binary serialization specifications. We motivate the discussion by covering the …