In the recent years JSON affirmed as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this ensures sev- eral advantages, the absence of schema information has im- portant negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly figure out the structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible. In this paper we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give com- plete structural information about input data. We then present our main contribution, which is the design of a schema inference algorithm, its theoretical study, and its implemen- tation based on Spark, enabling reasonable schema infer- ence time for massive collections. Finally, we report about an experimental analysis showing the effectiveness of our ap- proach in terms of execution time, precision, and conciseness of inferred schemas, and scalability.

Schema Inference for Massive JSON Datasets

SARTIANI, CARLO
2017-01-01

Abstract

In the recent years JSON affirmed as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this ensures sev- eral advantages, the absence of schema information has im- portant negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly figure out the structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible. In this paper we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give com- plete structural information about input data. We then present our main contribution, which is the design of a schema inference algorithm, its theoretical study, and its implemen- tation based on Spark, enabling reasonable schema infer- ence time for massive collections. Finally, we report about an experimental analysis showing the effectiveness of our ap- proach in terms of execution time, precision, and conciseness of inferred schemas, and scalability.
2017
978-3-89318-073-8
File in questo prodotto:
File Dimensione Formato  
paper-62.pdf

accesso aperto

Tipologia: Pdf editoriale
Licenza: Creative commons
Dimensione 1.45 MB
Formato Adobe PDF
1.45 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11563/125992
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 63
  • ???jsp.display-item.citation.isi??? ND
social impact