We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.

Messing Up with BART: Error Generation for Evaluating Data-Cleaning Algorithms

MECCA, Giansalvatore;SANTORO, DONATELLO
2016-01-01

Abstract

We study the problem of introducing errors into clean databases for the purpose of benchmarking data-cleaning algorithms. Our goal is to provide users with the highest possible level of control over the error-generation process, and at the same time develop solutions that scale to large databases. We show in the paper that the error-generation problem is surprisingly challenging, and in fact, NP-complete. To provide a scalable solution, we develop a correct and efficient greedy algorithm that sacrifices completeness, but succeeds under very reasonable assumptions. To scale to millions of tuples, the algorithm relies on several non-trivial optimizations, including a new symmetry property of data quality constraints. The trade-off between control and scalability is the main technical contribution of the paper.
2016
File in questo prodotto:
File Dimensione Formato  
p191-arocena.pdf

solo utenti autorizzati

Descrizione: Versione camera-ready
Tipologia: Documento in Post-print
Licenza: DRM non definito
Dimensione 597.32 kB
Formato Adobe PDF
597.32 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11563/114775
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 63
  • ???jsp.display-item.citation.isi??? 49
social impact