Comparing Incomplete Database Instances

Glavic, B.; Mecca, G.; Miller, R. J.; Papotti, P.; Santoro, D.; Veltri, E.

Comparing incomplete database instances is crucial in various applications, including dataset evolution, evaluating data cleaning solutions, and comparing instances from data exchange systems. We present a framework designed to compute similarity among instances containing labeled null values, even in the absence of primary keys. A notable outcome of our approach is the generation of a mapping between tuples in the instances, which explains the similarity score. Computing the similarity of two incomplete instances is NP-hard in the instance size. To compare instances of realistic size we present an approximate PTIME algorithm that approximates the exact score with an error always smaller than 1% but it significantly speedup the computation up to three orders of magnitude w.r.t. the exact algorithm.

Comparing Incomplete Database Instances

Glavic B.;Mecca G.;Miller R. J.;Papotti P.;Santoro D.;Veltri E.

2024-01-01

Abstract

Comparing incomplete database instances is crucial in various applications, including dataset evolution, evaluating data cleaning solutions, and comparing instances from data exchange systems. We present a framework designed to compute similarity among instances containing labeled null values, even in the absence of primary keys. A notable outcome of our approach is the generation of a mapping between tuples in the instances, which explains the similarity score. Computing the similarity of two incomplete instances is NP-hard in the instance size. To compare instances of realistic size we present an approximate PTIME algorithm that approximates the exact score with an error always smaller than 1% but it significantly speedup the computation up to three orders of magnitude w.r.t. the exact algorithm.