Comparing incomplete database instances is crucial in various applications, including dataset evolution, evaluating data cleaning solutions, and comparing instances from data exchange systems. We present a framework designed to compute similarity among instances containing labeled null values, even in the absence of primary keys. A notable outcome of our approach is the generation of a mapping between tuples in the instances, which explains the similarity score. Computing the similarity of two incomplete instances is NP-hard in the instance size. To compare instances of realistic size we present an approximate PTIME algorithm that approximates the exact score with an error always smaller than 1% but it significantly speedup the computation up to three orders of magnitude w.r.t. the exact algorithm.
Comparing Incomplete Database Instances
Mecca G.;Santoro D.;Veltri E.
2024-01-01
Abstract
Comparing incomplete database instances is crucial in various applications, including dataset evolution, evaluating data cleaning solutions, and comparing instances from data exchange systems. We present a framework designed to compute similarity among instances containing labeled null values, even in the absence of primary keys. A notable outcome of our approach is the generation of a mapping between tuples in the instances, which explains the similarity score. Computing the similarity of two incomplete instances is NP-hard in the instance size. To compare instances of realistic size we present an approximate PTIME algorithm that approximates the exact score with an error always smaller than 1% but it significantly speedup the computation up to three orders of magnitude w.r.t. the exact algorithm.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.