WADaR: Joint wrapper and data repair

Buoncristiano M.;
2015-01-01

Abstract

Web scraping (or wrapping) is a popular means for acquiring data from the web. Recent advancements have made scalable wrapper generation possible and enabled data acquisition processes involving thousands of sources. This makes wrapper analysis and maintenance both necessary and challenging, as no scalable tools exist that support these tasks. We demonstrate WADaR, a scalable and highly automated tool for joint wrapper and data repair. WADaR uses off-the-shelf entity recognisers to locate target entities in wrapper-generated data. Markov chains are used to determine structural repairs, which are then encoded into suitable repairs for both the data and the corresponding wrappers. We show that WADaR is able to increase the quality of wrapper-generated relations by between 15% and 60%, and to fully repair the corresponding wrapper without any knowledge of the original website in more than 50% of the cases. © 2015 VLDB Endowment 2150-8097/15/08.
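
The abstract describes the pipeline only at a high level, so the following is a toy sketch of the general idea rather than WADaR's actual algorithm. It assumes regex patterns as stand-ins for off-the-shelf entity recognisers, a first-order Markov chain over the recognised entity types, and a greedy walk through that chain to pick the most likely attribute order. All names (`RECOGNISERS`, `annotate`, `transition_counts`, `most_likely_order`, `repair`) and the sample rows are hypothetical, and the sketch covers data repair only, not the re-encoding of repairs into wrappers.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-ins for off-the-shelf entity recognisers.
RECOGNISERS = {
    "PRICE": re.compile(r"£\d[\d,]*"),
    "BEDROOMS": re.compile(r"\d+ bed"),
    "CITY": re.compile(r"Oxford|London|Potenza"),
}

def annotate(text):
    """Return (entity_type, value) pairs in order of appearance in text."""
    spans = []
    for label, pattern in RECOGNISERS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(spans)]

def transition_counts(sequences):
    """Count first-order transitions between entity types, with START/END."""
    counts = defaultdict(Counter)
    for seq in sequences:
        path = ["START", *seq, "END"]
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    return counts

def most_likely_order(counts):
    """Greedily follow the most frequent unvisited transition from START."""
    order, state = [], "START"
    while state != "END":
        options = {s: n for s, n in counts[state].items() if s not in order}
        if not options:
            break
        state = max(options, key=options.get)
        if state != "END":
            order.append(state)
    return order

def repair(row, order):
    """Re-segment a mis-segmented row into the inferred attribute order."""
    found = dict(annotate(" ".join(row.values())))
    return {label: found.get(label) for label in order}

# Made-up wrapper output in which values leaked into the wrong columns.
rows = [
    {"price": "£250,000 Oxford", "city": "3 bed"},
    {"price": "£900,000 London", "city": "2 bed"},
    {"price": "£80,000", "city": "Potenza 4 bed"},
    {"price": "£120,000", "city": "1 bed"},  # city missing at the source
]

sequences = [[label for label, _ in annotate(" ".join(r.values()))] for r in rows]
order = most_likely_order(transition_counts(sequences))
repaired = [repair(r, order) for r in rows]
```

In this sketch the last row has no recognisable city, so its `CITY` attribute comes back as `None` instead of being filled with the bedroom count that the wrapper had placed in that column.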

Use this identifier to cite or link to this document: https://hdl.handle.net/11563/154066