RoadRunner: Automatic Data Extraction from Data-Intensive Web Sites

MECCA, Giansalvatore
2002

Abstract

Data extraction from HTML pages is performed by software modules, usually called wrappers. Roughly speaking, a wrapper identifies and extracts relevant pieces of text inside a Web page and reorganizes them in a more structured format, for example an XML document. Several researchers have proposed solutions to ease the burden of writing wrappers, and the literature describes a number of systems that (semi-)automatically generate wrappers for HTML pages. We have recently investigated original approaches that aim to push the level of automation of the wrapper generation process further. Our main intuition is that, in a data-intensive Web site, pages can be classified into a small number of classes, such that pages belonging to the same class share a rather tight structure. Based on this observation, we have studied a novel technique, which we call the matching technique, that automatically generates a common wrapper by exploiting similarities and differences among pages of the same class. In addition, to deal with the complexity and heterogeneity of real-life Web sites, we have also studied several complementary techniques that greatly enhance the effectiveness of matching.
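
The idea of deriving a wrapper from the similarities and differences between two pages of the same class can be illustrated with a small sketch. This is not the matching technique itself, whose details are not given in this abstract. It is a deliberately simplified stand-in that aligns two token sequences with Python's `difflib.SequenceMatcher`, treats shared tokens as the page template, and collapses every differing region into a single data field. The names `tokenize`, `infer_wrapper`, `extract`, and the `#PCDATA` placeholder are assumptions made for this illustration, and the sample pages are invented.

```python
import re
from difflib import SequenceMatcher

FIELD = "#PCDATA"  # placeholder for a data field; name chosen for this sketch


def tokenize(html):
    """Split HTML into tags and text chunks, dropping whitespace-only text."""
    parts = re.split(r"(<[^>]+>)", html)
    return [p.strip() for p in parts if p.strip()]


def infer_wrapper(page_a, page_b):
    """Align two pages: shared tokens form the template, differences become fields."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    wrapper = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            wrapper.extend(a[i1:i2])
        elif not wrapper or wrapper[-1] != FIELD:
            # Simplification: any non-equal region collapses into one field.
            wrapper.append(FIELD)
    return wrapper


def extract(wrapper, page):
    """Apply the wrapper to a page and return the text found at each field."""
    pattern = "".join(
        "(.*?)" if tok == FIELD else r"\s*" + re.escape(tok) + r"\s*"
        for tok in wrapper
    )
    m = re.fullmatch(pattern, page, flags=re.DOTALL)
    return [g.strip() for g in m.groups()] if m else None


# Invented sample pages that share one structure but carry different data.
page1 = "<html><b>Title:</b> Database Primer <i>Author:</i> John Smith</html>"
page2 = "<html><b>Title:</b> Web Data <i>Author:</i> Paul Jones</html>"
page3 = "<html><b>Title:</b> Data on the Web <i>Author:</i> Jane Roe</html>"

wrapper = infer_wrapper(page1, page2)
print(wrapper)
print(extract(wrapper, page3))  # ['Data on the Web', 'Jane Roe']
```

A sketch like this only covers pages whose differences are plain text values. Handling optional or repeated structures, which real-life sites contain, would require a richer alignment than a flat sequence diff.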

Use this identifier to cite or link to this document: http://hdl.handle.net/11563/9625