Wrapping-Oriented Classification of Web Pages

MECCA, Giansalvatore;
2002-01-01

Abstract

Data extraction from HTML Web pages is performed by software programs called wrappers. Writing wrappers is a costly and labor-intensive task; recently, several proposals have attacked the problem of automatically generating wrappers. In this paper, we study a problem related to the automation of the wrapper generation process: given a portion of a Web site to wrap, we develop techniques to cluster its HTML pages into page classes with homogeneous organization and layout; these classes can become the input to the wrapper generation process. Also, once a wrapper library has been generated for a set of Web sites, our techniques can be used to select, for any new page downloaded from these sites, the right wrapper in the library. Based on the proposed techniques, we have developed a software prototype and conducted several experiments on HTML pages from real-life Web sites.
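
The two uses described in the abstract, grouping pages into classes with a homogeneous layout and picking the matching class for a new page, can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not detail. It assumes, purely for illustration, that a page's layout can be summarized by its set of root-to-element tag paths, that two pages are compared with Jaccard similarity, and that a greedy single pass with a fixed threshold is an acceptable clustering strategy. The names `tag_paths`, `cluster_pages`, `best_class` and the threshold value `0.6` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; stray closing
        # tags with no matching open tag are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Structural signature of a page (assumed feature, not from the paper)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity between two sets of tag paths."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.6):
    """Greedy clustering: pages is a dict name -> HTML string.

    Returns a list of (representative_paths, member_names) pairs. Each page
    joins the first class whose representative is similar enough, otherwise
    it starts a new class.
    """
    clusters = []
    for name, html in pages.items():
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(paths, representative) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((paths, [name]))
    return clusters


def best_class(html, clusters):
    """Index of the page class (and hence wrapper) most similar to a new page."""
    paths = tag_paths(html)
    scores = [jaccard(paths, representative) for representative, _ in clusters]
    return max(range(len(clusters)), key=scores.__getitem__)
```

With one wrapper stored per page class, `best_class` would give the index of the wrapper to apply to a newly downloaded page. A real system would likely need a richer page signature and a rejection threshold for pages that fit no known class.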
2002
1581134452

Use this identifier to cite or link to this document: https://hdl.handle.net/11563/9529