Automatic Annotation of Data Extracted from Large Web Sites

MECCA, Giansalvatore;
2003-01-01

Abstract

Data extraction from web pages is performed by software modules called wrappers. Several systems for the automatic generation of wrappers have recently been proposed in the literature. These systems are based on unsupervised inference techniques: given a small set of sample pages as input, they produce a common wrapper that extracts the relevant data. However, because of the automatic nature of the approach, the data extracted by these wrappers have anonymous names. Within our ongoing RoadRunner project, we have developed a prototype, called Labeller, that automatically annotates the data extracted by automatically generated wrappers. Although Labeller was developed as a companion system to our wrapper generator, its underlying approach is general and can therefore be applied together with other wrapper generation systems. We have tested the prototype on several real-life web sites, obtaining encouraging results.

Use this identifier to cite or link to this document: https://hdl.handle.net/11563/9528