Automatic Information Extraction from Large Websites

MECCA, Giansalvatore
2004

Abstract

Information extraction from Web sites is an important problem, usually handled by software modules called wrappers. A key requirement is that wrapper generation be automated to the largest possible extent, so that large-scale extraction tasks remain feasible even when the underlying sites change. So far, however, only semi-automatic proposals have appeared in the literature. We present a novel approach to information extraction from Web sites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised (i.e., fully automatic) wrapper generation algorithms. However, because they rely on some unrealistic assumptions about the input, existing grammar inference algorithms are not practically applicable to Web information extraction tasks. The main contributions of the paper lie in the definition of a class of regular languages, called the prefix mark-up languages, that abstract the structures usually found in HTML pages, and in the definition of a polynomial-time unsupervised learning algorithm for this class. The paper shows that, unlike other known classes, prefix mark-up languages and the associated algorithm can be used in practice for information extraction. A system based on the techniques described in the paper has been implemented as a working prototype. We present some experimental results on known Web sites, and discuss the opportunities and limitations of the proposed approach.
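To make the notion of a wrapper concrete, the following is a minimal sketch of a hand-written one, expressed as a regular expression over page mark-up. It is not the paper's algorithm, and the pattern is not a prefix mark-up language as defined there; the sample pages, field names, and the `extract` helper are all hypothetical. It only illustrates the kind of artifact that an unsupervised wrapper generator would aim to infer automatically from pages that share a template.

```python
import re

# Hypothetical pages generated from one template (not taken from the paper).
PAGES = [
    "<html><body><h1>Databases</h1><ul><li>Alice</li><li>Bob</li></ul></body></html>",
    "<html><body><h1>Logic</h1><ul><li>Carol</li></ul></body></html>",
]

# Hand-written wrapper: fixed mark-up with slots for the data values.
# The repeated <li> group plays the role of an iterated (list) structure.
PAGE_PATTERN = re.compile(
    r"<html><body><h1>(?P<title>[^<]*)</h1>"
    r"<ul>(?P<items>(?:<li>[^<]*</li>)+)</ul></body></html>"
)
ITEM_PATTERN = re.compile(r"<li>([^<]*)</li>")


def extract(page):
    """Return the data fields of a page, or None if it does not fit the template."""
    match = PAGE_PATTERN.fullmatch(page)
    if match is None:
        return None
    return {
        "title": match.group("title"),
        "authors": ITEM_PATTERN.findall(match.group("items")),
    }


records = [extract(page) for page in PAGES]
for record in records:
    print(record)
```

Writing such a pattern by hand for every site, and repairing it whenever the site layout changes, is the manual effort that automatic wrapper generation is meant to remove.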
Files in this record:
jacm04-finalACM.pdf
Availability: not available
Type: Post-print document
License: DRM not defined
Size: 1.12 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: http://hdl.handle.net/11563/1586