Extending database-like techniques to semi-structured and Web data sources is becoming a prominent research field. These data sources are essentially collections of textual documents. Hence, in this context, one of the key tasks consists in wrapping documents to build database abstractions of their content that can be manipulated using high-level tools. However, the degree of heterogeneity and the lack of structure make standard grammar parsers excessively rigid, and often unable to capture the richness of constructs in these documents. This paper presents Minerva, a formalism for writing wrappers around Web sites and other textual data sources. The key feature of Minerva is the attempt to couple the benefits of a declarative, grammar-based approach, with the flexibility of procedural programming. This is done by enriching regular grammars with an explicit exception-handling mechanism. Contributions of the paper stand in the definition of the formalism, and in the description of its implementation, which relies on a number of ad-hoc techniques for parsing documents, among which an extension of the traditional LL(1) policy based on dynamic tokenization.
Grammars have Exceptions
MECCA, Giansalvatore
1998-01-01
Abstract
Extending database-like techniques to semi-structured and Web data sources is becoming a prominent research field. These data sources are essentially collections of textual documents. Hence, in this context, one of the key tasks consists in wrapping documents to build database abstractions of their content that can be manipulated using high-level tools. However, the degree of heterogeneity and the lack of structure make standard grammar parsers excessively rigid, and often unable to capture the richness of constructs in these documents. This paper presents Minerva, a formalism for writing wrappers around Web sites and other textual data sources. The key feature of Minerva is the attempt to couple the benefits of a declarative, grammar-based approach, with the flexibility of procedural programming. This is done by enriching regular grammars with an explicit exception-handling mechanism. Contributions of the paper stand in the definition of the formalism, and in the description of its implementation, which relies on a number of ad-hoc techniques for parsing documents, among which an extension of the traditional LL(1) policy based on dynamic tokenization.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.