RoadRunner: Automatic Data Extraction from Data-Intensive Web Sites

Crescenzi, V; Mecca, Giansalvatore; Merialdo, P.

doi:10.1145/564691.564778

Data extraction from HTML pages is performed by software modules usually called wrappers. Roughly speaking, a wrapper identifies and extracts relevant pieces of text inside a Web page and reorganizes them in a more structured format, for example an XML document. Several researchers have proposed solutions to ease the burden of writing wrappers, and the literature describes a number of systems that (semi-)automatically generate wrappers for HTML pages. We have recently investigated original approaches that aim at pushing further the level of automation of the wrapper generation process. Our main intuition is that, in a data-intensive Web site, pages can be classified into a small number of classes, such that pages belonging to the same class share a rather tight structure. Based on this observation, we have studied a novel technique, which we call the matching technique, that automatically generates a common wrapper by exploiting similarities and differences among pages of the same class. In addition, in order to deal with the complexity and the heterogeneities of real-life Web sites, we have also studied several complementary techniques that greatly enhance the effectiveness of matching.
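
The idea of deriving a common wrapper from the similarities and differences among pages of one class can be illustrated with a small sketch. This is not the matching technique itself, whose details the abstract does not give. It is a simplified stand-in that aligns the token sequences of two pages with Python's standard `difflib.SequenceMatcher`, keeps the tokens they share as the template, and marks every region where they differ as a data field. The names `tokenize`, `infer_template`, and the `#FIELD` placeholder are hypothetical, and the sketch makes no attempt to handle repeated or optional structures.

```python
import re
from difflib import SequenceMatcher

# A token is either an HTML tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<\s][^<]*")


def tokenize(html):
    """Split a page into a flat list of tags and trimmed text strings."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def infer_template(page_a, page_b):
    """Align two pages and return the shared tokens, with '#FIELD' marking
    each region where the pages differ (assumed here to be extracted data)."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    template = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            # Tokens common to both pages belong to the template.
            template.extend(a[i1:i2])
        else:
            # Any replace/insert/delete region is collapsed into one field.
            template.append("#FIELD")
    return template


page_a = "<html><b>Title:</b> Dune <i>1965</i></html>"
page_b = "<html><b>Title:</b> Emma <i>1815</i></html>"
template = infer_template(page_a, page_b)
print(template)
```

For the two sample pages, the shared tags and the label `Title:` survive into the template, while the title and year, which differ between the pages, each become a field.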