Identifying Similar Pages in Web Applications using a Competitive Clustering Algorithm

IRIS

We present an approach based on Winner Takes All (WTA), a competitive clustering algorithm, to support the comprehension of static and dynamic Web applications during reengineering. The approach first computes the distance between Web pages and then uses the clustering algorithm to identify and group similar pages. We apply the process to the identification of pages that are similar at the structural level: the structure of each page is encoded as a string of HTML tags, and the distance between two pages is computed as the Levenshtein edit distance between their strings. We have implemented a prototype that automates the clustering process and can be extended to other instances of it, such as the identification of groups of pages that are similar at the content level. The approach and the tool have been evaluated in two case studies. The results show that the WTA clustering algorithm suggests heuristics that make it easy to identify the best partition of Web pages into clusters among the possible partitions.
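The structural-level process summarized in the abstract (encode each page as a sequence of HTML tags, compute Levenshtein distances, group pages with a winner-takes-all rule) might be sketched roughly as follows. This is a minimal illustration, not the authors' prototype: the abstract does not say how tags are encoded or how pages are presented to WTA, so the tag encoding, the use of each page's row of the distance matrix as its feature vector, the seed prototypes, the learning rate, the number of epochs, and all function names are assumptions made for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records opening and closing tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Encode a page as its sequence of tags (assumed encoding)."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def levenshtein(a, b):
    """Edit distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wta_clusters(vectors, seeds, rate=0.5, epochs=10):
    """Winner-takes-all: only the closest prototype moves toward each input.

    Prototypes start at the vectors indexed by `seeds` (an assumption).
    """
    prototypes = [list(vectors[s]) for s in seeds]

    def winner(v):
        return min(range(len(prototypes)),
                   key=lambda k: sum((p - x) ** 2
                                     for p, x in zip(prototypes[k], v)))

    for _ in range(epochs):
        for v in vectors:
            k = winner(v)
            prototypes[k] = [p + rate * (x - p)
                             for p, x in zip(prototypes[k], v)]
    return [winner(v) for v in vectors]


# Hypothetical pages: two table-based layouts and two list-based layouts.
pages = [
    "<html><body><table><tr><td>a</td></tr></table></body></html>",
    "<html><body><table><tr><td>b</td><td>c</td></tr></table></body></html>",
    "<html><body><ul><li>x</li><li>y</li></ul></body></html>",
    "<html><body><ul><li>x</li><li>y</li><li>z</li></ul></body></html>",
]
sequences = [tag_sequence(p) for p in pages]
# Each page is described by its distances to every page (assumed features).
distances = [[levenshtein(s, t) for t in sequences] for s in sequences]
labels = wta_clusters(distances, seeds=[0, 2])
print(labels)
```

Seeding the prototypes from existing pages keeps the sketch deterministic. Choosing how many clusters to use is the problem the abstract says the WTA heuristics address, and it is not modeled here.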

A. De Lucia; G. Scanniello; G. Tortora

2009-01-01

Publication year

2009

Appears in types:

1.1 Journal article

Files in this item:

File	Size	Format
PrintedPaper.pdf (open access; type: post-print document; license: DRM not defined)	209.11 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11563/5760
