Identifying Clones in Dynamic Web Sites Using Similarity Thresholds
SCANNIELLO, GIUSEPPE
2004-01-01
Abstract
We propose an approach to automatically detect duplicated pages in dynamic Web sites. The approach analyzes both the page structure, expressed as specific sequences of HTML tags, and the displayed content. For each pair of dynamic pages, we also consider the degree of similarity of their scripting source code. The degree of similarity between two pages is computed by applying a different metric to each part of a Web page, all based on the Levenshtein string edit distance. We implemented a prototype that automates clone detection for Web applications developed with JSP technology and used it to validate the approach in a case study.

File | Access | Type | License | Size | Format |
---|---|---|---|---|---|
ICEIS_2004.pdf | Authorized users only | Pre-print document | DRM not defined | 273.69 kB | Adobe PDF |
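The comparison described in the abstract can be illustrated with a small sketch. It is not the authors' prototype: the normalization to a [0, 1] score, the threshold values, and the helper names (`similarity`, `split_page`, `are_clones`) are assumptions made for illustration. It covers only the structure (tag sequence) and displayed content of plain HTML; scripting code could presumably be compared as a string in the same way.

```python
from html.parser import HTMLParser


def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tags)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map the distance to [0, 1], where 1.0 means identical.

    Dividing by the longer length is an assumed normalization,
    not one taken from the paper.
    """
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


class TagCollector(HTMLParser):
    """Collects the tag sequence and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def split_page(html):
    """Return (tag sequence, displayed text) for an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags, " ".join(collector.text)


def are_clones(page_a, page_b, structure_threshold=0.9, content_threshold=0.9):
    """Flag two pages as clones when both scores reach their thresholds.

    The threshold values are hypothetical defaults.
    """
    tags_a, text_a = split_page(page_a)
    tags_b, text_b = split_page(page_b)
    return (similarity(tags_a, tags_b) >= structure_threshold
            and similarity(text_a, text_b) >= content_threshold)
```

Comparing tag sequences as lists rather than as raw strings means one edit corresponds to one inserted, deleted, or replaced tag, which keeps the structure score independent of tag-name length.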
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.