Comparing Clustering Algorithms for the Identification of Similar Pages in Web Applications

A. De Lucia; M. Risi; G. Scanniello; G. Tortora

2007-01-01

Abstract

In this paper, we analyze several widely used clustering algorithms for identifying duplicated or cloned pages in web applications. Specifically, we consider an agglomerative hierarchical clustering algorithm, a divisive clustering algorithm, the k-means partitional clustering algorithm, and a partitional competitive clustering algorithm, namely Winner Takes All (WTA). All the clustering algorithms take as input a matrix of the distances between the structures of the web pages. The distance between two pages is computed by applying the Levenshtein edit distance to the strings that encode the sequences of HTML tags of the web pages.
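
The distance computation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: each opening or closing tag is treated as one symbol of the encoded string, all edit operations have unit cost, and the agglomerative step uses single linkage with a hypothetical cut-off `threshold`. The function names (`tag_sequence`, `levenshtein`, `distance_matrix`, `agglomerative`) are illustrative only.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collect opening and closing tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Encode a page as its sequence of HTML tags (assumed encoding)."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tags


def levenshtein(a, b):
    """Edit distance between two sequences with unit-cost operations."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution
        prev = curr
    return prev[-1]


def distance_matrix(pages):
    """Symmetric matrix of pairwise tag-sequence distances."""
    seqs = [tag_sequence(p) for p in pages]
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = levenshtein(seqs[i], seqs[j])
    return dist


def agglomerative(dist, threshold):
    """Single-linkage merging (an assumption) until the closest pair
    of clusters is farther apart than `threshold`."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


pages = [
    "<html><body><p>Hello</p></body></html>",
    "<html><body><p>World</p></body></html>",
    "<html><body><table><tr><td>x</td></tr></table></body></html>",
]
dist = distance_matrix(pages)
clusters = agglomerative(dist, threshold=2)
```

In this toy input the first two pages share the same tag structure and differ only in text, so their distance is zero and they are grouped together, while the table-based page remains in its own cluster. The same matrix could equally be passed to a divisive, k-means, or WTA procedure.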

Year of publication: 2007
ISBN: 9783540735960
Appears in types: 4.1 Conference proceedings contribution

Files in this record:

File	Size	Format	Access
printed paper.pdf (post-print; license: DRM not defined)	207.53 kB	Adobe PDF	Authorized users only

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11563/13752
