Clustering Algorithms and Latent Semantic Indexing to Identify Similar Pages in Web Applications

SCANNIELLO, GIUSEPPE;
2007-01-01

Abstract

In this paper, we analyze several clustering algorithms that have been widely employed to support the comprehension of web applications. To this end, we define an approach for identifying static pages that are duplicated or cloned at the content level. The approach is based on a process that first computes the dissimilarity between web pages using Latent Semantic Indexing, a well-known information retrieval technique, and then groups similar pages using clustering algorithms. We consider five instances of this process, based on three variants of the agglomerative hierarchical clustering algorithm, a divisive clustering algorithm, the k-means partitional clustering algorithm, and a widely employed partitional competitive clustering algorithm, namely Winner Takes All. To assess the proposed approach, we used the static pages of three web applications and one static web site.
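
The two-step process described in the abstract (Latent Semantic Indexing to obtain page dissimilarities, then clustering) can be illustrated with a minimal sketch. This is not the authors' implementation. The sample pages, the use of raw term counts, the choice of complete linkage as the agglomerative variant, and the names `lsi_dissimilarity`, `complete_linkage`, `k`, and `threshold` are all assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def lsi_dissimilarity(docs, k=2):
    """Pairwise cosine dissimilarity between documents in a k-dimensional LSI space."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-by-document matrix of raw counts (no term weighting: an assumption).
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[index[w], j] += 1.0
    # Truncated SVD: keep only the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    D = (np.diag(s[:k]) @ Vt[:k]).T          # one row per document
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                  # avoid division by zero
    D = D / norms
    return np.clip(1.0 - D @ D.T, 0.0, 2.0)  # 0 = same direction, 1 = orthogonal

def complete_linkage(diss, threshold):
    """Agglomerative clustering: merge the closest pair of clusters until the
    smallest complete-linkage distance exceeds the (hypothetical) threshold."""
    clusters = [[i] for i in range(len(diss))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(diss[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Invented page contents standing in for the text of static web pages.
pages = [
    "contact us phone email address",
    "contact us email phone office address",
    "product catalog price list",
    "product price catalog order",
]
clusters = complete_linkage(lsi_dissimilarity(pages, k=2), threshold=0.5)
print(clusters)
```

In this toy input the two contact pages share no terms with the two catalog pages, so the sketch should place them in separate groups. The other algorithms named in the abstract (divisive, k-means, Winner Takes All) would replace only the second function, operating on the same LSI representation.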
Year: 2007
ISBN: 9781424414505
Files in this item:
printed paper.pdf (Adobe PDF, 791.94 kB)
Access: authorized users only
Type: Pre-print document
License: DRM not defined

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11563/13688
Citations
  • PMC: ND
  • Scopus: 15
  • ISI: 7