Using a Competitive Clustering Algorithm to Comprehend Web Applications

De Lucia, A.; Scanniello, Giuseppe; Tortora, G.

doi:10.1109/WSE.2006.19

We propose an approach based on Winner Takes All, a competitive clustering algorithm, to support the comprehension of static and dynamic web applications. The process first computes the distances between web pages and then groups similar pages with the Winner Takes All clustering algorithm. Two instances of the process are presented, identifying similar pages at the structural level and at the content level, respectively. The first instance encodes the structure of each page as a string and uses the Levenshtein edit distance to compute the distances between pairs of pages. The second instance, which groups pages that are similar at the content level, uses Latent Semantic Indexing to represent the pages as vectors in the concept space; the Euclidean distance between these vectors then gives the distances between pages that are passed as input to the clustering algorithm. A prototype that automates the identification of groups of similar pages has been implemented. The approach and the prototype have been assessed in a case study.
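
The structural instance of the process can be illustrated with a minimal Python sketch. It is not the authors' prototype, and several details are assumptions made only for illustration: the one-character-per-tag page encoding, the use of each row of the distance matrix as the feature vector of a page, the farthest-first prototype initialization, and the names `distance_matrix` and `winner_takes_all` with their `learning_rate` and `epochs` parameters. The clustering step is the basic competitive-learning rule, in which only the prototype closest to an input (the winner) is moved toward it.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def distance_matrix(encoded: Sequence[str]) -> List[List[float]]:
    """Pairwise Levenshtein distances between encoded page structures."""
    return [[float(levenshtein(p, q)) for q in encoded] for p in encoded]


def _sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(u, v))


def _winner(prototypes: List[List[float]], v: Sequence[float]) -> int:
    """Index of the prototype closest to v."""
    return min(range(len(prototypes)), key=lambda k: _sq_dist(prototypes[k], v))


def winner_takes_all(vectors: Sequence[Sequence[float]], n_clusters: int,
                     learning_rate: float = 0.1, epochs: int = 20) -> List[int]:
    """Cluster vectors by competitive learning and return one label per vector."""
    # Assumption: farthest-first initialization, chosen here for determinism.
    prototypes = [list(vectors[0])]
    while len(prototypes) < n_clusters:
        far = max(vectors, key=lambda v: min(_sq_dist(p, v) for p in prototypes))
        prototypes.append(list(far))
    for _ in range(epochs):
        for v in vectors:
            w = _winner(prototypes, v)
            # Only the winning prototype is updated.
            prototypes[w] = [p + learning_rate * (x - p)
                             for p, x in zip(prototypes[w], v)]
    return [_winner(prototypes, v) for v in vectors]


# Hypothetical encodings: each character stands for one HTML tag.
pages = ["hbttrdd", "hbttrddd", "hbfiis", "hbfiiis"]
labels = winner_takes_all(distance_matrix(pages), n_clusters=2)
```

In this toy input the first two pages differ by one tag, as do the last two, so each pair receives its own label. The content-level instance would follow the same shape, with the Levenshtein matrix replaced by Euclidean distances between Latent Semantic Indexing vectors.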