Architecture Recovery using Latent Semantic Indexing and K-Means: an Empirical Evaluation

Scanniello, Giuseppe; Risi, M.; Tortora, G.

doi:10.1109/SEFM.2010.19

A number of clustering based approaches and tools have been proposed in the past to partition a software system into subsystems. The greater part of these approaches is semiautomatic, thus requiring human decision to identify the best partition of software entities into clusters among the possible partitions. In addition, some approaches are conceived for software systems implemented using a particular programming language (e.g., C and C++). In this paper we present an approach to automate the partitioning of a given software system into subsystems. In particular, the approach first analyzes the software entities (e.g., programs or classes) and then using Latent Semantic Indexing the dissimilarity between these entities is computed. Finally, software entities are grouped using iteratively the k-means clustering algorithm. The approach has been implemented in a prototype of a supporting software system as an Eclipse plug-in. Finally, to assess the approach and the plug-in, we have conducted an empirical investigation on three open source software systems implemented using the programming languages Java and C/C++.