Clustering algorithms play a vital role in various branches of data mining, especially
in image processing, pattern recognition, codebook optimization, etc. Clustering
algorithms can be broadly classified into ‘hierarchical’ and ‘nonhierarchical’
algorithms. Some algorithms that come under the category of ‘nonhierarchical’
clustering are: (i) partitioning algorithms, like k-means of MacQueen (1967), and
k-medoids of Kaufman and Rousseeuw (1990); (ii) density-based algorithms, like
DBSCAN of Ester et al. (1996) and DENCLUE of Hinneburg and Kein (1998); and
(iii) model-based algorithm due to Frawley and Raftery (2002). Most of the partitive
algorithms partition the given data set into different non-overlapping clusters by
computing the distance between the data objects and the centroid of the clusters, and
assigning the data objects to clusters based on the minimum distance. This process
is carried out iteratively till the convergence in the cluster centroids is obtained. The
choice of the distance measure used and the number of clusters required play a crucial
role in determining the overall efficiency of these algorithms. Also, if the clusters
obtained in any iteration are very large or too small, these algorithms are not capable
of splitting or merging the clusters. Thus it will be useful to have algorithms that
partition the data set into non-overlapping clusters and at the same time be able to
split or merge the clusters based on the representation of data objects. Such
algorithms are referred to as hybrid algorithms.
One of the popular hybrid algorithms that makes use of partitive and merging
processes in the formation of clusters is the ISODATA (Iterative Self-Organizing Data
Analysis Technique). This algorithm makes use of six predefined parameters in order
to split or merge the clusters during any iteration. The process of splitting is done by
considering the magnitude of the standard deviation of the attributes of the data objects in the clusters. It is to be noted that the attributes may have the same standard
deviation but different mean values. Therefore, it would be meaningful to consider the
magnitude of the mean as well as the standard deviation of the attributes for splitting
the clusters. One such measure that makes use of the mean as well as standard
deviation is the coefficient of variation, which is defined as the ratio of the standard
deviation to the mean. Thus, it would be prudent to use coefficient of variation instead
of standard deviation to split the clusters. Based on the above observation, in this
paper, a new ISODATA algorithm has been proposed that makes use of the magnitude
of the coefficient of variation as a measure to split the clusters.
|