2019  OriginalPaper  Chapter
4. Core Partitioning: K-means and Similarity Clustering
Abstract
K-means is arguably the most popular cluster-analysis method. The method’s output is twofold: (1) a partition of the entity set into clusters, and (2) centers representing the clusters. The method is rather intuitive and usually requires just a few pages to present. In contrast, this text includes a number of less popular subjects that are highly important when using K-means for real-world data analysis:

Data standardization, especially at nominal or mixed scales

Innate and other tools for interpretation of clusters

Analysis of examples of K-means at work, and of its failures

Initialization: the choice of the number of clusters and the location of centers.
Versions of K-means such as incremental K-means, nature-inspired K-means, and entity-center “medoid” methods are presented. Three modifications of K-means to different cluster structures are given: Fuzzy K-means for finding fuzzy clusters, Expectation-Maximization (EM) for finding probabilistic clusters, and Kohonen’s self-organizing maps (SOM), which tie the sought clusters to a visually convenient two-dimensional grid.

An equivalent reformulation of the K-means criterion is described to yield what we call the complementary criterion. This criterion allows one to reinterpret the method as one for finding big anomalous clusters. In this formulation, K-means is shown to extend the Principal Component Analysis criterion to the case in which the scoring factors are supposed to be binary. This makes it possible to address a haunting issue of K-means, finding the “right” number of clusters K, by building Anomalous clusters one by one.

Section 4.6 is devoted to partitioning over similarity data. First of all, the complementary K-means criterion is equivalently reformulated as the so-called semi-average similarity criterion. This criterion is maximized with a consecutive merger process, referred to as SA-Agglomeration clustering, to produce clusters that are provably tight on average. If the data has been preprocessed by zeroing the similarities of the objects to themselves, this method stops merging clusters as soon as the criterion no longer increases. A similar process is considered for another natural criterion, the summary within-cluster similarity, for which two preprocessing options are considered: a popular “modularity” clustering option, based on the subtraction of random interactions, and “uniform” partitioning, based on a scale shift, a.k.a. soft thresholding. Using either preprocessing option, summary clustering also leads to an automated determination of the number of clusters. The chapter concludes with Sect. 4.7 on consensus clustering, a more recent concept.
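As a baseline for the versions and modifications discussed above, plain batch K-means (alternating assignment and center-update steps) can be sketched as follows. This is a minimal illustration in Python with NumPy, not the chapter’s own code; the function name, random initialization, and defaults are our assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Batch K-means sketch: alternate between (1) assigning each entity
    to its nearest center and (2) recomputing each center as the mean
    of its cluster, until the centers stop moving."""
    rng = np.random.default_rng(seed)
    # initialize centers at k distinct entities (one common choice)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest center by squared Euclidean distance
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # update step: each center becomes its cluster's mean;
        # an empty cluster keeps its previous center
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The incremental, nature-inspired, and medoid variants mentioned above change the update step or the search strategy, but keep this same two-step alternation at their core.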
In the context of finding a central partition for a given ensemble of partitions, two distance-between-partitions measures apply, both involving the so-called consensus matrix. The consensus similarity between any two objects is defined as the number of clusters in the ensemble to which both objects belong. This brings the issue of consensus into the context of similarity clustering, in the form of either the semi-average criterion or the uniform partitioning criterion.
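The consensus matrix itself is straightforward to compute from an ensemble of partitions. Below is an illustrative sketch, assuming each partition is given as a vector of cluster labels; the function name and representation are ours, not the chapter’s notation.

```python
import numpy as np

def consensus_matrix(partitions, n):
    """Consensus matrix for an ensemble of partitions over n objects:
    entry (i, j) counts the partitions in which objects i and j fall
    into the same cluster. Each partition is a length-n label vector."""
    A = np.zeros((n, n), dtype=int)
    for labels in partitions:
        labels = np.asarray(labels)
        # pairwise same-cluster indicator for this partition
        A += (labels[:, None] == labels[None, :]).astype(int)
    return A
```

The resulting symmetric matrix can then be fed to a similarity-clustering criterion such as the semi-average or uniform partitioning criterion described above.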