About this book

This text examines the goals of data analysis with respect to enhancing knowledge, and identifies data summarization and correlation analysis as the core issues. Data summarization, both quantitative and categorical, is treated within the encoder-decoder paradigm, bringing forward a number of mathematically supported insights into the methods and the relations between them. Two chapters describe methods for categorical summarization (partitioning, divisive clustering and separate cluster finding), and another explains the methods for quantitative summarization, Principal Component Analysis and PageRank.

Features:

· An in-depth presentation of K-means partitioning including a corresponding Pythagorean decomposition of the data scatter.

· Advice regarding such issues as clustering of categorical and mixed scale data, similarity and network data, interpretation aids, anomalous clusters, the number of clusters, etc.

· Thorough attention to data-driven modelling, including a number of mathematically stated relations between statistical and geometrical concepts, such as those between goodness-of-fit criteria for decision trees and data standardization, similarity and consensus clustering, modularity clustering and uniform partitioning.

New edition highlights:

· Inclusion of ranking issues such as Google PageRank, linear stratification and tied rankings median, consensus clustering, semi-average clustering, one-cluster clustering

· Restructured to make the logic more straightforward and sections self-contained

Core Data Analysis: Summarization, Correlation and Visualization is aimed at those who are eager to participate in developing the field, as well as at novices and practitioners.

Table of Contents

Chapter 1. Topics in Substance of Data Analysis

Abstract
This is an introductory chapter in which:
(i) The goals of core data analysis as a tool helping to enhance and augment knowledge of the domain are outlined. Since knowledge is represented by the concepts and statements of relation between them, two main pathways for data analysis are summarization, for developing and augmenting concepts, and correlation, for enhancing and establishing relations.
(ii) A set of eight cases involving small datasets and related data analysis problems is presented. The datasets are taken from various fields such as monitoring market towns, computer security protocols, bioinformatics, and cognitive psychology.
(iii) An overview of data visualization, its goals and some techniques, is given.
(iv) A general view of strengths and pitfalls of data analysis is provided.
(v) An overview of the concept of classification as a soft knowledge structure widely used in theory and practice is given.
Boris Mirkin

Chapter 2. Quantitative Summarization

Abstract
Before going into the thick of multivariate summarization, this chapter first considers the concept of a feature and its summarization into histograms, density functions and centers. Two perspectives, the probabilistic and the vector-space ones, are defined for the concepts of feature centers and spreads. Also, current views on the types of measurement scales are described to conclude that binary scales are both quantitative and categorical. The core of the Chapter describes the method of principal components (PCA) as a method for fitting a data-driven data summarization model. The model proposes that the data entries, up to the errors, are (sums of) products of hidden factor scores and feature loadings. This, together with the least-squares fitting criterion, appears to be equivalent to finding what is known in mathematics as part of the singular value decomposition (SVD) of a rectangular matrix. Three applications of the method are described: (1) scoring hidden aggregate factors, (2) visualization of the data, and (3) Latent Semantic Indexing. The conventional, and equivalent, formulation of PCA via covariance matrices and their eigenvalues is also described. The main difference between the two formulations is that the property of principal components to be linear combinations of features is postulated in the conventional approach and derived in the SVD-based one. The issue of interpretation of the results is discussed, too. A novel promising approach based on a postulated linear model of stratification is presented via a project. The issue of data standardization in data summarization problems, which remains unsolved, is discussed at length at the beginning. A powerful application using eigenvectors for scoring node importance in networks and pair comparison matrices, the Google PageRank approach, is described too.
Boris Mirkin
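
As a rough illustration of the PCA formulation summarized in the Chapter 2 abstract, the sketch below fits the "factor scores times feature loadings" model through NumPy's SVD and checks two statements from the abstract: the scores come out as linear combinations of the features, and the squared singular values coincide with the eigenvalues of the feature cross-product matrix used in the conventional formulation. The function name `pca_via_svd`, the synthetic data, and the choice to center the features without rescaling are assumptions made for illustration; this is not the book's code.

```python
import numpy as np

def pca_via_svd(X, n_components=2):
    """Least-squares rank-k summary of X: Y ~ scores @ loadings (illustrative)."""
    Y = X - X.mean(axis=0)            # assumption: center features, no rescaling
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # hidden factor scores
    loadings = Vt[:n_components]                       # feature loadings (one row per factor)
    share = s[:n_components] ** 2 / np.sum(s ** 2)     # share of the data scatter per factor
    return Y, scores, loadings, share, s

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))          # synthetic data, for illustration only
Y, scores, loadings, share, s = pca_via_svd(X)

# Derived property: each score vector is a linear combination of the features.
assert np.allclose(scores, Y @ loadings.T)

# Equivalence with the conventional formulation: squared singular values are
# the eigenvalues of Y^T Y (the covariance matrix up to a constant factor).
eigvals = np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]
assert np.allclose(eigvals, s ** 2)
```

The `share` values give the fraction of the data scatter taken into account by each retained component, which is one simple aid when deciding how many components to keep.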

Chapter 3. Learning Correlations

Abstract
After a short introduction of the general concept of a decision rule relating input and target features, this chapter describes some generic and most popular methods for learning correlations over two or more features. Four of them pertain to quantitative targets (linear regression, canonical correlation, neural network, and regression tree), and seven to categorical ones (linear discrimination, support vector machine, naïve Bayes classifier, classification tree, contingency table, distance between partition and ranking relations, and correspondence analysis). Of these, classification trees are treated in the greatest detail, including a number of theoretical results that are not well known. These establish firm relations between popular scoring functions and bivariate measures, namely Quetelet indexes in contingency tables and, rather unexpectedly, normalization options for dummy variables representing target categories. Some related concepts such as Bayesian decision rules, the bag-of-words model in text analysis, VC-dimension and kernels for non-linear classification are introduced too. The Chapter outlines several important characteristics of summarization and correlation between two features, and displays some of their properties. They are:
  • linear regression and correlation coefficient for two quantitative variables (Sect. 3.2);
  • tabular regression and correlation ratio for the mixed scale case (Sect. 3.8.3); and
  • contingency table, Quetelet index, statistical independence, and Pearson’s chi-squared for two nominal variables; the latter is treated as a summary correlation measure, in contrast to the conventional view of it as just a criterion of statistical independence (Sect. 3.6.1); moreover, a few lesser-known least-squares based concepts are outlined, including canonical correlation and correspondence analysis.
Boris Mirkin
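
One way to see why Pearson's chi-squared can be read as a summary correlation measure, as the Chapter 3 abstract puts it, is a small numerical check. The sketch below computes the relative Quetelet index of each cell of a contingency table, that is, the relative change in the frequency of a column category once the row category is known, and verifies that its average over the table equals chi-squared divided by the number of observations. The function name `quetelet_summary` and the 2×2 table are made up for illustration, and the code assumes all marginal frequencies are positive.

```python
import numpy as np

def quetelet_summary(counts):
    """Quetelet indexes and Pearson's chi-squared for a contingency table."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts / n                              # joint relative frequencies
    p_row = p.sum(axis=1, keepdims=True)        # row marginals (assumed positive)
    p_col = p.sum(axis=0, keepdims=True)        # column marginals (assumed positive)
    expected = p_row * p_col                    # frequencies under independence
    q = p / expected - 1.0                      # relative Quetelet index per cell
    phi2 = np.sum((p - expected) ** 2 / expected)   # chi-squared divided by n
    avg_q = np.sum(p * q)                       # Quetelet index averaged over the table
    return q, phi2, avg_q, n * phi2

table = [[30, 10], [10, 50]]                    # hypothetical counts
q, phi2, avg_q, chi2 = quetelet_summary(table)

# Chi-squared / n coincides with the average Quetelet index.
assert np.isclose(phi2, avg_q)
```

A positive entry of `q` means the row category raises the frequency of the column category relative to its overall frequency; a negative entry means it lowers it.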

Chapter 4. Core Partitioning: K-means and Similarity Clustering

Abstract
K-means is arguably the most popular cluster-analysis method. The method’s output is twofold: (1) a partition of the entity set into clusters, and (2) centers representing the clusters. The method is rather intuitive and usually takes just a few pages to present. In contrast, this text includes a number of less popular subjects that are very important when using K-means for real-world data analysis:
  • Data standardization, especially for nominal or mixed scales
  • Innate and other tools for interpretation of clusters
  • Analysis of examples of K-means working and its failures
  • Initialization—the choice of the number of clusters and location of centers.
Versions of K-means such as incremental K-means, nature-inspired K-means, and entity-center “medoid” methods are presented. Three modifications of K-means for different cluster structures are given: Fuzzy K-means for finding fuzzy clusters, Expectation-Maximization (EM) for finding probabilistic clusters, and Kohonen’s self-organizing maps (SOM), which tie the sought clusters to a visually convenient two-dimensional grid. An equivalent reformulation of the K-means criterion is described to yield what we call the complementary criterion. This criterion allows one to reinterpret the method as one for finding big anomalous clusters. In this formulation, K-means is shown to extend the principal component analysis criterion to the case in which the scoring factors are supposed to be binary. This makes it possible to address a haunting issue in K-means, finding the “right” number of clusters K, by building Anomalous clusters one by one. Section 4.6 is devoted to partitioning over similarity data. First of all, the complementary K-means criterion is equivalently reformulated as the so-called semi-average similarity criterion. This criterion is maximized with a consecutive merger process, referred to as SA-Agglomeration clustering, to produce clusters that are provably tight on average. If the data have been pre-processed by zeroing the similarities of the objects to themselves, this method stops merging clusters when the criterion no longer increases. A similar process is considered for another natural criterion, the summary within-cluster similarity, for which two pre-processing options are considered. These are a popular “modularity” clustering option, based on subtraction of random interactions, and “uniform” partitioning, based on a scale shift, a.k.a. soft thresholding. Using either pre-processing option, the summary clustering also leads to an automated determination of the number of clusters. The chapter concludes with Sect. 4.7 on consensus clustering, a more recent concept. In the context of central partition for a given ensemble of partitions, two distance-between-partitions measures apply, both involving the so-called consensus matrix. The consensus similarity is defined, for any two objects, by the number of clusters in the ensemble to which both objects belong. This brings the issue of consensus into the context of similarity clustering, in the form of either the semi-average criterion or the uniform partitioning criterion.
Boris Mirkin
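
The Pythagorean decomposition of the data scatter and the complementary criterion mentioned for Chapter 4 can be checked numerically with a minimal sketch. It runs plain batch K-means and verifies that the data scatter T splits into a within-cluster part W, which K-means minimizes, and a part B explained by the centers, which is therefore maximized. The function name `kmeans`, the random choice of entities as initial centers, and the synthetic two-blob data are assumptions for illustration rather than the book's recommended initialization.

```python
import numpy as np

def kmeans(Y, k, n_iter=100, seed=0):
    """Batch K-means: alternate nearest-center assignment and center update."""
    rng = np.random.default_rng(seed)
    # Assumption: initial centers are k distinct entities picked at random.
    centers = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)              # nearest center for each entity
        new_centers = np.array([
            Y[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)                     # an empty cluster keeps its old center
        ])
        converged = np.allclose(new_centers, centers)
        centers = new_centers                     # centers are now the cluster means
        if converged:
            break
    return labels, centers

rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(6, 1, (40, 2))])
Y = Y - Y.mean(axis=0)                            # center the data at the grand mean
labels, centers = kmeans(Y, k=2)

sizes = np.bincount(labels, minlength=2)
T = np.sum(Y ** 2)                                # data scatter
W = np.sum((Y - centers[labels]) ** 2)            # within-cluster squared error
B = np.sum(sizes * np.sum(centers ** 2, axis=1))  # part explained by the centers
assert np.isclose(T, B + W)
```

Because T is fixed by the data, minimizing W and maximizing B are the same problem; B rewards clusters that are both large and far from the grand mean, which is the sense in which the reformulated method looks for big anomalous clusters.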

Chapter 5. Divisive and Separate Cluster Structures

Abstract
This Chapter is about dividing a dataset or its subset in two parts. If both parts are to be clusters, this is referred to as divisive clustering. If just one part is to be a cluster, this is referred to as separative clustering. Iterative application of divisive clustering builds a binary hierarchy, from which we will be interested in a partition of the dataset. Iterative application of separative clustering builds a set of clusters, possibly overlapping. The first three sections introduce three different approaches in divisive clustering: Ward clustering, Spectral clustering and Single link clustering. Ward clustering is an extension of K-means clustering dominated by the so-called Ward distance between clusters; also, this is a natural niche for conceptual clustering in which every division is made over a single feature to attain immediate interpretability of the hierarchy branches and clusters. Spectral clustering gained popularity with the so-called Normalized Cut approach to divisive clustering. A relaxation of this combinatorial problem appears to be equivalent to optimizing the Rayleigh quotient for a Laplacian transformation of the similarity matrix under consideration. In fact, other approaches under consideration, such as uniform clustering and semi-average clustering, also may be treated within the spectral approach. Single link clustering formalizes the nearest neighbor approach and is closely related to graph-theoretic concepts: components and maximum spanning trees. One may think of divisive clustering as a process for building a binary hierarchy, which goes “top-down” in contrast to agglomerative clustering (in Sect. 4.6), which builds a binary hierarchy “bottom-up”. The two remaining sections describe two separative clustering approaches as extensions of popular approaches to this case. One tries to find a cluster with maximum inner summary similarity over a similarity matrix preprocessed according to the uniform and modularity approaches considered in Sect. 4.6.3. The other applies the encoder-decoder least-squares approach to modeling data by a one-cluster structure. It appears that the criteria emerging within the latter approach are closely akin to those described earlier, the summary and semi-average similarities, although the parameters can now be adjusted according to the least-squares approach. This applies to a distinct direction, the so-called additive clustering approach, which can be usefully applied to the analysis of similarity data.
Boris Mirkin
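
The spectral relaxation mentioned in the Chapter 5 abstract can be illustrated with a minimal sketch. It assumes the standard symmetric normalized Laplacian, which is not necessarily the Laplacian transformation used in the book, and splits the entities by the sign of the eigenvector that minimizes the Rayleigh quotient among vectors orthogonal to the trivial solution. The function name `spectral_bisection` and the toy similarity matrix are hypothetical.

```python
import numpy as np

def spectral_bisection(A):
    """Split entities in two by the sign of the second eigenvector (illustrative)."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                      # assumption: all row sums are positive
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    # The eigenvector of the second-smallest eigenvalue minimizes the Rayleigh
    # quotient subject to orthogonality to the trivial eigenvector; the
    # rescaling undoes the D^{1/2} change of variable.
    z = eigvecs[:, 1] * d_inv_sqrt
    return z >= 0, eigvals[1]

# Toy similarity matrix: two tight groups {0,1,2} and {3,4,5}, weakly linked.
A = np.full((6, 6), 0.05)
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)

split, lam = spectral_bisection(A)

# Each group lands entirely on one side, and the two groups on opposite sides.
assert len(set(split[:3])) == 1 and len(set(split[3:])) == 1
assert split[0] != split[3]
```

Applying the same split recursively to each part would build a binary hierarchy top-down, in the divisive fashion the abstract describes.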