This textbook on practical data analytics unites fundamental principles, algorithms, and data. Algorithms are the keystone of data analytics and the focal point of this textbook. Clear and intuitive explanations of the mathematical and statistical foundations make the algorithms transparent. But practical data analytics requires more than just the foundations. Problems and data are enormously variable, and only the most elementary of algorithms can be used without modification. Programming fluency and experience with real and challenging data are indispensable, so the reader is immersed in Python and R and in real data analysis. By the end of the book, the reader will have gained the ability to adapt algorithms to new problems and carry out innovative analyses.
This book has three parts:
(a) Data Reduction: Begins with the concepts of data reduction, data maps, and information extraction. The second chapter introduces associative statistics, the mathematical foundation of scalable algorithms and distributed computing. Practical aspects of distributed computing are the subject of the Hadoop and MapReduce chapter.
(b) Extracting Information from Data: Linear regression and data visualization are the principal topics of Part II. The authors dedicate a chapter to the critical domain of Healthcare Analytics for an extended example of practical data analytics. The algorithms and analytics will be of much interest to practitioners interested in utilizing the large and unwieldy data sets of the Centers for Disease Control and Prevention's Behavioral Risk Factor Surveillance System.
(c) Predictive Analytics: Two foundational and widely used algorithms, k-nearest neighbors and naive Bayes, are developed in detail. A chapter is dedicated to forecasting. The last chapter focuses on streaming data and uses publicly accessible data streams originating from the Twitter API and the NASDAQ stock market in the tutorials.
This book is intended for a one- or two-semester course in data analytics for upper-division undergraduate and graduate students in mathematics, statistics, and computer science. The prerequisites are kept low, and students with one or two courses in probability or statistics, an exposure to vectors and matrices, and a programming course will have no difficulty. The core material of every chapter is accessible to all with these prerequisites. The chapters often expand at the close with innovations of interest to practitioners of data science. Each chapter includes exercises of varying levels of difficulty. The text is eminently suitable for self-study and an exceptional resource for practitioners.

Chapter 1. Introduction

Abstract
The beginning of the twenty-first century will be remembered for dramatic and rapid technological advances in automation, instrumentation, and the internet. One consequence of these technological developments is the appearance of massively large data sets and data streams. The potential exists for extracting new information and insights from these data. But new ideas and methods are needed to meet the substantial challenges posed by the data. In response, data science has formed from statistics and computer science. Algorithms play a vastly important and uniting role in data analytics and in the narrative of this book. In this chapter, we expand on these topics and provide examples from healthcare, history, and business analytics. We conclude with a short discussion of algorithms, remarks on programming languages, and a brief review of matrix algebra.
Brian Steele, John Chandler, Swarna Reddy

Chapter 2. Data Mapping and Data Dictionaries

Abstract
This chapter delves into the key mathematical and computational components of data analytic algorithms. The purpose of these algorithms is to reduce massively large data sets to much smaller data sets with a minimal loss of relevant information. From the mathematical perspective, a data reduction algorithm is a sequence of data mappings, that is, functions that consume data in the form of sets and output data in a reduced form. The mathematical perspective is important because it imposes certain desirable attributes on the mappings. However, most of our attention is paid to the practical aspects of turning the mappings into code. The mathematical and computational aspects of data mappings are applied through the use of data dictionaries. The tutorials of this chapter help the reader develop familiarity with data mappings and Python dictionaries.
Brian Steele, John Chandler, Swarna Reddy
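
The idea of a data mapping that reduces a data set into a dictionary, described in the abstract above, can be illustrated with a minimal Python sketch. The function name, the record layout, and the toy records are all hypothetical and are not taken from the book's tutorials.

```python
from collections import defaultdict

def reduce_to_dictionary(records):
    """Map (key, value) pairs to a dictionary of per-key [count, total]."""
    summary = defaultdict(lambda: [0, 0.0])
    for key, value in records:
        summary[key][0] += 1      # number of records seen for this key
        summary[key][1] += value  # running total for this key
    return dict(summary)

# Hypothetical toy records: (group label, measurement).
records = [("MT", 2.0), ("WA", 3.0), ("MT", 4.0)]
summary = reduce_to_dictionary(records)
print(summary)
```

The dictionary holds one entry per distinct key, so its size depends on the number of groups rather than on the number of records.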

Chapter 3. Scalable Algorithms and Associative Statistics

Abstract
It’s not uncommon that a single computer is inadequate to handle a massively large data set. The common problems are that it takes too long to process the data and the data volume exceeds the storage capacity of the host. Cleverly designed algorithms sometimes can reduce the processing time to an acceptable point, but the single host solution will eventually fail if data volume is sufficiently great. A far-reaching solution to the data volume problem replaces the single host with a network of computers across which the data are distributed and processed. However, the hardware solution is incomplete until the data processing algorithms are adapted to the distributed computing environment. A complete solution requires algorithms that are scalable. Scalability depends on the statistics that are being computed by the algorithm, and the statistics that allow for scalability are associative statistics. Scalability and associative statistics are the subject of this chapter.
Brian Steele, John Chandler, Swarna Reddy
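
One way to see why associative statistics permit scalability is sketched below in Python. The choice of statistics (count, sum, and sum of squares) and all function names are illustrative assumptions, not the book's code. Each block of data is reduced separately, and the partial results are combined by addition, so the answer does not depend on how the data are partitioned.

```python
def partial_stats(block):
    """Reduce one block of data to (count, sum, sum of squares)."""
    return (len(block), sum(block), sum(x * x for x in block))

def merge(a, b):
    """Combine two partial results by component-wise addition."""
    return tuple(u + v for u, v in zip(a, b))

def mean_and_variance(stats):
    """Recover the mean and the population variance from merged statistics."""
    n, s, ss = stats
    mean = s / n
    return mean, ss / n - mean ** 2

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
whole = partial_stats(data)
merged = merge(partial_stats(data[:2]), partial_stats(data[2:]))
mean, variance = mean_and_variance(merged)
```

The mean itself is not combined directly here; it is derived at the end from statistics that can be merged by addition.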

Chapter 4. Hadoop and MapReduce

Abstract
In this chapter we consider situations in which a single host computer is inadequate because the data volume or processing demand exceeds the capacity of the host. A popular solution distributes the data and computations across a network of computers or a short-lived network created for the task (a cluster). In this scenario, each cluster node (a computing unit) stores and processes a subset of the data. The results are merged as one when all nodes have completed their tasks. For this solution to succeed, the computational algorithm must conform to a certain structure and the cluster execution must be managed. The Hadoop environment and the MapReduce programming design provide the management and algorithmic structure. Hadoop is a collection of software and services that builds the cluster, distributes the data across the cluster, and controls the data processing algorithms and the transmission of results. The MapReduce programming design ensures scalability, and scalability ensures that the results are independent of the cluster configuration. The reader is guided through an introductory application of Hadoop and MapReduce after a discussion of the essential components.
Brian Steele, John Chandler, Swarna Reddy
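
The MapReduce programming design mentioned in the abstract above can be imitated on a single machine. The sketch below is plain Python, not Hadoop code: the `mapper`, `shuffle`, and `reducer` names and the word-count task are assumptions chosen for illustration, and no Hadoop API is used.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Emit a (word, 1) pair for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce all values for one key to a single count."""
    return key, sum(values)

def map_reduce(lines):
    pairs = chain.from_iterable(mapper(line) for line in lines)
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())

# Hypothetical input; on a cluster each node would hold a subset of the lines.
lines = ["data maps to data", "maps reduce data"]
counts = map_reduce(lines)
```

Because the reducer only sums, the counts are the same no matter how the lines are split among mappers.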

Chapter 5. Data Visualization

Abstract
A visual is successful when the information encoded in the data is efficiently transmitted to an audience. Data visualization is the discipline dedicated to the principles and methods of translating data to visual form. In this chapter we discuss the principles that produce successful visualizations. The second section illustrates the principles through examples of best and worst practices. In the final section, we navigate through the construction of our best-example graphics.
Brian Steele, John Chandler, Swarna Reddy

Chapter 6. Linear Regression Methods

Abstract
Linear regression is a broad and well-developed area of statistics. If there is a core to statistical methodology, then linear regression is it. The ubiquity of linear regression methods in statistics and data analytics stems from the ease with which one may fit tractable models that describe the primary features of a process or population. Not only is linear regression useful for description, it’s also very useful for prediction since the models often provide good approximations of complex relationships. In the field of statistics, hypothesis testing and confidence intervals are routinely used in linear regression analyses. The extension of these methods to data science is often unsuccessful because of the prevalence of opportunistically collected data. Most of the time, opportunistically collected data cannot support inferential methods because the quality of the inferences produced by the methods is unknown. We discuss inference herein so that the reader may understand the potential for success and for failure of these methods. However, the focus is on the essential and most useful aspects of the subject matter for data analytics—the fitted models. The topic of linear regression provides an avenue to gain experience with the statistical package R, one of the most popular software packages used by data scientists.
Brian Steele, John Chandler, Swarna Reddy
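
As a small illustration of a fitted model, the sketch below computes the least squares line for one predictor using the closed-form estimates. The chapter works in R; Python is used here only to keep the examples on this page in one language, and the function name and toy data are hypothetical.

```python
def fit_line(x, y):
    """Least squares estimates (intercept, slope) for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Toy data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(x, y)
```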

Chapter 7. Healthcare Analytics

Abstract
Healthcare analytics refers to data analytic methods applied in the healthcare domain. Healthcare analytics is becoming a prominent data science domain because of the societal and economic burden of disease and the opportunities to better understand the healthcare system through the analysis of data. This chapter introduces the reader to the domain through the analysis of diabetes prevalence and incidence. The data are drawn from the Centers for Disease Control and Prevention’s Behavioral Risk Factor Surveillance System.
Brian Steele, John Chandler, Swarna Reddy

Chapter 8. Cluster Analysis

Abstract
Sometimes it’s possible to divide a collection of observations into distinct subgroups based on nothing more than the observation attributes. If this can be done, then understanding the population or process generating the observations becomes easier. The intent of cluster analysis is to carry out a division of a data set into clusters of observations that are more alike within cluster than between clusters. Clusters are formed either by aggregating observations or dividing a single glob of observations into a collection of smaller sets. The process of cluster formation involves two varieties of algorithms. The first shuffles observations between a fixed number of clusters to maximize within-cluster similarity. The second process begins with singleton clusters and recursively merges the clusters. Alternatively, we may begin with one cluster and recursively split off new clusters. In this chapter, we discuss two popular cluster analysis algorithms (and representatives of the two varieties of algorithms): the k-means algorithm and hierarchical agglomerative clustering.
Brian Steele, John Chandler, Swarna Reddy
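
The first variety of algorithm described above, which moves observations among a fixed number of clusters, can be sketched as a basic k-means iteration. This is a minimal version under stated assumptions: the initial centers are supplied by the caller, a fixed number of iterations replaces a convergence test, and all names and data are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Place each point in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        clusters[nearest].append(p)
    return clusters

def mean_point(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def k_means(points, centers, iterations=10):
    for _ in range(iterations):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        centers = [mean_point(c) if c else centers[j] for j, c in enumerate(clusters)]
    return centers, assign(points, centers)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
centers, clusters = k_means(points, centers=[(0.0, 0.0), (5.0, 5.0)])
```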

Chapter 9. k-Nearest Neighbor Prediction Functions

Abstract
The purpose of the k-nearest neighbor prediction function is to predict a target variable from a predictor vector. Commonly, the target is a categorical variable, a label identifying the group from which the observation was drawn. The analyst has no knowledge of the membership label but does have the information coded in the attributes of the predictor vector. The predictor vector and the k-nearest neighbor prediction function generate a prediction of membership. In addition to qualitative targets, the k-nearest neighbor prediction function may be used to predict quantitative target variables. The k-nearest neighbor prediction functions are conceptually and computationally simple and often rival far more sophisticated prediction functions with respect to accuracy. The functions are nonparametric in the sense that the mathematical basis supporting the prediction functions is not a model. Instead, the k-nearest neighbor prediction function utilizes a set of training observations on target and predictor vector pairs and, in essence, examines the target values of the training observations whose predictor vectors are nearest to the predictor vector of the observation being predicted. If the target variable is a group membership label, the target is predicted to be the most common label among the nearest neighbors. If the target is quantitative, then the prediction is an average of the target values associated with the nearest neighbors.
Brian Steele, John Chandler, Swarna Reddy
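
The prediction rule described in the abstract above can be sketched in a few lines of Python for a categorical target. The use of Euclidean distance, the function names, and the toy training set are assumptions made for illustration.

```python
from collections import Counter

def sq_dist(p, q):
    """Squared Euclidean distance between two predictor vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_predict(train, x0, k):
    """Predict the label of x0 as the most common label among its k nearest
    training observations; train is a list of (predictor_vector, label)."""
    by_distance = sorted(train, key=lambda pair: sq_dist(pair[0], x0))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical training observations with two groups.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((5, 5), "b"), ((6, 5), "b")]
near_a = knn_predict(train, (0.5, 0.5), k=3)
near_b = knn_predict(train, (5, 6), k=3)
```

For a quantitative target, the final step would average the neighbors' target values instead of taking the most common label.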

Chapter 10. The Multinomial Naïve Bayes Prediction Function

Abstract
The naïve Bayes prediction function is a computationally and conceptually simple algorithm. While the performance of the algorithm generally is not best among competitors when the predictor variables are quantitative, it does well with categorical predictor variables, and it’s especially well-suited for categorical predictor variables with many categories. In this chapter we develop the multinomial naïve Bayes prediction function, the incarnation of naïve Bayes for categorical predictors. We develop the function from its mathematical foundation before applying it to two very different problems: predicting the authorship of the Federalist Papers and a problem from the business marketing domain—classifying shoppers based on their grocery store purchases. The Federalist Papers application provides the opportunity to work with textual data.
Brian Steele, John Chandler, Swarna Reddy
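
A compact sketch of a multinomial naïve Bayes prediction function for word counts follows. It assumes additive (Laplace) smoothing with a parameter `alpha`, which the abstract does not specify, and the function names, labels, and toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents):
    """documents: list of (list_of_words, label). Returns the count tables."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in documents:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def predict_nb(model, words, alpha=1.0):
    """Return the label with the largest log posterior (up to a constant)."""
    label_counts, word_counts, vocabulary = model
    n_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        total = sum(word_counts[label].values()) + alpha * len(vocabulary)
        score = math.log(n / n_docs)  # log prior
        for w in words:
            # Smoothed log probability of word w given the label.
            score += math.log((word_counts[label][w] + alpha) / total)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy documents, not drawn from any real corpus.
documents = [
    (["upon", "the", "state"], "author_a"),
    (["whilst", "the", "state"], "author_b"),
    (["upon", "upon", "union"], "author_a"),
]
model = train_nb(documents)
first = predict_nb(model, ["upon", "state"])
second = predict_nb(model, ["whilst"])
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow when documents are long.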

Chapter 11. Forecasting

Abstract
This chapter provides an introduction to time series and to foundational algorithms for forecasting. We adopt a pragmatic, first-order approach aimed at capturing the dominant attributes of the time series useful for prediction. Two forecasting methods are developed: Holt-Winters exponential forecasting and linear regression with time-varying coefficients. The first two tutorials, using complaints received by the U.S. Consumer Financial Protection Bureau, instruct the reader on processing data with time attributes and computing autocorrelation coefficients. The following tutorials guide the reader through forecasting using economic and stock price series.
Brian Steele, John Chandler, Swarna Reddy
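
Two of the computations named in the abstract above can be sketched briefly. The first function computes a lag-k sample autocorrelation coefficient using one common definition. The second is simple exponential smoothing, which tracks only the level of a series; it is a simpler relative of Holt-Winters forecasting, not the full method with trend and seasonal terms. Function names and data are hypothetical.

```python
def autocorrelation(series, lag):
    """Lag-k sample autocorrelation: lagged cross-products of deviations
    from the mean, divided by the sum of squared deviations."""
    n = len(series)
    mean = sum(series) / n
    dev = [y - mean for y in series]
    numerator = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    return numerator / sum(d * d for d in dev)

def exponential_smoothing(series, alpha):
    """Return the final smoothed level, usable as a one-step-ahead forecast.
    Assumes the level is initialized at the first observation."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

r1 = autocorrelation([1.0, 2.0, 3.0, 4.0, 5.0], lag=1)
forecast = exponential_smoothing([10.0, 20.0, 30.0], alpha=0.5)
```

Values of `alpha` near 1 weight recent observations heavily, while values near 0 produce a smoother, slower-moving level.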

Chapter 12. Real-time Analytics

Abstract
Streaming data are data transmitted by a source to a host computer immediately after being produced. The intent of real-time data analytics is to analyze the data as they are received and at a rate sufficiently fast to keep up with the stream. Analyses of streaming data center on characterizing the current level of the data-generating process, forecasting future values, and determining whether the process is undergoing unexpected change. In this chapter, we focus on computational aspects of real-time analytics; the methods were discussed in Chap. 11. The tutorials guide the reader through the analysis of data streams originating from two public and very different sources: tweets originating from the Twitter API and stock quotations originating from the NASDAQ.
Brian Steele, John Chandler, Swarna Reddy
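
Keeping up with a stream requires updates that touch each value once and store only a few numbers. The sketch below maintains a running mean and an exponentially weighted level; the class name, the `alpha` parameter, and the stand-in list of values are assumptions for illustration, and no Twitter or NASDAQ interface is used.

```python
class StreamSummary:
    """Constant-memory summary of a stream of numeric values."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # weight given to the newest value
        self.n = 0          # number of values seen so far
        self.mean = 0.0     # running mean of all values
        self.level = None   # exponentially weighted estimate of the current level

    def update(self, x):
        self.n += 1
        # Incremental mean update: no past values need to be stored.
        self.mean += (x - self.mean) / self.n
        if self.level is None:
            self.level = x
        else:
            self.level = self.alpha * x + (1 - self.alpha) * self.level
        return self.level

summary = StreamSummary(alpha=0.5)
for value in [2.0, 4.0, 6.0]:  # stand-in for values arriving from a live stream
    summary.update(value)
```

Comparing a new value with the current level is one simple way to flag a possible change in the process.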