
This accessible and classroom-tested textbook/reference presents an introduction to the fundamentals of the emerging and interdisciplinary field of data science. The coverage spans key concepts adopted from statistics and machine learning, useful techniques for graph analysis and parallel programming, and the practical application of data science for such tasks as building recommender systems or performing sentiment analysis. Topics and features: provides numerous practical case studies using real-world data throughout the book; supports understanding through hands-on experience of solving data science problems using Python; describes techniques and tools for statistical analysis, machine learning, graph analysis, and parallel programming; reviews a range of applications of data science, including recommender systems and sentiment analysis of text data; provides supplementary code resources and data at an associated website.

Chapter 1. Introduction to Data Science

You have, no doubt, already experienced data science in several forms. When you are looking for information on the web by using a search engine or asking your mobile phone for directions, you are interacting with data science products. Data science has been behind resolving some of our most common daily tasks for several years.
Laura Igual, Santi Seguí

Chapter 2. Toolboxes for Data Scientists

In this chapter, first we introduce some of the cornerstone tools that data scientists use and then we offer a basic overview of the Python language and its data science ecosystem. Examples of the most common data structures and basic functions that data scientists perform are explained to provide the basis for a better understanding of later chapters.
Laura Igual, Santi Seguí

Chapter 3. Descriptive Statistics

In this chapter, we will become familiar with descriptive statistics, which comprises the concepts, terms, measures, and tools that help to describe, show, and summarize data in a meaningful way. When analyzing data, both descriptive and inferential statistics can be used to examine the results and draw conclusions. We will discuss basic concepts, terms, and procedures, such as mean, median, variance, and correlation, to explore, describe, and summarize a given set of data.
Laura Igual, Santi Seguí
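As an illustration of the measures named in the Chapter 3 summary, here is a minimal sketch using only the Python standard library. The data (`hours`, `scores`) and the helper `pearson` are hypothetical and are not taken from the book.

```python
import math
import statistics

# Hypothetical toy data, invented for this illustration.
hours = [2.0, 3.0, 5.0, 7.0, 8.0]
scores = [50.0, 55.0, 65.0, 80.0, 85.0]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


summary = {
    "mean": statistics.mean(hours),
    "median": statistics.median(hours),
    "variance": statistics.variance(hours),  # sample variance (divides by n - 1)
    "correlation": pearson(hours, scores),
}
print(summary)
```

For this toy sample the mean and median of `hours` are both 5.0, the sample variance is 6.5, and the correlation with `scores` is close to 1.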

Chapter 4. Statistical Inference

In the previous chapter, we saw how to describe a sample in order to produce potentially interesting hypotheses about its population. Some of those descriptions are based on graphical representations that are easily interpreted by humans, while others are based on parameters that summarize important properties of the sample distribution. In this chapter, we will see how to infer predictions about a population. To this end, we will explore the relationship between sample parameters and population parameters, and we will propose some methods, both theoretical and computational, to assess the quality of parameter estimates from a sample.
Laura Igual, Santi Seguí
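One way to picture the contrast between theoretical and computational methods mentioned in the Chapter 4 summary is to estimate the standard error of a sample mean in two ways. The sketch below assumes the bootstrap as the computational method; the summary does not name a specific technique, so this choice, the synthetic sample, and the function name `bootstrap_se_of_mean` are all assumptions.

```python
import math
import random
import statistics


def bootstrap_se_of_mean(sample, n_resamples=2000, seed=0):
    """Estimate the standard error of the mean by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    means = [
        statistics.fmean(rng.choices(sample, k=n))  # one bootstrap resample
        for _ in range(n_resamples)
    ]
    return statistics.stdev(means)


# Synthetic sample (assumption): 200 draws from a normal distribution.
data_rng = random.Random(42)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(200)]

se_boot = bootstrap_se_of_mean(sample)                           # computational
se_theory = statistics.stdev(sample) / math.sqrt(len(sample))    # theoretical
print(f"bootstrap SE: {se_boot:.3f}, theoretical SE: {se_theory:.3f}")
```

The two estimates should be close to each other for a sample of this size.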

Chapter 5. Supervised Learning

In this chapter, we introduce the basics of classification, a type of supervised machine learning. We also give a brief practical tour of learning theory and of good practices for the successful use of classifiers in a real case using Python. The chapter starts by introducing the classic machine learning pipeline, defining features, and evaluating the performance of a classifier. After that, we introduce the notion of generalization error, which allows us to show learning curves in terms of the number of examples and the complexity of the classifier, and also to define the notion of overfitting. That notion will then allow us to develop a strategy for model selection. Finally, two of the best-known techniques in machine learning are introduced: support vector machines and random forests. These are then applied to the proposed problem of predicting those loans that will not be successfully covered once they have been accepted.
Laura Igual, Santi Seguí

Chapter 6. Regression Analysis

In this chapter, we introduce regression analysis and some of its applications in data science using Python tools. We show how regression analysis allows us to understand the behavior of data better, to predict data values (continuous or discrete), and to find important variables by building a model from the data. We present four different regression models: simple linear regression, multiple linear regression, polynomial regression, and logistic regression. We also emphasize the properties of sparse models in the selection of variables. We use different Python toolboxes to build and apply regression models with ease. Specific visualization tools from Seaborn allow qualitative evaluation, while tools from the Scikit-learn library make quantitative evaluation easier by computing several validation measures. Depending on our aim (visual inspection of the data, statistical analysis, or prediction), we choose one tool or another. Regression models are motivated by three real problems that deal with the following questions. Is the climate really changing? Can we predict the price of a new market, given any of its attributes? How many goals make a football team the winner or the loser?
Laura Igual, Santi Seguí
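The simplest of the four models listed in the Chapter 6 summary, simple linear regression, can be sketched with ordinary least squares in plain Python. This is a stand-in for illustration only: the chapter itself uses Seaborn and Scikit-learn, and the data and the function name `fit_line` below are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict y for an unseen x = 5
print(slope, intercept, prediction)
```

Because the toy points are noise-free, the fit recovers a slope of 2 and an intercept of 1; with real data the residuals would need to be inspected as well.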

Chapter 7. Unsupervised Learning

In this chapter, we address the problem of analyzing a set of inputs/data without labels with the goal of finding “interesting patterns” or structures in the data. This type of problem is sometimes called a knowledge discovery problem. Compared to other machine learning problems such as supervised learning, this is a much more open problem, since in general there is no well-defined metric to use, nor is there any specific kind of pattern that we wish to look for. Within unsupervised machine learning, the most common type of problem is clustering, although other problems such as novelty detection, dimensionality reduction, and outlier detection are also part of this area. We will therefore discuss different clustering methods, compare their advantages and disadvantages, and discuss measures for evaluating their quality. The chapter finishes with a case study using a real data set that analyzes the expenditure of different countries on education.
Laura Igual, Santi Seguí

Chapter 8. Network Analysis

Network data are currently generated and collected to an increasing extent from different fields. In this chapter, we show how network data analysis allows us to gain insight into the data that would be hard to acquire by other means. We introduce some tools in network analysis and visualization. We present important concepts such as connected components, centrality measures, and ego-networks, as well as community detection. We use a Python toolbox (NetworkX) to build graphs easily and analyze them. We motivate concepts in network analysis by a real problem dealing with a Facebook network dataset and answering a set of questions. For instance: Which is the most representative member of the network in terms of the most “connected”, the most “circulated”, the “closest” or the most “accessible” to the rest of the members?
Laura Igual, Santi Seguí
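Two of the concepts named in the Chapter 8 summary, connected components and the "most connected" member (degree centrality), can be sketched without any third-party package. The chapter uses NetworkX; the code below is a pure-Python stand-in, and the edge list and function names are hypothetical.

```python
from collections import deque

# Hypothetical undirected friendship edges, invented for this illustration.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("e", "f")]


def build_adjacency(edge_list):
    """Map each node to the set of its neighbours."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def connected_components(adj):
    """Return a list of node sets, one per connected component (BFS)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    return {node: len(neighbours) / (n - 1) for node, neighbours in adj.items()}


adjacency = build_adjacency(edges)
components = connected_components(adjacency)
centrality = degree_centrality(adjacency)
most_connected = max(centrality, key=centrality.get)
print(components, most_connected)
```

In this toy graph, node `c` has the highest degree centrality, and the graph splits into two components.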

Chapter 9. Recommender Systems

In this chapter, we will see what recommender systems are, how they work, and how they can be implemented.
Laura Igual, Santi Seguí

Chapter 10. Statistical Natural Language Processing for Sentiment Analysis

In this chapter, we will perform sentiment analysis from text data. Generally, sentiment analysis is performed based on the processing of natural language, the analysis of text and computational linguistics. Although data can come from different data sources, in this chapter we will analyze sentiment in text data, using two particular text data examples: one from film critics, where the text is highly structured and maintains text semantics; and another example coming from social networks, where the text can show a lack of structure and users may use text abbreviations. We will review basic mechanisms required to perform sentiment analysis, including data cleaning, producing a general representation of the text, and performing some statistical inference on the text represented to determine positive and negative sentiments.
Laura Igual, Santi Seguí

Chapter 11. Parallel Computing

In this chapter, we will introduce the parallel capabilities of IPython, which, by applying a set of techniques, can reduce execution time drastically. As a non-computational example, if one painter takes T units of time to paint a house, N painters can reduce the total time to T/N units. As will be shown, there are two ways of scaling the computational units: multicore or distributed computing. IPython hides the differences between them from the programmer; the same commands can be used in both. The ways of sending tasks to computing units will be introduced with the direct and balanced interfaces. Finally, an example with a database made up of millions of entries will show the advantages of parallelism.
Laura Igual, Santi Seguí
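The painter analogy in the Chapter 11 summary amounts to splitting one job into N nearly equal parts, handing each part to a worker, and combining the results. The sketch below shows that pattern with the standard library's `concurrent.futures` rather than IPython's parallel interfaces, which is a substitution made for this illustration; the function names are hypothetical. Note that threads in CPython do not actually deliver a T/N speed-up for CPU-bound work, so the code illustrates the division of labour, not the timing.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, n_workers):
    """Split items into n_workers contiguous chunks of nearly equal size."""
    size, extra = divmod(len(items), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The task each worker performs on its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, n_workers=4):
    """Distribute chunks to a pool of workers and combine the partial sums."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(sum_of_squares, split(items, n_workers)))


data = list(range(1000))
total = parallel_sum_of_squares(data, n_workers=4)
print(total)
```

The result is the same as the sequential sum; only the way the work is divided changes.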