Practical Statistics for Data Scientists
Book description
Statistical methods are a key part of data science, yet very few data scientists have any formal statistics training. Courses and books on basic statistics rarely cover the topic from a data science perspective. This practical guide explains how to apply various statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what's important and what's not.
Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R programming language, and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.
With this book, you’ll learn:
- Why exploratory data analysis is a key preliminary step in data science
- How random sampling can reduce bias and yield a higher quality dataset, even with big data
- How the principles of experimental design yield definitive answers to questions
- How to use regression to estimate outcomes and detect anomalies
- Key classification techniques for predicting which categories a record belongs to
- Statistical machine learning methods that “learn” from data
- Unsupervised learning methods for extracting meaning from unlabeled data
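As a flavor of the book's approach (the examples in this edition are in R), here is a minimal, hypothetical base-R sketch of the sampling point in the list above: a modest random sample can estimate a population mean more faithfully than a much larger but biased sample. The simulated income-like data and the biased selection rule are assumptions for illustration, not an example taken from the book.

```r
# Simulated population (assumption for illustration): skewed, income-like values
set.seed(42)
population <- rexp(1e6, rate = 1 / 50000)

# A small simple random sample of 1,000 records
random_sample <- sample(population, size = 1000)

# A much larger but biased "convenience" sample that over-represents high values
biased_sample <- population[population > median(population)]

cat("Population mean:    ", round(mean(population)), "\n")
cat("Random sample mean: ", round(mean(random_sample)), "\n")
cat("Biased sample mean: ", round(mean(biased_sample)), "\n")
```

Despite holding half a million records, the biased sample badly overestimates the mean, while the 1,000-record random sample lands close to it.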
Table of contents
- Preface
- Conventions Used in This Book
- Using Code Examples
- Safari® Books Online
- How to Contact Us
- Acknowledgments
- Elements of Structured Data
- Further Reading
- Data Frames and Indexes
- Nonrectangular Data Structures
- Further Reading
- Mean
- Median and Robust Estimates
- Example: Location Estimates of Population and Murder Rates
- Further Reading
- Standard Deviation and Related Estimates
- Estimates Based on Percentiles
- Example: Variability Estimates of State Population
- Further Reading
- Percentiles and Boxplots
- Frequency Table and Histograms
- Density Estimates
- Further Reading
- Mode
- Expected Value
- Further Reading
- Scatterplots
- Further Reading
- Hexagonal Binning and Contours (Plotting Numeric versus Numeric Data)
- Two Categorical Variables
- Categorical and Numeric Data
- Visualizing Multiple Variables
- Further Reading
- Random Sampling and Sample Bias
- Bias
- Random Selection
- Size versus Quality: When Does Size Matter?
- Sample Mean versus Population Mean
- Further Reading
- Regression to the Mean
- Further Reading
- Central Limit Theorem
- Standard Error
- Further Reading
- Resampling versus Bootstrapping
- Further Reading
- Further Reading
- Standard Normal and QQ-Plots
- Further Reading
- Further Reading
- Further Reading
- Poisson Distributions
- Exponential Distribution
- Estimating the Failure Rate
- Weibull Distribution
- Further Reading
- A/B Testing
- Why Have a Control Group?
- Why Just A/B? Why Not C, D…?
- For Further Reading
- The Null Hypothesis
- Alternative Hypothesis
- One-Way, Two-Way Hypothesis Test
- Further Reading
- Permutation Test
- Example: Web Stickiness
- Exhaustive and Bootstrap Permutation Test
- Permutation Tests: The Bottom Line for Data Science
- For Further Reading
- P-Value
- Alpha
- Type 1 and Type 2 Errors
- Data Science and P-Values
- Further Reading
- Further Reading
- Further Reading
- Further Reading
- F-Statistic
- Two-Way ANOVA
- Further Reading
- Chi-Square Test: A Resampling Approach
- Chi-Square Test: Statistical Theory
- Fisher’s Exact Test
- Relevance for Data Science
- Further Reading
- Further Reading
- Sample Size
- Further Reading
- Simple Linear Regression
- The Regression Equation
- Fitted Values and Residuals
- Least Squares
- Prediction versus Explanation (Profiling)
- Further Reading
- Example: King County Housing Data
- Assessing the Model
- Cross-Validation
- Model Selection and Stepwise Regression
- Weighted Regression
- Further Reading
- The Dangers of Extrapolation
- Confidence and Prediction Intervals
- Dummy Variables Representation
- Factor Variables with Many Levels
- Ordered Factor Variables
- Correlated Predictors
- Multicollinearity
- Confounding Variables
- Interactions and Main Effects
- Outliers
- Influential Values
- Heteroskedasticity, Non-Normality and Correlated Errors
- Partial Residual Plots and Nonlinearity
- Polynomial
- Splines
- Generalized Additive Models
- Further Reading
- Naive Bayes
- Why Exact Bayesian Classification Is Impractical
- The Naive Solution
- Numeric Predictor Variables
- Further Reading
- Covariance Matrix
- Fisher’s Linear Discriminant
- A Simple Example
- Further Reading
- Logistic Response Function and Logit
- Logistic Regression and the GLM
- Generalized Linear Models
- Predicted Values from Logistic Regression
- Interpreting the Coefficients and Odds Ratios
- Linear and Logistic Regression: Similarities and Differences
- Assessing the Model
- Further Reading
- Confusion Matrix
- The Rare Class Problem
- Precision, Recall, and Specificity
- ROC Curve
- AUC
- Lift
- Further Reading
- Undersampling
- Oversampling and Up/Down Weighting
- Data Generation
- Cost-Based Classification
- Exploring the Predictions
- Further Reading
- K-Nearest Neighbors
- A Small Example: Predicting Loan Default
- Distance Metrics
- One Hot Encoder
- Standardization (Normalization, Z-Scores)
- Choosing K
- KNN as a Feature Engine
- A Simple Example
- The Recursive Partitioning Algorithm
- Measuring Homogeneity or Impurity
- Stopping the Tree from Growing
- Predicting a Continuous Value
- How Trees Are Used
- Further Reading
- Bagging
- Random Forest
- Variable Importance
- Hyperparameters
- The Boosting Algorithm
- XGBoost
- Regularization: Avoiding Overfitting
- Hyperparameters and Cross-Validation
- Principal Components Analysis
- A Simple Example
- Computing the Principal Components
- Interpreting Principal Components
- Further Reading
- A Simple Example
- K-Means Algorithm
- Interpreting the Clusters
- Selecting the Number of Clusters
- A Simple Example
- The Dendrogram
- The Agglomerative Algorithm
- Measures of Dissimilarity
- Multivariate Normal Distribution
- Mixtures of Normals
- Selecting the Number of Clusters
- Further Reading
- Scaling the Variables
- Dominant Variables
- Categorical Data and Gower’s Distance
- Problems with Clustering Mixed Data
Product information
- Title: Practical Statistics for Data Scientists
- Author(s): Peter Bruce, Andrew Bruce
- Release date: May 2017
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781491952962