Practical Statistics for Data Scientists
Book description
Statistical methods are a key part of data science, yet very few data scientists have any formal statistics training. Courses and books on basic statistics rarely cover the topic from a data science perspective. This practical guide explains how to apply various statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what's important and what's not.
Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R programming language, and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.
With this book, you’ll learn:
- Why exploratory data analysis is a key preliminary step in data science
- How random sampling can reduce bias and yield a higher quality dataset, even with big data
- How the principles of experimental design yield definitive answers to questions
- How to use regression to estimate outcomes and detect anomalies
- Key classification techniques for predicting which categories a record belongs to
- Statistical machine learning methods that “learn” from data
- Unsupervised learning methods for extracting meaning from unlabeled data
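As a flavor of the book's approach (the examples in this edition are in R), here is a minimal, hypothetical base-R sketch of the sampling point in the list above: a modest random sample can estimate a population mean more faithfully than a much larger but biased sample. The simulated income-like data and the biased selection rule are assumptions for illustration, not an example taken from the book.

```r
# Simulated population (assumption for illustration): skewed, income-like values
set.seed(42)
population <- rexp(1e6, rate = 1 / 50000)

# A small simple random sample of 1,000 records
random_sample <- sample(population, size = 1000)

# A much larger but biased "convenience" sample that over-represents high values
biased_sample <- population[population > median(population)]

cat("Population mean:    ", round(mean(population)), "\n")
cat("Random sample mean: ", round(mean(random_sample)), "\n")
cat("Biased sample mean: ", round(mean(biased_sample)), "\n")
```

Despite holding half a million records, the biased sample badly overestimates the mean, while the 1,000-record random sample lands close to it.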
Table of contents
- Preface
- Conventions Used in This Book
- Using Code Examples
- Safari® Books Online
- How to Contact Us
- Acknowledgments
- Elements of Structured Data
- Further Reading
- Data Frames and Indexes
- Nonrectangular Data Structures
- Further Reading
- Mean
- Median and Robust Estimates
- Example: Location Estimates of Population and Murder Rates
- Further Reading
- Standard Deviation and Related Estimates
- Estimates Based on Percentiles
- Example: Variability Estimates of State Population
- Further Reading
- Percentiles and Boxplots
- Frequency Table and Histograms
- Density Estimates
- Further Reading
- Mode
- Expected Value
- Further Reading
- Scatterplots
- Further Reading
- Hexagonal Binning and Contours (Plotting Numeric versus Numeric Data)
- Two Categorical Variables
- Categorical and Numeric Data
- Visualizing Multiple Variables
- Further Reading
- Random Sampling and Sample Bias
- Bias
- Random Selection
- Size versus Quality: When Does Size Matter?
- Sample Mean versus Population Mean
- Further Reading
- Regression to the Mean
- Further Reading
- Central Limit Theorem
- Standard Error
- Further Reading
- Resampling versus Bootstrapping
- Further Reading
- Further Reading
- Standard Normal and QQ-Plots
- Further Reading
- Further Reading
- Further Reading
- Poisson Distributions
- Exponential Distribution
- Estimating the Failure Rate
- Weibull Distribution
- Further Reading
- A/B Testing
- Why Have a Control Group?
- Why Just A/B? Why Not C, D…?
- For Further Reading
- The Null Hypothesis
- Alternative Hypothesis
- One-Way, Two-Way Hypothesis Test
- Further Reading
- Permutation Test
- Example: Web Stickiness
- Exhaustive and Bootstrap Permutation Test
- Permutation Tests: The Bottom Line for Data Science
- For Further Reading
- P-Value
- Alpha
- Type 1 and Type 2 Errors
- Data Science and P-Values
- Further Reading
- Further Reading
- Further Reading
- Further Reading
- F-Statistic
- Two-Way ANOVA
- Further Reading
- Chi-Square Test: A Resampling Approach
- Chi-Square Test: Statistical Theory
- Fisher’s Exact Test
- Relevance for Data Science
- Further Reading
- Further Reading
- Sample Size
- Further Reading
- Simple Linear Regression
- The Regression Equation
- Fitted Values and Residuals
- Least Squares
- Prediction versus Explanation (Profiling)
- Further Reading
- Example: King County Housing Data
- Assessing the Model
- Cross-Validation
- Model Selection and Stepwise Regression
- Weighted Regression
- Further Reading
- The Dangers of Extrapolation
- Confidence and Prediction Intervals
- Dummy Variables Representation
- Factor Variables with Many Levels
- Ordered Factor Variables
- Correlated Predictors
- Multicollinearity
- Confounding Variables
- Interactions and Main Effects
- Outliers
- Influential Values
- Heteroskedasticity, Non-Normality and Correlated Errors
- Partial Residual Plots and Nonlinearity
- Polynomial
- Splines
- Generalized Additive Models
- Further Reading
- Naive Bayes
- Why Exact Bayesian Classification Is Impractical
- The Naive Solution
- Numeric Predictor Variables
- Further Reading
- Covariance Matrix
- Fisher’s Linear Discriminant
- A Simple Example
- Further Reading
- Logistic Response Function and Logit
- Logistic Regression and the GLM
- Generalized Linear Models
- Predicted Values from Logistic Regression
- Interpreting the Coefficients and Odds Ratios
- Linear and Logistic Regression: Similarities and Differences
- Assessing the Model
- Further Reading
- Confusion Matrix
- The Rare Class Problem
- Precision, Recall, and Specificity
- ROC Curve
- AUC
- Lift
- Further Reading
- Undersampling
- Oversampling and Up/Down Weighting
- Data Generation
- Cost-Based Classification
- Exploring the Predictions
- Further Reading
- K-Nearest Neighbors
- A Small Example: Predicting Loan Default
- Distance Metrics
- One Hot Encoder
- Standardization (Normalization, Z-Scores)
- Choosing K
- KNN as a Feature Engine
- A Simple Example
- The Recursive Partitioning Algorithm
- Measuring Homogeneity or Impurity
- Stopping the Tree from Growing
- Predicting a Continuous Value
- How Trees Are Used
- Further Reading
- Bagging
- Random Forest
- Variable Importance
- Hyperparameters
- The Boosting Algorithm
- XGBoost
- Regularization: Avoiding Overfitting
- Hyperparameters and Cross-Validation
- Principal Components Analysis
- A Simple Example
- Computing the Principal Components
- Interpreting Principal Components
- Further Reading
- A Simple Example
- K-Means Algorithm
- Interpreting the Clusters
- Selecting the Number of Clusters
- A Simple Example
- The Dendrogram
- The Agglomerative Algorithm
- Measures of Dissimilarity
- Multivariate Normal Distribution
- Mixtures of Normals
- Selecting the Number of Clusters
- Further Reading
- Scaling the Variables
- Dominant Variables
- Categorical Data and Gower’s Distance
- Problems with Clustering Mixed Data
Product information
- Title: Practical Statistics for Data Scientists
- Author(s): Peter Bruce, Andrew Bruce
- Release date: May 2017
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781491952962