Today's recipe: let's take a look at a few common pitfalls of regular tabular ML and statistical models by understanding the difference between Causation and Correlation, and try it in action using the causalinference package in Python.

Recipe difficulty

  • Statistics: 🔥🔥 - the topic is hard, but this introduction will keep things as simple as possible
  • Technical: 🔥 - easy peasy
  • Time required: 🕜 - around 30 minutes max.

What it is

Whenever you .fit() a statistical model - in Python or any other language - you have to understand the gap between what you want to measure and what you're actually measuring.

For instance, let's say we're developing a model for a company selling ice-cream in supermarkets. We measure a lot of datapoints about customers, and discover an interesting relationship: it seems that customers carrying umbrellas tend not to buy our ice-cream, and buy hot chocolate instead, whereas customers wearing sunglasses tend to buy our product in large quantities.

This is correlation - what you actually measured in the real world. Obviously it's not what you wanted to measure: what you (or your boss) wanted was a predictive model for targeting potential ice-cream customers. And that's where the pitfall lies.

We can safely conclude, in this case, that there is no causal relationship between umbrellas, sunglasses and ice-cream. The relationship is rather weather -> attire and weather -> ice-cream. If you were to run to your boss with an advertising pitch to send everyone in town a free pair of sunglasses in order to drive ice-cream sales, you'd be regarded as crazy and sent back to your computer with nothing more than a good laugh.

Why is it useful?

Your boss does not laugh in the real world: instead, you get promoted. Indeed, virtually any machine learning model deployed in a production scenario nowadays is a correlation model, not a causation model. If you work in the sector and chuckled at the ice-cream example, stop right now: most of the models you've developed so far suffer from this exact same problem.

In many cases, a correlation model is perfectly fine. But when the data is limited and the number of unobserved counterfactuals is very high, basing any decision on a statistical model without careful thought can and will lead to catastrophic outcomes.

Let's take for instance this dataset about student outcomes in Spain. A very common problem in economics and social sciences is the allocation of short-term resources (such as funding and man-hours) to long-term projects (such as inclusion programs, incentives and citizen assistance) in the hope of obtaining a long-term return (economic boost, highly-educated workers, increase in productivity) that is larger than the invested resources.

In this particular case, this mouthful basically translates to the following research question: does it make sense to spend money paying for kids' assistance in order to make them study more?

sex romantic health goout famrel absences Dalc Walc famsup studytime
0 1 0 3 4 4 6 1 1 0 2
1 1 0 3 3 5 4 1 1 1 2
2 1 0 3 2 4 10 2 3 0 2
3 1 1 5 2 3 2 1 1 1 3
4 1 0 5 2 4 4 1 2 1 2

In an ideal world, we'd just take exactly half the kids, assign them to the assistance program, take the other half and give them no additional assistance at all; then compare the difference in study hours between the two cohorts pre- and post-treatment. This is called a randomized controlled trial and it's basically the single most important tool in modern science.

For our particular dataset, we'd be interested in determining the direction and significance of:

\[ E[Y_{\text{hours studied}}|D_{\text{assistance}}] - E[Y_{\text{hours studied}}|D_{\text{no assistance}}] \]
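Under true randomization, this quantity can be estimated with a plain difference in group means. A minimal sketch with made-up study-hour figures (all numbers hypothetical, not from the dataset):

```python
from statistics import mean

# hypothetical weekly study hours for the two randomized cohorts
treated = [3, 2, 4, 3, 2]   # received assistance
control = [2, 1, 2, 3, 1]   # no assistance

# naive ATE estimate: E[Y|assistance] - E[Y|no assistance]
ate_hat = mean(treated) - mean(control)
print(ate_hat)
```

With random assignment this simple estimator is unbiased; the rest of this recipe is about what breaks when assignment is not random.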

But in the real world, it's often unethical or unfeasible to do so - hence the kids were not randomly assigned to the treatment group (that is, the one with assistance) or the control group.

Very often, this type of effect is incredibly difficult to spot - and even more difficult to correct for. We are therefore able to identify correlation between co-variates and our target variable, but the two key assumptions of causality may be violated.

What are these two assumptions?

  1. Overlap: $ 0 < P(D_{\text{assistance}} \mid X_i = x) < 1 \text{ for any } x \in X $

In layman's terms, this means that the two cohorts have underlying similarity across all dimensions. For instance, if we only had one explanatory co-variate, overlap would mean the two groups' distributions of that co-variate share common support.

Having no overlap means the groups (treatment and control) would already be intrinsically different - and any conclusions we might draw about the effectiveness of the treatment are probably moot.
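As a toy illustration (hypothetical numbers), overlap with a single co-variate can be eyeballed by checking that the two groups share a common support:

```python
# hypothetical values of a single co-variate in each group
treated_x = [0, 2, 3, 5, 8]
control_x = [1, 2, 4, 6, 7]

# common support: the interval covered by both groups
lo = max(min(treated_x), min(control_x))
hi = min(max(treated_x), max(control_x))

print((lo, hi), lo < hi)   # non-empty interval => some overlap
```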

  2. Unconfoundedness: $ Y \perp D \mid X_i = x \text{ for any } x \in X $

This means the target variable Y is independent of the treatment once we condition on the co-variate set X. Concretely, we have to even out the effect of the co-variates X on the treatment outcome Y by correcting for them in the causal model: after conditioning on X, who receives the treatment should be as good as random.

In layman's terms, the co-variates X should not affect whether a unit receives the treatment or not.

In our specific case, we already know the second assumption is violated: the data does not come from a randomized trial, and therefore the outcome Y will be affected by the co-variates X's effect on D.

We can start by testing our first assumption using the fantastic causalinference python package:

from causalinference import CausalModel

# df is the student-outcomes DataFrame loaded earlier
Y_observed = df['studytime'].values
D_treat = df['famsup'].values
covariates = ['sex', 'romantic', 'health', 'goout', 'famrel', 'absences', 'Dalc', 'Walc']
K_covariates = df[covariates]

# CausalModel expects plain numpy arrays: outcome, treatment, co-variates
cm = CausalModel(Y_observed, D_treat, K_covariates.values)

What we are doing here is creating a CausalModel object, which requires three sets of variables:

  • Y, the observed outcome. It's what we want to measure.
  • D, the treatment. It's what we do to causally affect the outcome. In do-calculus terms, $ P(Y|\text{do}(D)) $
  • X, the set of co-variates or individual characteristics for each unit, which may affect either Y or D

In this case we know, given how the dataset was made, that the co-variates X are affecting D. If we have some overlap, however, we can correct for this problem.

To verify overlap between groups in a multivariate context, we use the normalized score:

\[ \text{normalized score} = \frac{\overline{X_{\text{treat}}}-\overline{X_{\text{control}}}}{\sqrt{(S^2_{\text{treat}}+S^2_{\text{control}})/2}} \]

There are no hard rules for the normalized score - in general, if its absolute value is above 0.5 (and certainly above 1) we'd have some doubts about having sufficient coverage for our causal correction. In this case, however, the normalized difference is well within ±0.5 for all co-variates:
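The normalized score above is easy to compute by hand; here is a minimal stdlib sketch (toy numbers, not the real dataset):

```python
from math import sqrt
from statistics import mean, variance

def normalized_score(treated, control):
    """Mean difference scaled by the pooled standard deviation."""
    pooled_var = (variance(treated) + variance(control)) / 2
    return (mean(treated) - mean(control)) / sqrt(pooled_var)

# two toy samples shifted by exactly one standard deviation
print(normalized_score([1, 2, 3], [0, 1, 2]))   # → 1.0
```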

Summary Statistics

                       Controls (N_c=125)         Treated (N_t=224)             
       Variable         Mean         S.d.         Mean         S.d.     Raw-diff
              Y        1.880        0.829        2.165        0.849        0.285

                       Controls (N_c=125)         Treated (N_t=224)             
       Variable         Mean         S.d.         Mean         S.d.     Nor-diff
             X0        0.416        0.495        0.585        0.494        0.341
             X1        0.328        0.471        0.321        0.468       -0.014
             X2        3.512        1.418        3.612        1.365        0.072
             X3        3.192        1.141        3.067        1.112       -0.111
             X4        4.016        0.871        3.929        0.875       -0.100
             X5        5.944        9.049        5.978        7.941        0.004
             X6        1.488        0.858        1.420        0.859       -0.080
             X7        2.416        1.381        2.174        1.260       -0.183

The overlap assumption is satisfied, and we can now move on to correcting for confounding - the unconfoundedness assumption. This is where a naive ML model would lead us to the wrong conclusion. Let's look at one of these naive models: a regular linear regression.

import statsmodels.api as sm

# regress study time on the co-variates plus the treatment indicator
X_ols = df[covariates + ['famsup']]
results = sm.OLS(Y_observed, sm.add_constant(X_ols)).fit()
print(results.summary())
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.137
Model:                            OLS   Adj. R-squared:                  0.114
Method:                 Least Squares   F-statistic:                     5.974
Date:                Sun, 10 Mar 2019   Prob (F-statistic):           9.02e-08
Time:                        16:05:12   Log-Likelihood:                -413.12
No. Observations:                 349   AIC:                             846.2
Df Residuals:                     339   BIC:                             884.8
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
const          1.9169      0.274      6.993      0.000       1.378       2.456
sex            0.3952      0.092      4.276      0.000       0.213       0.577
romantic       0.1305      0.094      1.386      0.167      -0.055       0.316
health        -0.0295      0.032     -0.926      0.355      -0.092       0.033
goout          0.0205      0.043      0.476      0.634      -0.064       0.105
famrel         0.0234      0.051      0.458      0.647      -0.077       0.124
absences      -0.0076      0.005     -1.431      0.153      -0.018       0.003
Dalc           0.0035      0.067      0.052      0.958      -0.128       0.135
Walc          -0.1076      0.048     -2.245      0.025      -0.202      -0.013
famsup         0.2014      0.091      2.206      0.028       0.022       0.381

It seems indeed like having family support has a positive impact on study time. The effect is significantly positive, and we'd reject the null hypothesis at the 5% level: giving family support seems to positively affect the student's engagement.

Note that we can do the same using Causalinference alone:

Treatment Effect Estimates: OLS

                     Est.       S.e.          z      P>|z|      [95% Conf. int.]
           ATE      0.201      0.088      2.289      0.022      0.029      0.374

Same effect and essentially the same p-value, as expected.

But is this actually correct?

As demonstrated by Rosenbaum and Rubin in 1983, we can check whether any particular co-variate influences the assignment of participants to the treatment or control group by calculating propensity scores.

A propensity score is simply the probability of being assigned to a specific group (treatment or control) D, given the vector of covariates X. Formally:

\[ \text{propensity score} = P(D_\text{treatment}|X) \]

The concept is surprisingly simple - this is really just a logistic regression of the treatment on the co-variates - indeed, we can calculate the results using a logistic regression model:

results = sm.Logit(D_treat, sm.add_constant(K_covariates)).fit()
print(results.summary())

or, more aptly, using Causalinference. The results are exactly the same:

Estimated Parameters of Propensity Score

                    Coef.       S.e.          z      P>|z|      [95% Conf. int.]
     Intercept      0.650      0.711      0.914      0.361     -0.744      2.044
            X0      0.685      0.242      2.831      0.005      0.211      1.160
            X1     -0.122      0.250     -0.491      0.624     -0.612      0.367
            X2      0.111      0.084      1.320      0.187     -0.054      0.275
            X3     -0.033      0.115     -0.291      0.771     -0.258      0.191
            X4     -0.131      0.137     -0.963      0.336     -0.399      0.136
            X5     -0.000      0.014     -0.004      0.996     -0.028      0.028
            X6      0.143      0.174      0.823      0.411     -0.197      0.483
            X7     -0.150      0.125     -1.199      0.230     -0.396      0.095

What this tells us is that variable X0 - sex - is driving assignment to the control/treatment group in a significant manner. Indeed, we can use the logistic (inverse-logit) link to calculate the probability of being assigned to the treatment group for each sample in our dataset. Let's take for instance the very first pupil in the dataset:

sex romantic health goout famrel absences Dalc Walc famsup
0 1 0 3 4 4 6 1 1 0
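Plugging this pupil's co-variates into the coefficients estimated above (rounded, as read off the table) gives their propensity score via the inverse-logit link:

```python
from math import exp

# propensity-model coefficients from the table above (intercept first, rounded)
coefs = [0.650, 0.685, -0.122, 0.111, -0.033, -0.131, -0.000, 0.143, -0.150]
# first pupil: sex, romantic, health, goout, famrel, absences, Dalc, Walc
pupil = [1, 0, 3, 4, 4, 6, 1, 1]

# linear predictor, then the inverse-logit (sigmoid) link
z = coefs[0] + sum(c * x for c, x in zip(coefs[1:], pupil))
propensity = 1 / (1 + exp(-z))

print(round(propensity, 3))   # → 0.732
```

So this pupil, given their co-variates, had roughly a 73% chance of ending up in the treated (family-support) group.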