
Introduction to Regression Discontinuity

Introduction

Imagine two groups of people who are very similar in every way except one: they fall on opposite sides of a specific cutoff value. This cutoff determines whether a person receives a treatment or not. By comparing the outcomes of these two groups, we can estimate the causal effect of the treatment for people near the cutoff.

For example, imagine a university that admits students based on their test scores. If the university only admits students who score above a certain cutoff, we can use Regression Discontinuity to estimate the causal effect of being admitted to the university on a student’s future success.

These notes are based on the Regression Discontinuity chapter of The Effect by Nick Huntington-Klein.

Regression Discontinuity

Regression Discontinuity is a study design that compares two groups of people who are very similar in every way except for which side of a specific cutoff value they fall on. The cutoff determines whether a person receives the treatment or not, and the treatment is usually a binary variable. The goal is to estimate the causal effect of the treatment by comparing the outcomes of the two groups.

There are three main components in Regression Discontinuity:

  • Running Variable or Forcing Variable $X$: The variable that determines whether you receive the treatment $D$ or not. For example, an admissions test score.
  • Cutoff $X_C$: The threshold value at which the treatment and control group are separated.
  • Bandwidth $[X_C - \delta, X_C + \delta]$: The interval where it’s reasonable to think that people on both sides of the cutoff are similar. Outside it, the two groups may differ in other variables that have nothing to do with the treatment.
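The three components above can be written down in a few lines. This is only a minimal sketch on made-up data: the test scores, the cutoff of 60 and the half-width `delta` of 5 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical admissions test scores (the running variable X)
scores = rng.uniform(0, 100, size=1_000)

cutoff = 60.0   # X_C: assumed admission threshold
delta = 5.0     # assumed half-width of the bandwidth

# Sharp design: the treatment D is fully determined by the cutoff
treated = (scores >= cutoff).astype(int)

# Center the running variable so that the cutoff sits at 0
centered = scores - cutoff

# Keep only observations inside [X_C - delta, X_C + delta]
in_bandwidth = np.abs(centered) <= delta

print("share treated:", treated.mean())
print("observations inside the bandwidth:", in_bandwidth.sum())
```

Centering the running variable is a convenience: with the cutoff at 0, the intercept of any model fitted on one side is directly the predicted outcome at the cutoff.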

Some Assumptions

  1. People close to the cutoff are “randomly assigned”.

    We need the relationship between the outcome and the running variable to be continuous at the cutoff. If, for example, there’s a positive relationship between the outcome and the running variable, we’d expect a higher outcome to the right of the cutoff than to the left. That’s totally fine, even though the two groups aren’t exactly comparable, as long as there would be no discontinuity in that positive relationship without the treatment.

  2. The running variable should not be manipulable by individuals to choose their own treatment.

  3. Those who determine the cutoff should not be able to make that choice based on knowledge of which individuals have which running variable values.

  4. The only thing that should change at the cutoff is the treatment. For example, if two treatments are provided for a certain cutoff score, it is not possible to separate the effect of each treatment. This is also problematic if another variable (a confounder) changes at the same cutoff.

  5. As a consequence of assumption 1, we need to be able to “continuously” zoom in on the cutoff interval until both sides of the sample are comparable. However, this is sometimes not possible if the running variable is binned or discretized to the point where the interval can no longer be reduced.

  6. The statistical power of this methodology might be seriously compromised if there are not enough samples in the neighborhood of the cutoff. Using a wider interval might introduce bias too, so there is always a trade-off.
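One informal way to probe assumption 2 is to compare how many observations fall just below and just above the cutoff: if individuals can push themselves over the threshold, the count just above tends to be inflated. The sketch below uses simulated data, and the manipulation rule (half of those just below the cutoff move above it) is purely an assumption for illustration. It is a crude check, not a formal density test.

```python
import numpy as np

def counts_around_cutoff(x, cutoff, width):
    """Count observations just below and just above the cutoff."""
    below = np.sum((x >= cutoff - width) & (x < cutoff))
    above = np.sum((x >= cutoff) & (x < cutoff + width))
    return int(below), int(above)

rng = np.random.default_rng(1)
cutoff, width = 60.0, 2.0

# Running variable without manipulation: smooth density through the cutoff
honest = rng.uniform(0, 100, size=20_000)

# Hypothetical manipulation: half of those just below the cutoff
# nudge their score so that it lands just above it
manipulated = honest.copy()
just_below = (manipulated >= cutoff - width) & (manipulated < cutoff)
movers = just_below & (rng.random(manipulated.size) < 0.5)
manipulated[movers] += width

print("honest      (below, above):", counts_around_cutoff(honest, cutoff, width))
print("manipulated (below, above):", counts_around_cutoff(manipulated, cutoff, width))
```

In the honest sample the two counts should be roughly equal, while in the manipulated one the bin just above the cutoff should be clearly larger.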

Finally, the causal graph that represents this study is shown below. Notice the back door through $Z$, which might or might not be closed.

img Nick Huntington-Klein, The Effect “Regression Discontinuity, A Causal Diagram that Regression Discontinuity Works For” 27 Nov. 2022. Used under Fair Use, no commercial intent, for educational purposes only. Original Source

However, even if there are other paths such as $RunningVariable \rightarrow Z \rightarrow Outcome$, comparing only observations just on either side of the “arbitrary” cutoff holds the running variable roughly fixed, which also closes those other paths (such as the one via $Z$) without having to measure $Z$. This makes regression discontinuity more powerful than a simple OLS that has to control for such a variable explicitly.
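A small simulation can make this point concrete. In the sketch below, $Z$ depends on the running variable and also affects the outcome, and it is never used in the estimation. The data-generating process, the true effect of 1.0 and the bandwidth of 0.2 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
true_effect = 1.0

x = rng.uniform(-1, 1, size=n)                # centered running variable
d = (x >= 0).astype(float)                    # treatment assigned at the cutoff
z = np.exp(x) + rng.normal(0, 0.1, size=n)    # Z is caused by the running variable
y = true_effect * d + 2.0 * z + rng.normal(0, 0.5, size=n)

# Naive comparison: difference in mean outcomes, ignoring Z entirely
naive = y[d == 1].mean() - y[d == 0].mean()

# RD comparison: one linear fit per side, only within a narrow bandwidth.
# np.polyfit(..., 1) returns [slope, intercept]; the intercept is the
# fitted value at the cutoff (x = 0).
h = 0.2
left = (x < 0) & (x >= -h)
right = (x >= 0) & (x <= h)
rd = np.polyfit(x[right], y[right], 1)[1] - np.polyfit(x[left], y[left], 1)[1]

print(f"naive difference: {naive:.2f}")
print(f"RD estimate:      {rd:.2f} (true effect {true_effect})")
```

The naive difference mixes the treatment effect with the effect of $Z$, whereas the comparison at the cutoff should land close to the true effect even though $Z$ is unobserved.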

The procedure is simple (it only consists of 4 steps)

img Nick Huntington-Klein, The Effect “Regression Discontinuity, Step by Step” 27 Nov. 2022. Used under Fair Use, no commercial intent, for educational purposes only. Original Source

The interesting steps are (b) and (c). In (b) we choose a model that predicts the outcome from the running variable and fit it separately on each side of the cutoff (two models, one per side), using all the data rather than only the bandwidth interval.

In (c) we restrict the sample to the bandwidth region and re-fit the model (again on both sides). We should drop the samples that are too far from the cutoff, because they may open additional back doors that break our “randomly assigned” assumption. The bandwidth should not be too narrow either, because fewer samples mean more variance in our models (and more overfitting). So there is a trade-off when choosing the bandwidth.

With this result we can only estimate the effect of the treatment within the bandwidth, near the cutoff. We can’t extrapolate this effect to the rest of the population unless we make a very strong assumption. Some use cases, such as deciding whether to move the cutoff, are not limited by this constraint, but if we want to estimate effects far from the cutoff this methodology may not be enough.
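The bandwidth trade-off can be illustrated with a repeated simulation. This is a sketch under stated assumptions: the outcome follows a made-up cubic relationship with a true jump of 1.0, and `rd_jump` and `simulate` are hypothetical helper names, not functions from any library.

```python
import numpy as np

def rd_jump(x, y, bandwidth):
    """Jump at the cutoff (x = 0) from separate linear fits on each side."""
    left = (x < 0) & (x >= -bandwidth)
    right = (x >= 0) & (x <= bandwidth)
    # np.polyfit(..., 1) returns [slope, intercept];
    # the intercept is the fitted value at the cutoff
    left_at_cutoff = np.polyfit(x[left], y[left], 1)[1]
    right_at_cutoff = np.polyfit(x[right], y[right], 1)[1]
    return right_at_cutoff - left_at_cutoff

def simulate(rng, n=2_000, true_jump=1.0):
    x = rng.uniform(-1, 1, size=n)      # centered running variable
    d = (x >= 0).astype(float)          # sharp treatment
    # Assumed curved relationship, so a wide linear fit is misspecified
    y = 2.5 * x**3 + true_jump * d + rng.normal(0, 0.5, size=n)
    return x, y

rng = np.random.default_rng(7)
bandwidths = [1.0, 0.3, 0.05]
estimates = {h: [] for h in bandwidths}

# Repeat the experiment to see both the average estimate and its spread
for _ in range(200):
    x, y = simulate(rng)
    for h in bandwidths:
        estimates[h].append(rd_jump(x, y, h))

summary = {h: (np.mean(v), np.std(v)) for h, v in estimates.items()}
for h, (mean, sd) in summary.items():
    print(f"bandwidth={h:<4}  mean estimate={mean:.2f}  sd={sd:.2f}")
```

With the widest bandwidth the estimates are stable but centered far from the true jump (bias). With the narrowest one they are centered near the truth but vary much more from sample to sample (variance).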

Estimation

Using OLS we can fit the following model, where $D$ is the binary treatment. In this particular case (a sharp design), $D$ indicates both being treated and being above the cutoff. This is not valid in a fuzzy regression discontinuity.

\[Y = \beta_0 + \beta_1(X-X_C) + \beta_2 D + \beta_3(X-X_C)D + \epsilon\]

We usually fit this model with heteroskedasticity-robust standard errors. The result is two lines:

  1. $\beta_0$ as intercept and $\beta_1$ as slope (untreated side)
  2. $\beta_0 + \beta_2$ as intercept and $\beta_1 + \beta_3$ as slope (treated side)

Finally, the treatment effect is estimated by $\beta_2$, the vertical gap between the two lines at the cutoff.
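To see how the coefficients map onto the two lines, here is a minimal sketch that fits the model above with plain NumPy on simulated data. The data-generating process (true jump of 2.0, heteroskedastic noise) is an assumption, and the HC1 sandwich formula is written out by hand as one possible robust estimator, not necessarily the one a given package uses by default.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000
cutoff = 0.5

x = rng.uniform(0, 1, size=n)        # running variable X
d = (x >= cutoff).astype(float)      # sharp design: D = 1 above the cutoff
xc = x - cutoff                      # X - X_C

# Assumed data-generating process with a true jump of 2.0 at the cutoff
noise = rng.normal(0, 1 + np.abs(xc), size=n)   # heteroskedastic noise
y = 1.0 + 0.5 * xc + 2.0 * d + 1.5 * xc * d + noise

# Design matrix columns: intercept, (X - X_C), D, (X - X_C) * D
X = np.column_stack([np.ones(n), xc, d, xc * d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Heteroskedasticity-robust (HC1) covariance via the sandwich formula
resid = y - X @ beta
bread = np.linalg.inv(X.T @ X)
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc1 = bread @ meat @ bread * n / (n - X.shape[1])
se = np.sqrt(np.diag(cov_hc1))

print(f"untreated line: intercept={beta[0]:.2f}, slope={beta[1]:.2f}")
print(f"treated line:   intercept={beta[0] + beta[2]:.2f}, "
      f"slope={beta[1] + beta[3]:.2f}")
print(f"treatment effect (beta_2): {beta[2]:.2f} (robust SE {se[2]:.2f})")
```

Because the running variable is centered at the cutoff, both intercepts are predictions at the cutoff itself, which is why their difference $\beta_2$ is the estimated jump.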

One piece of advice is to always use a simple model (OLS) and handle non-linearity in the data by reducing the bandwidth rather than by adding complexity. Even if the relationship is non-linear, close to the cutoff and within the bandwidth a linear approximation is about as good as a non-linear model when it comes to estimating the treatment effect. This is similar to linearly approximating a curve: when we zoom in, the approximation is not that bad, at least close enough to determine the (local) ATE. This procedure is called Local Regression. This approach, however, requires a fair amount of data.

There are many Local Regression models. Some of them use weighted regressions (for example with triangular kernels), and a more popular one is LOESS regression.

Implementation

Manual Implementation (a quadratic version and a kernel one)

import numpy as np
import statsmodels.formula.api as smf
from causaldata import gov_transfers

# Load the government transfers data as a pandas DataFrame
d = gov_transfers.load_pandas().data

# Run the polynomial (quadratic) model, letting both the linear and
# the squared term differ on each side of the cutoff
m1 = smf.ols('''Support~Income_Centered*Participation +
I(Income_Centered**2)*Participation''', d).fit()

# Create the triangular kernel function
def kernel(x):
    # To start at a weight of 1 at x = 0 and reach 0 at a bandwidth
    # of .01, we need a slope of -1/.01 = -100;
    # to go in either direction, use the absolute value
    w = 1 - 100*np.abs(x)
    # If further away than .01, the weight is 0, not negative
    w = np.maximum(0, w)
    return w

# Run the linear model with kernel weights using wls
m2 = smf.wls('Support~Income_Centered*Participation', d,
             weights=kernel(d['Income_Centered'])).fit()

# Print both summaries (a bare m1.summary() is only displayed in a notebook)
print(m1.summary())
print(m2.summary())

There is now a version of the rdrobust package available for Python. The documentation is sparse, but the functions are documented in their docstrings. There are also some examples available.

Fuzzy Regression Discontinuity

Sometimes the cutoff does not determine the treatment directly, but only increases the probability of receiving it. This can include self-selection or opt-in treatments (such as admission or retirement). The plot below contrasts a sharp design (traditional RDD) with a fuzzy one. In the right-hand plot, the share of treated samples does not jump straight to 100% at the cutoff; it only increases.
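The difference between the two designs shows up directly in the share of treated units on each side of the cutoff. The sketch below only simulates that share. The treatment probabilities of 20% below and 70% above the cutoff are assumptions for illustration, and no fuzzy estimation is attempted.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
x = rng.uniform(-1, 1, size=n)   # centered running variable, cutoff at 0
above = x >= 0

# Sharp design: the treatment is fully determined by the cutoff
d_sharp = above.astype(int)

# Fuzzy design (assumed probabilities): crossing the cutoff raises the
# chance of treatment from 20% to 70%, but does not guarantee it
p_treat = np.where(above, 0.7, 0.2)
d_fuzzy = (rng.random(n) < p_treat).astype(int)

for name, d in [("sharp", d_sharp), ("fuzzy", d_fuzzy)]:
    print(f"{name}: share treated below={d[~above].mean():.2f}, "
          f"above={d[above].mean():.2f}")
```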

This topic is not covered in this post, but it is available in the linked documentation.

img Nick Huntington-Klein, The Effect “Regression Discontinuity, Proportion Treated in Sharp and Fuzzy Regression Discontinuity Designs” 27 Nov. 2022. Used under Fair Use, no commercial intent, for educational purposes only. Original Source

This post is licensed under CC BY 4.0 by the author.