an mdrd primer
intro
for the last few years, i’ve been using centralized matching procedures to understand the impacts of schools on their students. the “state-of-the-art” methodology for this is a series of papers written by my advisor and coauthors:
- Research Design Meets Market Design: Using Centralized Assignment for Impact Evaluation (Abdulkadiroğlu, Angrist, Narita, and Pathak, 2017) (link)
- Breaking Ties: Regression Discontinuity Design Meets Market Design (Abdulkadiroğlu, Angrist, Narita, and Pathak, 2022) (link)
within Blueprint Labs (where i’m a research associate), we generally refer to these papers as MDRD1 and MDRD2, respectively.
these papers definitely took me a while to understand; i first saw them in year one of grad school, and have since coded them up for various projects. three years later, i think they’re some of the papers i understand best, and i think it’s useful to put into words how i understand and teach them, to help others use them.
this post is focused on the first of these papers (MDRD1). i’m hoping to write another post sometime soon about how MDRD2 works, for which this post will be required reading :)
this post is divided into various sections:
- intro: you’re reading it now! it sets up some preliminaries for this post, including what background you should have and the setting of school choice.
- why is mdrd useful: some background on why these papers even matter, and how they get used.
- calculating propensity scores from deferred acceptance: the meat of mdrd. this is divided into several sections.
- settings 1–3 teach the bulk of MDRD1, where all schools admit students by some type of lottery.
- two quick asides discuss some propensity score properties and how to handle priority groups.
- bringing it home: given the formulas derived in the last section, how do we take this to real data? why does this work? what doesn’t it do well?
- conclusion: some parting thoughts.
- a future post: MDRD2, where we now allow for schools to instead use a “screened” score, like an audition score, test score, etc. if you don’t have any such schools, MDRD1 is all you need!
who’s this for + prereqs
the intended audience of this post is decently wide. i’ve tried to keep it at a level where an undergrad who has taken econometrics can understand what’s going on (so long as they also know deferred acceptance). i think it’s also useful more broadly for researchers who are looking to understand the MDRD papers at a deep level and to implement them in their research. this guide will start from the ground up, and so there may be some parts you skim, but i think it can be useful for understanding what’s going on and figuring out how to put it into practice.
unfortunately, this post isn’t quite “for everyone”. of course, anyone is welcome to read it — i have explicitly tried to make the material as accessible as possible — but for you to get the most out of it, there’s a few econ/math concepts you really should know.
- (a must) you know how student-proposing deferred acceptance works. (sometimes, this is known as the gale-shapley algorithm for stable matches.) in the next section, i do describe it a little — but i personally don’t think it’s fleshed out enough for someone who doesn’t know DA to understand deeply.1
- (a must) you have knowledge of probability (mostly at the level of “odds of independent events can be multiplied to find overall probability”), and conditional probability ($P(A \cap B) = P(B)P(A\mid B)$).
- (a must) you know what a regression is, and how to interpret coefficients (including those on indicator variables).
- (highly recommended) you know how instrumental variables works. i have a bit of a non-technical primer here, but it is probably best if you know how they work in terms of regressions.
- (useful, not required) some familiarity with RCTs/propensity scores and Rosenbaum + Rubin (1983); i’ll be giving a high level overview in this post, but being familiar with propensity scores will definitely help.
- (optional) some knowledge of sequence convergence and the continuous mapping theorem. this is used in the “why does mdrd work” section. that section will primarily be a non-technical overview of it, but in footnotes, i’ll add some extra notes about the math behind it working.
- (highly recommended, but only for the next post) you’re aware of regression discontinuities and how they work. however, these won’t come into play until we get to MDRD2, not in this post.
about our setting of school choice
in this post, we’re going to be thinking about school choice, when students are assigned to schools through a central mechanism. to be specific, we will need:2
-
a set of students \(I = \{i_1, i_2, \dots i_n\}\).
-
a set of schools \(\{0, 1, \dots, S\}\); school $0$ will represent the outside option (often being unassigned).
-
school $s$ will have capacity $q_s$; the outside option (school $0$) has enough capacity so that all students can be assigned there (i.e., $q_0 > n$).
-
each student has preferences over all schools. if student $i$ prefers school $a$ to $b$, we write $a \succ_i b$. student $i$’s preferences over schools are strict3 and complete4 and are collected in $\succ_i$.
-
we’ll then say that each student has a type $\theta_i = (\succ_i) \in \Theta$. in essence, a student’s type is just their preference ordering.
note: in MDRD, we allow schools to have priorities $\rho$ over students, e.g., all students with a sibling at the school are ranked before everyone else. in this primer, we’re going to start by saying that there are no such priority groups. we’ll come back to it at the end (specifically, here); i’m just making a note here to avoid confusion if you’re following along with the papers. when that happens, student type will be given by the tuple $\theta_i = (\succ_i, \mathbf{\rho}_i)$, where $\mathbf{\rho}_i$ collects $i$’s priority at every school.
-
schools rank students on the basis of some score, $r_{i}^v$. this might be a lottery number, test scores, results of an audition, etc. we normalize these scores to be on the range $[0, 1]$ and so that students with lower scores are admitted first. the $v$ in $r_i^v$ is an index, where each $v$ represents a different score type. multiple schools can use the same score type (e.g., students in NYC have a single lottery draw across all schools, students may all be ranked according to the same test, etc.). the key is that $r^v$ needs to contain all of the information for ranking students. let $v(s)$ be the score type that school $s$ uses. we’ll let there be $V$ such different scores, and for now, we’ll assume that all of these scores are lotteries, and are independent of student types.5
-
students then apply to schools using student-proposing deferred acceptance (DA). as a quick refresher: in round $k$, students without a current assignment apply to their top choice that hasn’t yet rejected them. schools then tentatively accept up to $q_s$ applicants among those that applied in round $k$ and those tentatively accepted from round $k-1$; school $s$ tentatively accepts students on the basis of $r_i^{v(s)}$ (i.e., taking the $q_s$ applicants from this round with the lowest scores per $r_i^{v(s)}$). the rest are rejected. those rejected students apply to their next-most-preferred school in round $k+1$. this process repeats until all students are either matched, or have no other school to apply to. We often call the result of this matching $\mu$, so that if $i$ is matched to $s$, $\mu(i)=s$; in the MDRD papers, we also let $D_i(s)$ be a dummy indicating whether $i$ is assigned to $s$.
an equivalent formulation of DA that will be of lots of use to us is to represent it in terms of cutoffs. to know the outcome of DA, it is sufficient to know student preferences, their scores, and the score of the marginal admitted student at each school.6 we can collect these in a vector $\mathbf{c} = (c_0, c_1, c_2, \dots, c_S) \in [0, 1]^{S+1}.$ we’ll also set $c_s=1$ for any school that never rejects anyone, including $c_0$, to show that they’re under-demanded schools.7 then, each student is admitted to their most-preferred school for which they meet the cutoff: i.e., $r_i^{v(s)} \le c_s$.
why would cutoffs give the same outcome as the usual DA? note that in this cutoff representation, each student $i$ misses the cutoff at every school they prefer to $\mu(i)$ and so gets rejected there, but they make the cutoff at $\mu(i)$, and so are accepted there. and so every student gets matched to exactly $\mu(i)$!
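to make the cutoff representation concrete, here’s a minimal sketch (not code from the MDRD papers) that runs round-based student-proposing DA on a small made-up market and then checks that the cutoffs alone reproduce the match. the school names, capacities, and the choice of one shared lottery number per student are all assumptions for illustration; it also assumes every school has at least one seat and that no two students tie on their lottery number.

```python
import random

def deferred_acceptance(prefs, capacity, score):
    """Round-based student-proposing DA with one lottery number per student.

    prefs[i]    : list of schools, most preferred first
    capacity[s] : number of seats at school s (assumed >= 1)
    score[i]    : lottery number in [0, 1]; lower is better
    Returns (mu, cutoffs); mu[i] is None for the outside option.
    """
    next_choice = {i: 0 for i in prefs}   # index of the next school to try
    held = {s: [] for s in capacity}      # tentative acceptances
    ever_rejected = set()                 # schools that rejected someone
    applicants = set(prefs)               # students with no tentative seat

    while applicants:
        # everyone without a seat applies to their next-best school
        for i in applicants:
            if next_choice[i] < len(prefs[i]):
                held[prefs[i][next_choice[i]]].append(i)
                next_choice[i] += 1
        applicants = set()
        # each school keeps its best q_s applicants and rejects the rest
        for s in held:
            held[s].sort(key=lambda j: score[j])
            rejected = held[s][capacity[s]:]
            held[s] = held[s][:capacity[s]]
            if rejected:
                ever_rejected.add(s)
                applicants.update(rejected)

    mu = {i: None for i in prefs}
    for s, students in held.items():
        for i in students:
            mu[i] = s
    # cutoff = score of the marginal admit; 1 if the school never rejected
    cutoffs = {s: max(score[i] for i in held[s]) if s in ever_rejected else 1.0
               for s in capacity}
    return mu, cutoffs

def assign_by_cutoffs(prefs, score, cutoffs):
    """Give each student their most-preferred school whose cutoff they clear."""
    return {i: next((s for s in prefs[i] if score[i] <= cutoffs[s]), None)
            for i in prefs}

# hypothetical market: 12 students, 9 seats
rng = random.Random(0)
schools = ["a", "b", "c"]
capacity = {"a": 2, "b": 3, "c": 4}
prefs = {i: rng.sample(schools, k=rng.randint(1, 3)) for i in range(12)}
score = {i: rng.random() for i in range(12)}

mu, cutoffs = deferred_acceptance(prefs, capacity, score)
same_match = assign_by_cutoffs(prefs, score, cutoffs) == mu
```

`same_match` comes out `True`: once you know preferences, scores, and the cutoff vector, you don’t need to replay the rounds.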
why is mdrd useful?
say that we want to evaluate the effect of attending a specific school $s$ for students. we’ll do this evaluation with respect to some outcome $Y$. in many places and papers, the outcome of interest is test scores. but we might be interested in some other outcome: whether you go to/graduate college, wages earned later in life, whether you stay in the city as an adult, or just plain old happiness. all of these are valid outcomes — it’s up to us as researchers to decide what yardstick we want to use.
but whatever $Y$ we choose, we can’t just compare the outcomes of students who attended $s$ and those that didn’t — there’s bound to be selection bias due to who chooses to enroll in $s$. it’s the same reason that we shouldn’t just compare schools by comparing their students’ average test scores or graduation rates: we don’t know whether it’s because the school is good, or whether those students would have had high test scores even if they didn’t attend $s$. ideally, we’d like to observe the same student attending $s$ and not attending $s$, but because each student can only do one of those two, we can’t.
but we’re not completely out of luck: many school systems use some type of centralized admission system to admit students to schools, and oftentimes, these systems have some form of a lottery. for such a lottery to be fair, the system should have Equal Treatment of Equals (ETE): any two students of the same type $\theta$ should have the same likelihood of being admitted to a given school $s$. in essence, ETE just means that $\theta$ contains all of the information that determines a student’s odds of admission. that allows us to let $p_s(\theta)$ be the probability that type $\theta$ is admitted to school $s$, and know that it gives a “complete” picture of admissions probabilities. note that deferred acceptance is a system with ETE: two students with the same preferences have the same odds at every school, because which school they get into is decided only by their random lottery numbers.
an expository example
suppose that every student attends school $s$ with independent probability $p_s(\theta) = 0.3$; let’s represent attendance of student $i$ with the indicator variable $D_i$. say that we wanted to find the causal8 effect of attending school $s$, which we’ll call $\beta_s$.
using the framework of the Rubin causal model,9 let $Y_i^1$ be the outcome of student $i$ if they attend $s$, and $Y_i^0$ if they do not. we only observe one of these at a time — specifically, $Y_i^{D_i}$ — because each student either attends or doesn’t attend $s$. because of this, we call $Y_i^0$ and $Y_i^1$ potential outcomes.
by definition, $\beta_s$ tells us what happens, on average, to a student because of attending $s$. in other words,
\[\beta_s = \mathbb{E}\left[Y_i^1 - Y_i^0\right].\]however, as discussed earlier, we don’t observe both $Y_i^1$ and $Y_i^0$ for the same student, so we can’t just calculate this directly. but because assignment was random, we can do a fun little trick:
\[\begin{align*} \beta_s &= \mathbb{E}\left[Y_i^1 - Y_i^0\right] \\ &= \mathbb{E}\left[Y_i^1\right] - \mathbb{E}\left[Y_i^0\right] \\ &= \mathbb{E}\left[Y_i^1 \mid D_i = 1\right] -\mathbb{E}\left[Y_i^0 \mid D_i = 0\right] \end{align*}\]the third line here is the clever step — because school assignment $D_i$ was done randomly, it must have been done independently of students’ potential outcomes, so we know that $\mathbb{E}\left[Y_i^1\right] = \mathbb{E}\left[Y_i^1 \mid D_i = 1\right]$ and $\mathbb{E}\left[Y_i^0\right] = \mathbb{E}\left[Y_i^0 \mid D_i = 0\right] $.
but why exactly is this useful? well, we actually observe each of these two quantities! they’re just the mean outcomes of students assigned to $s$ and not assigned to $s$, respectively.
equivalently, we could represent this in the form of a regression:
\[Y_i = \beta_0 + \beta_s D_i(s) + \varepsilon_i\]then, $\beta_s$ will exactly be the quantity described above. 10
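here’s a small simulation of this expository example, as a sketch under made-up assumptions: the potential outcomes are drawn from normal distributions i picked arbitrarily, and the average effect of $0.5$ is invented. it shows that with random assignment at probability $0.3$, the difference in observed means lands close to the true average effect, and that the regression slope is the same number as the difference in means.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# hypothetical potential outcomes; in real data we never see both
y0 = rng.normal(size=n)
y1 = y0 + 0.5 + 0.2 * rng.normal(size=n)
true_beta = np.mean(y1 - y0)

# random assignment with probability 0.3, independent of (y0, y1)
d = rng.random(n) < 0.3
y = np.where(d, y1, y0)          # the one outcome we actually observe

# difference in mean outcomes between assigned and unassigned students
diff_in_means = y[d].mean() - y[~d].mean()

# the same quantity as the slope in a regression of Y on a constant and D
X = np.column_stack([np.ones(n), d.astype(float)])
beta_0, beta_s = np.linalg.lstsq(X, y, rcond=None)[0]
```

`beta_s` and `diff_in_means` agree up to floating point error, and both sit near `true_beta`.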
more generally
there are two things that make the real world more complicated than this example; however, we have some ways to get around those issues.
-
students might be assigned to $s$ with different probabilities, based on their preferences, where they live, school priorities, etc. it seems like there still might be a lot of omitted variable bias here, since all of these things could plausibly be related to our outcome $Y$. (e.g., if your probability of assignment is higher because you live close to a good school, housing prices there might be higher, and so your family might be more well off.) we can’t just control for all of these things; controlling for every possible student preference ranking is far too many controls to be useful.
to get around this, we can use the idea of propensity score conditioning.11 at an intuitive level, this procedure says: take a look at some group of students who all have the same likelihood $p$ of attending $s$. by the logic above — if we compare the mean outcomes of assigned and unassigned students, we should get an unbiased estimator of the true treatment effect. but we can do this for all different values of $p$ simultaneously by adding them in as controls in our regression.12
this means that we end up estimating the equation
\[Y_i = \beta_s D_i(s) + \sum_x \alpha_2(x)d_i(x) + \eta_i \tag{*}\]where each \(d_i(x) = 𝟙\{p_s(\theta_i)=x\}\). essentially, this adds a dummy for each distinct propensity score $x$ to the regression.
when we do so, we don’t need to include any other controls; these propensity scores control for all of the selection bias that is there. (other controls, like student demographics, can still be included to help increase precision, but are not needed to have unbiased estimates.) we run a separate regression for each school $s$ of interest.
another way of thinking about why this works: controlling for everything doesn’t work because there’d be too many controls. controlling for propensity scores does the same thing, but with fewer controls on the right hand side; this is what Hahn (1998) calls the “dimension reduction” effect of propensity scores. we’re basically finding ways to group students together.
a quick aside: in order for this to work, we need to restrict our sample to students who have “risk” at school $s$, i.e., their propensity score is not $0$ or $1$. intuitively, for students who always go to school $s$, there’s no comparable set of students who weren’t assigned to school $s$ to compare them to. so they don’t actually help us at all finding the effect of going to school $s$. so we drop all of these students without risk :)
an aside to the aside: just because a school accepts everyone doesn’t mean there’s no risk. suppose that there’s some school $s_1$ where students are accepted with chance $\frac12$, and $s_2$ accepts everyone. all students have preferences $s_1 \succ s_2$. even though $s_2$ accepts everyone who applies, every student still has only a $50\%$ chance of being assigned to $s_2$. this risk is generated by the random assignment at schools preferred to $s_2$.
-
in almost every circumstance, there is imperfect compliance with the results of a lottery. some students might end up at another school, some students might get in off of the waitlist.
to help, let’s turn to our old friend instrumental variables. as a quick recap of the relevant bit: if we want to find the effect of attendance ($D$) on some outcome ($Y$), we can instrument our regression of $Y$ on $D$ with some third $Z$. there’s a few conditions this needs to satisfy: $Z$ needs to covary with $D$ (relevance), and also needs to only affect $Y$ through $D$ (exclusion restriction). it is natural here to take as $Z$ the results of the lottery — in essence, instrumenting attendance with admission.
when we do so, the coefficient in front of $D$ will be the treatment effect of school $s$ for students whose attendance decision was changed because of the results of the lottery.13
to be specific — we take equation $(*)$ , and instrument it using the equation
\[D_i(s) = \gamma Z_i(s) + \sum_x \alpha_1(x)d_i(x) + \nu_i \tag{$\dagger$}\]note that because we controlled for propensity scores in the second stage (equation $(*)$), we also should in the first stage. it’s only fair.
note: in all of these estimating equations, we’re finding the causal effect of going to school $s$, where the alternative is “doing anything else”. sometimes, we might be interested specifically in the effect of going to school $s$ against the alternative of not going to any school. this, in some sense, is closer to the idea of “school value-added”. if we were to do so, we’d want to add in indicators and propensity scores for attending a school that isn’t $s$ (but still attending a school). if we let these be $D_i(-s)$ and $d^{-s}_i(x)$, we’ll turn equation $(*)$ into
\[\begin{equation*} Y_i = \beta_s D_i(s) + \sum_x \alpha_2(x) d_i(x) + \beta_{-s} D_i(-s) + \sum_x \alpha_3(x) d_i^{-s}(x) + \eta_i \end{equation*}\]so that our baseline comparison is against those that didn’t go to school at all. (we’ll add in these same covariates into $(\dagger)$, as well.)
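to see equations $(*)$ and $(\dagger)$ in action, here’s a hedged sketch on simulated data. everything about the data-generating process is an assumption i made up for illustration: three propensity score values, outcomes that depend directly on the propensity score (the selection problem), $80\%$ of students complying with their offer, and a constant true effect of $1$. the two stages are run by hand with least squares; standard errors from doing it this way would be wrong, so treat it as a check on the point estimate only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
true_beta = 1.0                               # made-up constant effect

# hypothetical propensity scores, offers, and imperfect compliance
p = rng.choice([0.2, 0.5, 0.8], size=n)       # p_s(theta_i)
Z = (rng.random(n) < p).astype(float)         # offered a seat at s
complier = rng.random(n) < 0.8                # 80% attend if offered
D = Z * complier                              # actually attends s

# outcomes depend on p directly, so naive comparisons are biased
Y = true_beta * D + 2.0 * p + rng.normal(size=n)

# one dummy d_i(x) per distinct propensity score value x
dummies = np.column_stack([(p == x).astype(float) for x in np.unique(p)])

def ols(X, y):
    """Least squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# naive regression of Y on D with an intercept and no controls
beta_naive = ols(np.column_stack([D, np.ones(n)]), Y)[0]

# first stage: D on Z plus propensity score dummies
first_stage = np.column_stack([Z, dummies])
D_hat = first_stage @ ols(first_stage, D)

# second stage: Y on fitted D plus the same dummies
beta_iv = ols(np.column_stack([D_hat, dummies]), Y)[0]
```

in this setup `beta_naive` is pushed well above $1$ because students with high propensity scores also have high baseline outcomes, while `beta_iv` lands close to the true effect.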
finding propensity scores
given the above, to determine the causal effect of attending school $s$, all we need is (a) students’ admission records, (b) their attendance, and (c) the likelihood that each student is admitted to school $s$. the first two are in administrative data, but figuring out the third is much easier said than done.
a natural way to do this might be to simulate the school lottery a bajillion times, and see the likelihood that each student is admitted to school $s$. this would be an unbiased estimate of the true odds of assignment! but when we think about implementing this, there’s a bit of an issue. we need to simulate a large number of times $N$ in order to have a good estimate of the propensity scores $p$. but when we do so, two students with exactly the same propensity score of assignment to $s$ are unlikely to be assigned to $s$ the same number of times. (this is for the same reason that if two people flip a fair coin a million times, they almost certainly won’t flip the same number of heads.) now, we could “bin” similar propensity scores together, but there’s not really good theory-based reasons for choosing any bin size. why would we want to choose 0.1 over 0.01, over 0.0001? it’s very unclear.
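the coin-flip point is easy to check numerically. as a simplified stand-in for re-running the whole lottery, this sketch treats two students with the same true propensity score of $0.3$ (a made-up value) as independent coin flippers over a million simulated lotteries; in a real simulation their assignments would come from the same DA runs, but the same issue shows up.

```python
import numpy as np

rng = np.random.default_rng(0)
true_p = 0.3            # hypothetical shared propensity score
n_sims = 1_000_000      # number of simulated lotteries

# number of simulated lotteries in which each of the two students is assigned
times_assigned = rng.binomial(n_sims, true_p, size=2)
estimated_p = times_assigned / n_sims
```

both entries of `estimated_p` are close to $0.3$, but they will almost never be exactly equal, so a dummy for each distinct simulated score would put these two identical students in different cells.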
this is where mdrd comes in. mdrd is useful because at its core, it is a methodology to approximate student propensity scores in a way that inherently creates dimensionality reduction.
before getting into the details, here’s a few other ways that people have used as proxies for propensity scores.
- instead of using propensity scores, control for whether students have the same “risk set”; i.e., exposure to the same schools in their preference lists. essentially, this idea works because it controls for students’ types completely, eliminating omitted variable bias. however, in practice, this might lead to a lot of different controls. there are about $S^k$ different preference lists of length $k$, and for large $k$ (e.g., NYC has $k = 12$), we’re stuck with the curse of dimensionality.
- as a proxy, use whether students applied to $s$ as their first choice, e.g., as in Deming (2011). this method is valid; however, it loses out on the power generated from the randomness of students admitted to their second choices and beyond.
with that all in mind, let’s turn to…
how to approximate propensity scores in da
throughout the next few sections, we’re going to imagine a market with a very large number of students (i.e., $n\to\infty$). this leads to a cool property: the cutoff at each school is going to be the same no matter the lottery draws. intuitively, this happens because if any one person’s lottery number changes, then cutoffs won’t change very much.14
but the key thing for us: we can treat each school’s cutoff as a constant which is completely independent of the lottery draws, and never changes. let the cutoff of school $s$ be $c_s$.
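a quick way to see cutoffs settling down is to look at a single school with seats for a fixed share of its applicants and redraw the lottery many times. this is a toy sketch, not the paper’s argument: the $30\%$ seat share, the market sizes, and the number of redraws are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cutoff_spread(n, share=0.3, reps=200):
    """Std. dev. of one school's cutoff across `reps` lottery redraws.

    The school has seats for `share` of its n applicants, so its cutoff is
    the lottery number of the last student admitted.
    """
    seats = round(share * n)
    draws = rng.random((reps, n))                 # one lottery per row
    cutoffs = np.sort(draws, axis=1)[:, seats - 1]
    return cutoffs.std()

spread_small = cutoff_spread(100)       # small market
spread_large = cutoff_spread(10_000)    # market 100 times larger
```

`spread_large` is much smaller than `spread_small`: as the market grows, the cutoff barely moves from one lottery draw to the next, which is what lets us treat it as a constant.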
note: if you’re following along with MDRD1, there are a few changes to my exposition / notation:
- as mentioned above: i’m fully ignoring priority groups for now. we’ll get to that towards the end of the post (specifically, here)
- MDRD1 refers to infinite-market propensity scores using $\varphi$, rather than $p$. i’m going to stick with $p$ for now, because it’s easier to remember “$p$ for propensity scores”. we’ll come back to this when we explain why MDRD works.
to learn about the methodology, we’re going to think through a few different examples, which will let us derive the MDRD1 formulas.
setting 1: separate lotteries
consider an infinitely sized market where each school uses a separate, independent, lottery to assign students (i.e., a separate $v$ for each school.). suppose student $i$ of type $\theta$ has preferences $1 \succ 2 \succ 3 \succ 4$, where school $s$ has a cutoff of $c_s$. what is the probability that a student of type $\theta$ is assigned to school $4$, or $p_4(\theta)$?
you might be able to answer this quite quickly! i’m going to write it out in detail in a very specific way, because it’ll be helpful for when we get to our more complicated settings.
note that two things need to happen: $i$ needs to be rejected from every school they like more than school $4$, and then they need to make it into school $4$ (given that they got rejected from everywhere previous). for ease of notation, let $B_{\theta 4}$ refer to the set of schools that a person of type $\theta$ likes more than school $4$ (where $B$ stands for “better”, as in “better than school $4$ for type $\theta$”).
we can write this out as
\[\begin{align*} p_{4}(\theta) &= P(\text{rejected from all } s \in B_{\theta 4} \text{ and } \text{accepted to $4$}) \\ &= P(\text{rejected from all } s \in B_{\theta 4}) \times P(\text{accepted to $4$}\mid\text{rejected from all }s \in B_{\theta 4}) \tag{$\S$} \end{align*}\]a statement which is true by the laws of conditional probability.
now, let’s evaluate each of these two terms. starting with the first: $i$ needs to miss the cutoff of every school in $B_{\theta 4}$. so at each of these schools, $i$ needs to have a score in $[c_s, 1]$. because each of these are independent events, the total probability of all of these happening is just their product: $\prod_{s\in B_{\theta 4}} (1 - c_s) = (1-c_{1})(1-c_{2})(1-c_{3})$.
turning now to term two: $i$ needs to make the cutoff at school $4$, i.e., have a lottery draw less than $c_{4}$. well, $i$’s draw at school $4$, which is $r_i^4$, is fully independent of previous draws, and so this probability is simply $c_{4}$.
our final answer is then the product of these two:
\[\boxed{p_{4}(\theta) = \underbrace{(1-c_{1})(1-c_{2})(1-c_{3})}_{P(\text{rejected from all } s \in B_{\theta 4})} \times \underbrace{c_{4}}_{P(\text{accepted to $4$}\mid\text{rejected from all }s \in B_{\theta 4})}}\]this isn’t too hard to generalize to any setting where each school uses a different lottery. by a similar process, we can say that
\[p_{s}(\theta) = \underbrace{\prod_{s'\in B_{\theta s}} (1 - c_{s'})}_{P(\text{rejected from all } s' \in B_{\theta s})} \times \underbrace{c_{s}}_{P(\text{accepted to $s$}\mid\text{rejected from all }s' \in B_{\theta s})} \tag{1}\]while the formula might look confusing, at its core we’re just checking that a person of type $\theta$ doesn’t get in anywhere they like more than $s$, but that they actually get into $s$. all of these probabilities are independent, so we can multiply them to find our answer!
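equation $(1)$ is short enough to write as a function. this is a minimal sketch; the function name is mine, and the cutoff values in the example are made up, since this setting doesn’t pin down any numbers.

```python
from math import prod

def p_separate_lotteries(prefs, cutoffs, s):
    """Propensity score for school s when every school runs its own lottery.

    prefs   : the student's schools, most preferred first
    cutoffs : dict mapping school -> cutoff c_s in [0, 1]
    """
    better = prefs[:prefs.index(s)]       # B_{theta s}: schools ranked above s
    p_rejected_everywhere_better = prod(1 - cutoffs[b] for b in better)
    return p_rejected_everywhere_better * cutoffs[s]

# hypothetical cutoffs for a student with preferences 1 > 2 > 3 > 4
prefs = [1, 2, 3, 4]
cutoffs = {1: 0.3, 2: 0.7, 3: 0.6, 4: 0.9}
scores = {s: p_separate_lotteries(prefs, cutoffs, s) for s in prefs}
p_unassigned = prod(1 - cutoffs[s] for s in prefs)
```

a handy sanity check: the four propensity scores plus the probability of missing every cutoff should sum to one.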
setting 2: one lottery draw
setting two will be just like the first setting, but with one clear difference: all schools now use the same lottery draw (i.e., all schools use the lotto draw $v = 1$). we’re going to run through an example where we think from first principles about the probability that $i$ is assigned to different schools.
suppose again that $i$ is of type $\theta$, and has preferences $1 \succ 2 \succ 3 \succ 4$, and that the cutoff vector is given by $\mathbf{c} = (c_1, c_2, c_3, c_4) =(0.3, 0.7, 0.6, 0.9)$.
-
what’s the probability that $i$ gets assigned to school $1$? all that needs to happen is $i$ has a lottery score that clears school $1$’s cutoff; this happens if the lottery draw is in the range $[0, 0.3]$, which has probability $\boxed{p_1(\theta) = 0.3}$.
-
what’s the probability that $i$ gets assigned to school $2$? our student $i$ clears the cutoff at school $2$ anytime that $r_i \le 0.7$. however, our answer here isn’t $0.7$. that’s because if $i$ draws something in $[0, 0.3]$, they’ll be assigned to school $1$ instead! so, $i$ needs to have a draw in the range $[0.3, 0.7]$, which happens with probability $\boxed{p_2(\theta) = 0.4}$.
-
what’s the probability that $i$ gets assigned to school $3$? let’s imagine that we’re working through the rounds of DA. if $i$ is applying to school $3$, then this means that they’ve already been rejected from schools $1$ and $2$. but because they were rejected from school $2$, we already know that $i$ has a score larger than $0.7$. now, school $3$ only admits students who have a lottery draw less than $0.6$; so there’s no way that $i$ gets assigned to $3$! this means our answer is $\boxed{p_3(\theta) =0}$.
-
what’s the probability that $i$ gets assigned to school $4$?
stating the intuition we’ve built up in a slightly different way: for $i$ to get assigned to school $4$, they both need to make the cutoff at school $4$, and miss every cutoff at schools that they prefer. when do they miss the cutoff at every school they like more than $4$? well, the least selective school that $i$ didn’t get assigned to was school $2$; because of that, we know that their score has to be larger than that cutoff: $r_i > 0.7$. they make the cutoff at school $4$ anytime $r_i\le 0.9$; putting these two pieces together, we know they need a score in the range $[0.7, 0.9]$, which happens with probability $\boxed{p_4(\theta) =0.2}$.
a specific thing i want to draw your attention to: when we were looking at admissions chances to school $4$, we didn’t need to use the cutoffs at schools $1$ and $3$. this is because they had stricter cutoffs than school $2$. it tells us more that $i$ got rejected from $2$ (after which we know $r_i>0.7$) when compared to the information we get from schools $1$ ($r_i > 0.3$) or $3$ ($r_i>0.6$).
in other words, if we wanted to calculate the chance $i$ is accepted to a school $s$, the only two pieces of information we need to know are the cutoff at $s$ (which is $c_s$), and the most lenient cutoff they missed among schools they liked better (as a reminder, these more-favored schools are given by the set $B_{\theta s}$).
we’re going to call this most lenient cutoff the $\boldsymbol{\mathrm{MID}}$ that this student has when applying to school $s$; MID stands for “most informative disqualification”. in essence, it just tells us the highest cutoff at a school that a student likes more than $s$. in notation:
\[\mathrm{MID}_{\theta s} = \max\left\{c_{s'}\mid s' \in B_{\theta s}\right\}\]now let’s think generally: what is the probability that a student of type $\theta$ is accepted to school $s$ in their preference list?
again, we know that they need to miss the cutoff at every school that they like more, but still make the cutoff at $s$. this only happens if they have a score in the range \([ \mathrm{MID}_{\theta s}, c_s ]\): if they score lower than \(\mathrm{MID}_{\theta s}\), they would get into somewhere better, and if they score higher than $c_s$, then they won’t make it into $s$, either.
of course, it might be the case that $c_s < \mathrm{MID}_{\theta s}$, and $s$ has a stricter cutoff than some school that you already got rejected from. if so, there’s no way you make it to $s$ — otherwise you’d have gotten into the school that set the $\mathrm{MID}$! — and so your odds of assignment are 0.
to put it all together: we can represent the probability that type $\theta$ is admitted to school $s$ as
$p_s(\theta) = \max(0, c_s - \mathrm{MID}_{\theta s})$
(in order to make this well defined, let’s also just say that when applying to your first choice, your $\mathrm{MID}$ is 0.)
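here’s that formula as a short function, run on the example from this setting (preferences $1 \succ 2 \succ 3 \succ 4$ and cutoffs $(0.3, 0.7, 0.6, 0.9)$). it’s a minimal sketch and the function name is my own.

```python
def p_single_lottery(prefs, cutoffs, s):
    """Propensity score for school s when all schools share one lottery draw.

    prefs   : the student's schools, most preferred first
    cutoffs : dict mapping school -> cutoff c_s in [0, 1]
    """
    better = prefs[:prefs.index(s)]                        # B_{theta s}
    # MID: the most lenient cutoff among preferred schools (0 for a first choice)
    mid = max((cutoffs[b] for b in better), default=0.0)
    return max(0.0, cutoffs[s] - mid)

prefs = [1, 2, 3, 4]
cutoffs = {1: 0.3, 2: 0.7, 3: 0.6, 4: 0.9}
scores = {s: p_single_lottery(prefs, cutoffs, s) for s in prefs}
```

up to floating point rounding, `scores` is $0.3$, $0.4$, $0$, and $0.2$ for schools $1$ through $4$. adding the $0.1$ chance of drawing above $0.9$ and going unassigned, these sum to one.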
i’m now going to write this probability in a slightly different way. going back to our example above, and looking again at school $4$; one way to state this probability is that $i$ needs to get rejected from every school that they like better than school $4$, and conditional on that rejection, they need to be admitted to school $4$. this is exactly the same statement as we made in setting 1 — despite the change in how scores work, this logic still holds, and so we can write the same formula as in equation $(\S)$ in setting 1:
$p_{4}(\theta) = P(\text{rejected from all } s \in B_{\theta 4}) \times P(\text{accepted to $4$}\mid\text{rejected from all }s \in B_{\theta 4})$
the first term, the probability of being rejected at schools better than school $4$, is the probability that $i$ misses the most lenient cutoff, which happens with probability $1 - 0.7 =0.3$. the second term is a bit more complicated: given that $i$ was rejected, we know that $i$’s score is more than $0.7$. in order to be accepted to school $4$, $i$ also needs a score less than $0.9$. this means that the conditional probability is $\frac{0.9-0.7}{0.3} = \frac23$.
our final answer is the product of these two numbers: $(1-0.7)\times\frac23 = 0.2$ — thankfully, the same answer we originally calculated.
thinking about this logic more generally leads us to the following equation:
\[p_{s}(\theta) = \underbrace{\left(1-\mathrm{MID}_{\theta s}\right)}_{P(\text{rejected from all } s' \in B_{\theta s})} \times \underbrace{\max\left( \frac{c_{s}-\mathrm{MID}_{\theta s}}{1-\mathrm{MID}_{\theta s}}, 0 \right)}_{P(\text{accepted to $s$}\mid\text{rejected from all }s' \in B_{\theta s})} \tag{2}\]this is equation 2 in MDRD1!15 under the hood, all that it is doing is multiplying the probability that you are rejected from places you like more than $s$ by the probability that, conditional on that, you still get into $s$.
note also that we can multiply the first term through the $\max$, and recover the original equation we found: $p_{s}(\theta) = \max(c_s - \mathrm{MID}_{\theta s}, 0)$ — and so they’re equivalent formulations! the form that is written in MDRD is done in that way because it highlights this as a “conditional” probability.
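if you’d rather check that equivalence numerically than algebraically, here’s a tiny sketch that compares the two forms over a grid of cutoffs. the function names and the grid are mine, and it assumes $\mathrm{MID}_{\theta s} < 1$ so the division is defined.

```python
def p_direct(c_s, mid):
    """The compact form: max(c_s - MID, 0)."""
    return max(c_s - mid, 0.0)

def p_conditional(c_s, mid):
    """The conditional form: P(rejected) * P(accepted | rejected)."""
    p_rejected = 1.0 - mid
    p_accept_given_rejected = max((c_s - mid) / (1.0 - mid), 0.0)
    return p_rejected * p_accept_given_rejected

grid = [k / 20 for k in range(20)]            # MID in {0, 0.05, ..., 0.95}
max_gap = max(abs(p_direct(c, m) - p_conditional(c, m))
              for m in grid for c in grid + [1.0])
```

`max_gap` is zero up to floating point error, including in the cases where $c_s < \mathrm{MID}_{\theta s}$ and both forms return $0$.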
we’ve done it! at least, partially. but it’s time to get a bit more complicated — how can we unite this formula with our first setting?
setting 3: all schools use a lottery, but some use the same one
oh yeah. we’re going real general.
while it might seem like this is just a technical edge case to go through, it’s something that happens a lot in real life — for example, all charter campuses run in some charter network might use the same lottery draw, but the rest of schools in Chicago Public Schools all use a CPS-drawn number.
let’s look at the same example as we did in setting 2, but now, $v(1) = v(3) = 1$, and $v(2)=v(4)=2$; in real words, schools $1$ and $3$ use the same lottery draw, and schools $2$ and $4$ use the same lottery draw, but one that’s distinct from the one used by schools $1$ and $3$. we’ll also keep the same cutoff vector: $\mathbf{c} = (c_1, c_2, c_3, c_4) =(0.3, 0.7, 0.6, 0.9)$. as a notation reminder, $r_i^v$ will be $i$’s random number for score $v$.
-
what’s the probability that $i$ gets assigned to school $1$?
by the same logic as in setting 2, it’s still $\boxed{p_1(\theta)=0.3}$.
-
what’s the probability that $i$ gets assigned to school $2$?
in this case, the logic that we used in setting 2 doesn’t hold. but, when calculating this probability, it’s exactly like in setting 1 — the two schools of interest use two different lotteries. again, we need to get rejected from school $1$ (which has probability $1-0.3 = 0.7$), and then we need to clear the cutoff at school 2 ($0.7$); this gives an overall probability of $\boxed{p_2(\theta) = (1-0.3)0.7 = 0.49}$ .
-
what’s the probability that $i$ gets assigned to school $3$?
now, we finally have to do something slightly different. let’s look at this in terms of the two score types. for score type 2, we know that we need to get rejected from school $2$; this means that we need $r_i^2 \in [0.7, 1]$. For score type 1, $i$ needs to get rejected from school $1$ and subsequently accepted to school $3$; for that, we need $r_i^1 \in[0.3, 0.6]$. These are independent draws, so we can just multiply the probability of each: the answer is $\boxed{p_3(\theta) = (1-0.7)(0.6-0.3) = 0.09}$.
-
what’s the probability that $i$ gets assigned to school $4$?
following a similar logic as the last bullet — we need score type 1 to be larger than $0.6$, and score type 2 to be in the range $[0.7, 0.9]$. this makes our overall probability $\boxed{p_4(\theta) = (1-0.6)(0.9-0.7) = 0.08}$.
as an alternative way to calculate this probability, we can find $p_4(\theta)$ using the framework introduced in the last two settings and equation $(\S)$, calculating
$p_{4}(\theta) = P(\text{rejected from all } s \in B_{\theta 4}) \times P(\text{accepted to $4$}\mid\text{rejected from all }s \in B_{\theta 4})$
looking at the first term: in order to get rejected from all schools preferred to school $4$, $i$ needs to miss the most lenient cutoff for each score type. for score type $1$, this is $0.6$, and for score type 2, $0.7$; in total, this happens with probability $(1-0.6)(1-0.7) = 0.12$.
conditional on having been rejected, what’s the probability of $i$’s admission to school $4$? from the earlier rejection from school $2$, we know that $i$ must have had $r_i^2 > 0.7$. thus, clearing the cutoff at school $4$ happens with probability $\frac{0.9-0.7}{1-0.7} = \frac23$.
taking the product of these two numbers gives $\boxed{p_{4}(\theta) = (1-0.6)(1-0.7)\times \frac23 = 0.08}$ — the same answer as before!
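if you’d rather check these four numbers by brute force, here’s a small monte carlo sketch. it assumes the student ranks the schools $1 \succ 2 \succ 3 \succ 4$ and is admitted to the first school whose cutoff their relevant draw clears; the function name and the number of draws are arbitrary choices of mine.

```python
import random


def simulate_setting_3(n_draws=200_000, seed=0):
    """Share of lottery draws that send the student to each school (0 = unassigned)."""
    rng = random.Random(seed)
    cutoffs = {1: 0.3, 2: 0.7, 3: 0.6, 4: 0.9}
    lottery_of = {1: 1, 2: 2, 3: 1, 4: 2}  # v(s): which draw each school uses
    counts = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0}
    for _ in range(n_draws):
        draw = {1: rng.random(), 2: rng.random()}  # two independent lottery numbers
        assigned = 0
        for s in (1, 2, 3, 4):  # walk down the preference list
            if draw[lottery_of[s]] <= cutoffs[s]:
                assigned = s
                break
        counts[assigned] += 1
    return {s: k / n_draws for s, k in counts.items()}


shares = simulate_setting_3()
print({s: round(p, 3) for s, p in shares.items()})
```

the simulated shares should land close to $0.3$, $0.49$, $0.09$, and $0.08$, with the remaining $0.04$ of draws leaving the student unassigned.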
and so with that, let’s now think about the general case. to calculate the odds of type $\theta$ being admitted to school $s$, we can again plug into our conditional formula: the probability of being rejected from all schools in $B_{\theta s}$ multiplied by the probability of getting accepted by $s$ (conditional on being rejected from all schools in $B_{\theta s}$). looking at the two terms:
-
so, what’s the probability of being rejected from all schools you like more than $s$? well, you need to miss the most lenient cutoff of all different score types. this makes us think that we need a separate $\mathrm{MID}$ for each score type — let’s call them \(\mathrm{MID}_{\theta s}^v\) for each $v$. formally: let \(B_{\theta s}^v\) be the set of schools that $\theta$ likes more than $s$ that admit students using score $v$. then,
\[\mathrm{MID}_{\theta s}^v = \max \left\{c_{s'} \mid s' \in B_{\theta s}^v \right\}\]in order to miss the cutoffs of each score type, since these are independent events, we can take the product of missing each one!
\(P(\text{rejected from all } s' \in B_{\theta s}) = \prod_v (1-\mathrm{MID}_{\theta s}^v)\).
-
what about the probability of being accepted to $s$ after these rejections? again, we need to have a score that meets the cutoff at $s$, but we know that we don’t have a score below $\mathrm{MID}_{\theta s}^{v(s)}$. that means that this conditional probability is given by
\[P(\text{accepted to $s$}\mid\text{rejected from all }s' \in B_{\theta s}) = \frac{c_s - \mathrm{MID}_{\theta s}^{v(s)}}{1-\mathrm{MID}_{\theta s}^{v(s)}}\]so long as this quantity is non-negative — otherwise, it is 0.
that means that our overall probability can be computed as
\[p_{s}(\theta) = \underbrace{\prod_v\left(1-\mathrm{MID}_{\theta s}^v\right)}_{P(\text{rejected from all } s' \in B_{\theta s})} \times \underbrace{\max\left( \frac{c_{s}-\mathrm{MID}_{\theta s}^{v(s)}}{1-\mathrm{MID}_{\theta s}^{v(s)}}, 0 \right)}_{P(\text{accepted to $s$}\mid\text{rejected from all }s' \in B_{\theta s})} \tag{3}\]which seems like a monster of a formula, but just boils down to multiplying the probability of earlier rejection and the probability of acceptance, conditional on earlier rejection.
this result is in an appendix of MDRD1, and is used implicitly in MDRD2. on another note, if each school has a different score, this formula results in the exact same formula that we derived in setting 1! (the first product includes all preferred schools and their cutoffs, and the second term becomes just $c_s$.)
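to make equation $(3)$ concrete, here’s a minimal sketch that computes it directly from the definitions. the function and argument names are hypothetical (this isn’t the code from the papers), and it recomputes every $\mathrm{MID}_{\theta s}^v$ from scratch for readability rather than speed.

```python
def mdrd_pscores(prefs, cutoffs, lottery_of):
    """Equation (3): propensity scores for one type with several lotteries.

    prefs       -- schools in preference order, most preferred first
    cutoffs     -- dict: school -> cutoff c_s in [0, 1] (lower draws are better)
    lottery_of  -- dict: school -> lottery / score type v(s)
    """
    lotteries = set(lottery_of.values())
    scores = {}
    for rank, s in enumerate(prefs):
        better = prefs[:rank]  # B_{theta s}: schools ranked above s
        # MID^v: most lenient cutoff among preferred schools using lottery v
        mid = {
            v: max((cutoffs[b] for b in better if lottery_of[b] == v), default=0.0)
            for v in lotteries
        }
        p_rejected = 1.0
        for v in lotteries:
            p_rejected *= 1.0 - mid[v]
        own = mid[lottery_of[s]]
        if own >= 1.0:
            p_accept = 0.0  # a preferred school on this lottery admits everyone
        else:
            p_accept = max((cutoffs[s] - own) / (1.0 - own), 0.0)
        scores[s] = p_rejected * p_accept
    return scores


cutoffs = {1: 0.3, 2: 0.7, 3: 0.6, 4: 0.9}
shared = mdrd_pscores([1, 2, 3, 4], cutoffs, {1: 1, 2: 2, 3: 1, 4: 2})
separate = mdrd_pscores([1, 2, 3, 4], cutoffs, {1: 1, 2: 2, 3: 3, 4: 4})
print({s: round(p, 4) for s, p in shared.items()})
print({s: round(p, 4) for s, p in separate.items()})
```

the first call reproduces the setting 3 example ($0.3$, $0.49$, $0.09$, $0.08$). the second gives every school its own lottery, in which case each score collapses to $c_s$ times the product of $(1 - c_{s'})$ over preferred schools.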
two quick asides
before wrapping things up, i wanted to draw attention to two separate things that will help in actually understanding/implementing MDRD.
properties of MDRD propensity scores
the nature of these propensity scores leads them to have some cool properties.
- for each type $\theta$, the sum of propensity scores across all schools is 1: $\sum_s p_s(\theta)=1$. with certainty, this student is assigned somewhere, though that somewhere could be $s_0$, corresponding to going to the outside option (often, being unassigned).16
- for each school $s$, the sum of propensity scores across students is at most its capacity: $\sum_i p_s(\theta_i) \le q_s$. we can prove this by laws of probability and contradiction. suppose that this sum was larger than $q_s$. note that the sum represents the expected number of students assigned to $s$. that implies that under some lottery draw, $s$ is assigned more than $q_s$ students, which deferred acceptance never allows.17
- as you go down a preference list, every $\mathrm{MID}_{\theta s}^v$ weakly increases. moreover, as you move down one spot in a preference list from $s$ to $s’$ , the only $\mathrm{MID}$ that will change is the one for $v(s)$, the score type used by $s$.
- because only one $\mathrm{MID}$ changes with each step down a preference list, we can write some fairly straightforward code. in particular, we can run an algorithm to compute every $p_s(\theta)$ in $O(n (SK+V))$ time. for each student (of which there are $n$), we need to initialize a $\mathrm{MID}$ for each score type at the first school (of which there are $V$ types), then loop through each school-priority group in the student’s preferences (of which there are at most $SK$) to compute the new $\mathrm{MID}$ and then compute the relevant p-score.18
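here’s a rough sketch of that one-pass idea for a single student, ignoring priority groups. it keeps a running value of $\prod_v (1-\mathrm{MID}_{\theta s}^v)$ and updates only the one $\mathrm{MID}$ that can change at each step. the names are mine, and the outside option `"s0"` (cutoff 1, on its own dummy lottery) is added only to check the sum-to-one property.

```python
def pscores_one_pass(prefs, cutoffs, lottery_of):
    """Equation (3) in a single walk down the preference list."""
    mid = {v: 0.0 for v in set(lottery_of.values())}  # O(V) setup
    p_rejected = 1.0  # running value of prod_v (1 - MID^v)
    scores = {}
    for s in prefs:  # O(1) work per ranked school
        v, c = lottery_of[s], cutoffs[s]
        if c > mid[v]:
            # c > MID^v implies MID^v < 1, so the divisions are safe
            scores[s] = p_rejected * (c - mid[v]) / (1.0 - mid[v])
            # swap the old (1 - MID^v) factor for the new one
            p_rejected *= (1.0 - c) / (1.0 - mid[v])
            mid[v] = c
        else:
            scores[s] = 0.0  # a preferred school on this lottery is more lenient
    return scores


cutoffs = {1: 0.3, 2: 0.7, 3: 0.6, 4: 0.9, "s0": 1.0}
lottery_of = {1: 1, 2: 2, 3: 1, 4: 2, "s0": "outside"}
scores = pscores_one_pass([1, 2, 3, 4, "s0"], cutoffs, lottery_of)
print({s: round(p, 4) for s, p in scores.items()})
print(round(sum(scores.values()), 10))
```

on the setting 3 example this matches the hand calculations, the outside option picks up the leftover $0.04$, and the scores sum to 1.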
modification: handling priority groups
in many school settings, we have priority groups; certain sets of students are all ranked above some other sets of students. this might happen due to neighborhood preference, sibling priority, and many many other reasons. also, some students might be ineligible.
if students are eligible for at most one priority group (which we’ll assume for now), we can represent this fairly easily. let student $i$ with some type $\theta$ have a priority at each school $s$ of \(\rho_{\theta s} \in \{0, 1, 2, \dots, K, \infty\}\). then, have schools admit students in order of increasing $\rho_{\theta_i s} + r_i$. (lower-numbered priority groups are more preferred, and priority group $\infty$ will be reserved for students ineligible at school $s$.) then, we just run deferred acceptance like normal! again, we can represent the outcome using cutoffs at each school; it’s just that $c_s$ now can be in the range $[0, K+1]$. (analogous to above, we’ll set $c_s = K+1$ anytime a school is under-demanded.) we’ll also now let types be the tuple consisting of preferences and priorities at each school: $\theta = (\succ, \rho)$, where $\rho= (\rho_1, \rho_2, \dots, \rho_S)$.
but then how can we turn this into a format that works for MDRD? there are two different ways (the first here is what’s presented in MDRD; the second is what i personally think is a more intuitive way of understanding, and works when students are eligible for multiple priority groups and there are priority-group specific capacities):
-
note that each school will have exactly one marginal priority group, which we’ll call \(\rho_s\in\{0, 1, 2, \dots, K\}\). students who have $\rho_{\theta s} < \rho_s$ never are assigned to any school they like less than $s$, and students with $\rho_{\theta s} > \rho_s$ never get assigned to $s$. given a cutoff $c_s$, this marginal priority group is simple to find: $\rho_s = \lfloor{c_s}\rfloor$. in essence, it’s the set of students that have direct risk at $s$. however, other students will also still have risk at $s$ — the students with $\rho_{\theta s} < \rho_s$; but their risk will be due to their risk of not being admitted to schools they like more than $s$.
to make this idea a little more formal, for each school $s$, we’ll split students into three groups:19
- $t_{is}=a$ if $\rho_{\theta s} < \rho_s$; here, $a$ stands for always assigned. this condition means that a student belongs to a priority group better than the marginal priority group. if, at any point during deferred acceptance, $i$ sends a proposal to $s$ (which only happens when they get rejected from every school they like better than $s$), then $i$ will always be accepted by $s$. they are never assigned to a school worse than $s$.
- $t_{is} = c$ if $\rho_{\theta s} = \rho_s$; here, $c$ stands for conditionally assigned. this condition means a student belongs to the marginal priority group. as a result, if $i$ sends a proposal to $s$, then this student isn’t always accepted at $s$, because their acceptance depends on their lottery draw.20
- $t_{is}=n$ if $\rho_{\theta s} > \rho_s$; here, $n$ stands for never assigned. this condition means that a student belongs to a priority group worse than the marginal priority group. if $i$ sends a proposal to $s$, they are never accepted by them because their priority group is too low.
we’ll also introduce the idea of lottery cutoffs: $\tau_s = c_s-\rho_s$; in essence, it’s the worst lottery score that got admitted for students in the marginal priority group.
to do the same types of calculations as we wanted to above, we’d then need to augment it slightly. trying to follow along the same logic:
-
$\mathrm{MID}_{\theta s}^v$ now should only be calculated using the set of schools that type $\theta$ likes more than $s$ and that use $v$ as a score, and only those where the student belongs to the marginal priority group. (otherwise, their acceptance or rejection gives us no information about $i$’s lottery draw.) also, we should calculate it using $\tau_s$, instead of $c_s$, so that we look at the lottery number directly. so, define $\mathrm{MID}$ as
\(\mathrm{MID}_{\theta s}^v = \begin{cases} 0 & \text{if } \rho_{\theta s'} > \rho_{s'} \text{ for all } s' \in B_{\theta s}^v, \text{ or } B_{\theta s}^v = \varnothing \\ 1 & \text{if } \rho_{\theta s'} < \rho_{s'} \text{ for some } s' \in B_{\theta s}^v \\ \max \left\{ \tau_{s'} \mid s' \in B_{\theta s}^v \text{ and } \rho_{\theta s'} = \rho_{s'} \right\} & \text{otherwise} \end{cases}\) this formula looks really complicated, but what each line is saying is:
- $\mathrm{MID}$ is $0$ if you got rejected by priority at every school that uses $v$ as a score and that you like more than $s$ (or if there’s no schools that use $v$ in that set); this is because those rejections don’t carry any score information.
- $\mathrm{MID}$ is $1$ if you were guaranteed an acceptance because of priority at some school that uses $v$ as a score and that you like more than $s$; because of that acceptance, you’ll never be assigned to $s$.
- otherwise, use our normal $\mathrm{MID}$ formula; however, you need to (a) think about the lottery cutoffs, and not the actual school cutoff, and (b) you can only count schools where you were in the marginal priority group (otherwise we haven’t learned anything about the lottery number).
-
when we’re actually calculating our propensity scores, we need to do something slightly different. the expression for $P(\text{rejected from all } s' \in B_{\theta s})$ will still be the same; that doesn’t change. however, we now need to think of some different cases for ${P(\text{accepted to $s$}\mid\text{rejected from all }s' \in B_{\theta s})}$.
- $t_{is} = a$: because $i$ has a better priority than the marginal one at $s$, this probability is just 1.
- $t_{is}=c$: our expression gets to be (mostly) the exact same as before! but again, to look at the lottery number directly, we’ll change out $c_s$ for $\tau_s$.
- $t_{is}=n$: now, there’s no chance of being admitted, so this probability is 0.
in total, this leaves us with:
\[p_s(\theta_i)= \begin{cases} \underbrace{\prod_v\left(1-\mathrm{MID}_{\theta s}^v\right)}_{P(\text{rejected from all } s' \in B_{\theta s})} \times \underbrace{0}_{P(\text{accepted to $s$}\mid\text{rejected from all }s' \in B_{\theta s})}=0 & \text{if }t_{is}=n \\ \underbrace{\prod_v\left(1-\mathrm{MID}_{\theta s}^v\right)}_{P(\text{rejected from all } s' \in B_{\theta s})} \times \underbrace{1}_{P(\text{accepted to $s$}\mid\text{rejected from all }s' \in B_{\theta s})}=\prod_v\left(1-\mathrm{MID}_{\theta s}^v\right) & \text{if }t_{is}=a \\ \underbrace{\prod_v\left(1-\mathrm{MID}_{\theta s}^v\right)}_{P(\text{rejected from all } s' \in B_{\theta s})} \times \underbrace{\max\left( \frac{\tau_{s}-\mathrm{MID}_{\theta s}^{v(s)}}{1-\mathrm{MID}_{\theta s}^{v(s)}}, 0 \right)}_{P(\text{accepted to $s$}\mid\text{rejected from all }s' \in B_{\theta s})} & \text{if }t_{is} = c\end{cases}\](this is written out in Theorem 1 in appendix A.9; the simpler version, where there is only one score type, is equation 2 in MDRD1.)
-
a different way to think about this is to create an augmented economy. this is a helpful proof technique in general with market design (e.g. — it’s used in Pathak + Sönmez (2008)). we’re going to do it via the following process:
- for each school, create a separate “pseudo-school”, one for each priority group. each will have a capacity equal to the number of students who are admitted to it within that priority group. we’ll let school $s$ be split into $s^0, s^1, \dots, s^K$ (we can stop at the last priority group that $s$ has, since the rest will have capacity 0 and won’t matter).
- for each student, augment their preferences as follows: if type $\theta$ has preferences of $s_1 \succ s_2 \succ \dots$, then they now have augmented preferences of $s_1^0 \succ s_1^1 \succ \dots \succ s_1^K \succ s_2^0 \succ s_2^1 \succ\dots \succ s_2^K \succ \dots$. from this preference list, restrict it to only include the priority groups that type $\theta$ is eligible for. (this is what allows us to include eligibility for multiple priority groups.)
effectively, we are turning our setting from “applications to schools” to “applications to school-priority groups”.
note that this will get to exactly the same matching as the original market. furthermore, we can directly apply the formula we found in setting 3! one important note: every pseudo-school $s^k$ with $k < \rho_s$ fills its capacity exactly without rejecting anyone, so we set its cutoff to $c_{s^k}=1$. similarly, every pseudo-school with $k>\rho_s$ has capacity 0, so $c_{s^k}=0$. the marginal pseudo-school $s^{\rho_s}$ keeps the lottery cutoff, $c_{s^{\rho_s}} = \tau_s$.
after we run MDRD on this augmented economy, we’ll be left with the likelihood that type $\theta$ gets admitted to every combination of school $s$ and priority group $\rho$. this isn’t quite what we wanted; remember, we want to know the likelihood that $\theta$ is admitted to school $s$.
to get that, we can just add back up all the p-scores to school-priority combos at school $s$! you can run through the logic and find that this in fact exactly matches up with the formula we created in method 1 just above: if you’re only eligible for one priority group, then only one school-priority group can have a non-zero propensity score, and it will exactly match the propensity score from above.
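for comparison, here’s a minimal sketch of the first method (the always / conditionally / never casework) for one student. everything about the interface is an assumption of mine: `priority[s]` holds $\rho_{\theta s}$ (use `math.inf` for ineligible), `cutoffs[s]` holds $c_s \in [0, K+1]$, and the two-school example is invented for illustration.

```python
import math


def pscores_with_priorities(prefs, cutoffs, lottery_of, priority):
    """Method 1: equation (3) with marginal priority groups and lottery cutoffs."""
    lotteries = set(lottery_of.values())
    scores = {}
    for rank, s in enumerate(prefs):
        mid = {}
        for v in lotteries:
            better_v = [b for b in prefs[:rank] if lottery_of[b] == v]
            if any(priority[b] < math.floor(cutoffs[b]) for b in better_v):
                mid[v] = 1.0  # guaranteed a seat at a preferred school
            else:
                # lottery cutoffs tau at preferred schools where the student is marginal
                taus = [cutoffs[b] - math.floor(cutoffs[b]) for b in better_v
                        if priority[b] == math.floor(cutoffs[b])]
                mid[v] = max(taus, default=0.0)
        p_rejected = math.prod(1.0 - m for m in mid.values())
        rho_s = math.floor(cutoffs[s])  # marginal priority group at s
        tau_s = cutoffs[s] - rho_s      # lottery cutoff at s
        own = mid[lottery_of[s]]
        if priority[s] < rho_s:         # t = a: always accepted if they apply
            p_accept = 1.0
        elif priority[s] > rho_s:       # t = n: never accepted
            p_accept = 0.0
        elif own >= 1.0:
            p_accept = 0.0
        else:                           # t = c: depends on the lottery draw
            p_accept = max((tau_s - own) / (1.0 - own), 0.0)
        scores[s] = p_rejected * p_accept
    return scores


# invented example: one shared lottery; A's marginal group is 1 (tau = 0.4), B's is 0 (tau = 0.8)
cutoffs = {"A": 1.4, "B": 0.8}
lottery = {"A": "L", "B": "L"}
marginal_at_A = pscores_with_priorities(["A", "B"], cutoffs, lottery, {"A": 1, "B": 0})
always_at_A = pscores_with_priorities(["A", "B"], cutoffs, lottery, {"A": 0, "B": 0})
never_at_A = pscores_with_priorities(["A", "B"], cutoffs, lottery, {"A": 2, "B": 0})
print(marginal_at_A, always_at_A, never_at_A)
```

in this toy example the marginal student has roughly a $0.4$ chance at each school, the high-priority student is always assigned to A, and the low-priority student can only get into B (with probability $0.8$).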
in practice, we usually use the second for a few reasons:
- it turns out there’s a lot of districts that allow for eligibility at multiple priority groups, so we are forced to use method 2.
- it’s much easier to code — the first method relies on a lot of different if statements and casework, whereas the second applies the same logic to every pseudo-school.21
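to give a sense of why the second method is less fiddly, here’s a hedged sketch of it. this is not my actual (Stata) implementation; the names are hypothetical, and `split_cutoff` covers only the simple case where one school cutoff $c_s$ pins down every pseudo-school’s cutoff. with priority-specific capacities you’d pass in each pseudo-school’s cutoff directly.

```python
import math


def split_cutoff(c_s, n_groups):
    """Pseudo-school cutoffs for s^0, ..., s^(n_groups-1) implied by one cutoff c_s."""
    rho_s = math.floor(c_s)
    return [1.0 if k < rho_s else (c_s - rho_s if k == rho_s else 0.0)
            for k in range(n_groups)]


def pscores_augmented(prefs, pseudo_cutoffs, lottery_of, eligible):
    """Method 2: run equation (3) over pseudo-schools, then add up by school.

    pseudo_cutoffs[s][k] -- cutoff in [0, 1] of pseudo-school s^k
    eligible[s]          -- set of priority groups the student can use at s
    """
    mid = {v: 0.0 for v in set(lottery_of.values())}
    p_rejected = 1.0  # running value of prod_v (1 - MID^v)
    scores = {s: 0.0 for s in prefs}
    for s in prefs:
        for k in sorted(eligible[s]):  # augmented list: s^0, s^1, ... within s
            v, c = lottery_of[s], pseudo_cutoffs[s][k]
            if c > mid[v]:  # same update for every pseudo-school, no casework
                scores[s] += p_rejected * (c - mid[v]) / (1.0 - mid[v])
                p_rejected *= (1.0 - c) / (1.0 - mid[v])
                mid[v] = c
    return scores


# invented example: school cutoffs 1.4 at A and 0.8 at B, three priority groups, one lottery
pseudo = {"A": split_cutoff(1.4, 3), "B": split_cutoff(0.8, 3)}
lottery = {"A": "L", "B": "L"}
marginal_at_A = pscores_augmented(["A", "B"], pseudo, lottery, {"A": {1}, "B": {0}})
always_at_A = pscores_augmented(["A", "B"], pseudo, lottery, {"A": {0}, "B": {0}})
never_at_A = pscores_augmented(["A", "B"], pseudo, lottery, {"A": {2}, "B": {0}})
print(marginal_at_A, always_at_A, never_at_A)
```

the loop body is identical for every pseudo-school, which is the whole appeal. an ineligible student just has an empty set in `eligible[s]`.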
bringing it home
recap + taking theory to data
let’s remember what we were trying to do here, a task we set out to do up above:
- to figure out the causal effect of attending school $s$, we want to run a regression where we instrument attendance with admission.
- to make this a regression that works, we want to control for propensity scores $p_s(\theta)$— the likelihood that a student of a given type $\theta$ is admitted to school $s$.
- over the last several sections, we’ve developed ways of computing this quantity in infinite-sized markets.
- but to actually use this, we need a way of going from this infinite market formula to one that works for the data we actually observe.
we do this in a very natural way — we say that the infinite-market odds are a good approximation for the finite-market odds, and use the formulas derived above to calculate propensity scores. now, there’s actually two different ways of doing this suggested by MDRD:
-
the formula calculation. use the observed cutoffs and marginal priority groups, plug them into the above formula, and say that those are the propensity scores. essentially, just say the infinite-market odds are exactly the same as the finite-market odds, and that the observed cutoffs + marginal priority are the same as in the infinite market. in essence, we just plug everything into equation $(3)$.
-
the frequency calculation.22 recall that equation $(3)$ states
\[p_{s}(\theta) = \underbrace{\prod_v\left(1-\mathrm{MID}_{\theta s}^v\right)}_{P(\text{rejected from all } s' \in B_{\theta s})} \times \underbrace{\max\left( \frac{c_{s}-\mathrm{MID}_{\theta s}^{v(s)}}{1-\mathrm{MID}_{\theta s}^{v(s)}}, 0 \right)}_{P(\text{accepted to $s$}\mid\text{rejected from all }s' \in B_{\theta s})} \tag{3}\]because of this formulation, if two students have the exact same set of $\mathrm{MID}$s at school $s$, then they should always have the exact same propensity score.
because of that, a different way of thinking about calculating these propensity scores through estimation is the following:
- use observed cutoffs to calculate $\mathrm{MID}$s, just as in the formula calculation.
- to calculate $p_s(\theta)$ for student $i$ of type $\theta$, take the set of all students who have the same set of $\mathrm{MID}$s as $i$. take the overall offer rate of those students, and call that $p_s(\theta)$!23
note in some sense, this is perhaps more “true” to the observed propensity scores, because it actually uses the set of all students’ offers, and not just the marginal students’ offers.
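the frequency calculation is short enough to sketch directly. here i assume you’ve already summarized each student’s $\mathrm{MID}$s at school $s$ into some hashable “cell” (a tuple, say), and that you observe whether each student was offered $s$. what exactly belongs in the cell is a modeling choice this sketch doesn’t settle.

```python
from collections import defaultdict


def frequency_pscores(cells, offered):
    """Offer rate at school s among students sharing the same cell.

    cells[i]   -- hashable summary of student i's MIDs at s
    offered[i] -- True/False (or 1/0): was student i offered a seat at s?
    """
    totals = defaultdict(int)
    offers = defaultdict(int)
    for cell, d in zip(cells, offered):
        totals[cell] += 1
        offers[cell] += int(d)
    # every student inherits the offer rate of their cell
    return [offers[cell] / totals[cell] for cell in cells]


# five hypothetical students in two cells
cells = [(0.0,), (0.0,), (0.3,), (0.3,), (0.3,)]
offered = [1, 0, 1, 1, 0]
print(frequency_pscores(cells, offered))
```

students in the first cell get a score of $\frac12$ and students in the second get $\frac23$, regardless of their own offer.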
in summary, the procedure to run MDRD is
- start with student preferences, school capacities and priorities, and students’ random number draws.
- from this information, calculate the cutoff at each school.
- use either the formula or frequency calculation to calculate propensity scores for each student at each school.
- estimate equation $(*)$ instrumenting using equation $(\dagger)$.
and you’ll get out estimates of causal school effects!
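the first two steps of that procedure (run the match, read off the cutoffs) can be sketched as follows. this is a toy student-proposing deferred acceptance with a single lottery and no priority groups; the function name and example market are hypothetical, and a school that never rejects anyone gets a cutoff of 1.

```python
def deferred_acceptance(prefs, capacity, draw):
    """Student-proposing DA where every school ranks by one lottery (lower is better).

    prefs[i]    -- list of schools in student i's preference order
    capacity[s] -- number of seats at school s
    draw[i]     -- student i's lottery number in [0, 1]
    Returns (assignment, cutoffs).
    """
    next_choice = {i: 0 for i in prefs}
    held = {s: [] for s in capacity}
    rejected_anyone = {s: False for s in capacity}
    unmatched = list(prefs)
    while unmatched:
        i = unmatched.pop()
        if next_choice[i] >= len(prefs[i]):
            continue  # list exhausted: i takes the outside option
        s = prefs[i][next_choice[i]]
        next_choice[i] += 1
        held[s].append(i)
        if len(held[s]) > capacity[s]:
            worst = max(held[s], key=lambda j: draw[j])
            held[s].remove(worst)
            rejected_anyone[s] = True
            unmatched.append(worst)
    assignment = {i: None for i in prefs}
    for s, students in held.items():
        for i in students:
            assignment[i] = s
    # cutoff: worst admitted draw if the school turned someone away, else 1
    cutoffs = {
        s: max((draw[i] for i in held[s]), default=0.0) if rejected_anyone[s] else 1.0
        for s in capacity
    }
    return assignment, cutoffs


prefs = {i: ["A", "B"] for i in ("i1", "i2", "i3", "i4")}
draw = {"i1": 0.2, "i2": 0.5, "i3": 0.7, "i4": 0.9}
assignment, cutoffs = deferred_acceptance(prefs, {"A": 1, "B": 2}, draw)
print(assignment, cutoffs)
```

here the cutoffs come out as $c_A = 0.2$ and $c_B = 0.7$, which you would then feed into the formula or frequency calculation.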
note that we can also use this procedure if we wanted to look at a class of schools. for example, MDRD1 analyzes the charter sector in denver. to do so, they add up the propensity score at each charter school into a singular “charter propensity score”.
to help confirm our faith in these propensity scores, we can perform a balance test. in essence — if these propensity scores are actually good, they should be unrelated to all other controls in our regression, or really any data that came from before assignment. this means that for any control $W_i$ (like gender, race, pre-existing test scores, how many schools they applied to, …), we can estimate the following equation:
\[W_i = \zeta_s D_i(s) + \sum_x \alpha_0(x)d_i(x) + \varepsilon_i \tag{$\ddagger$}\]we should expect $\zeta_s = 0$; intuitively, these propensity scores should mean that we don’t observe any other selection bias.
note also that we can include these controls into the regressions $(\ast)$ and ($\dagger$). we might ask — what’s the point, given that propensity scores absorb all selection bias? well, doing so can increase the precision of our estimates. if we throw in demographic controls and pre-existing test scores, we will reduce the standard errors in our regression, helping us to more precisely identify the causal effects.
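to see what a balance test buys you, here’s a simulated sketch of equation $(\ddagger)$. i’m assuming $D_i(s)$ is the offer indicator and the $d_i(x)$ are dummies for each propensity-score value. the data-generating process is invented: a baseline covariate $W$ that depends on the p-score but not on the offer itself. the coefficient uses within-cell demeaning, which by the Frisch-Waugh-Lovell theorem equals OLS with a full set of p-score dummies.

```python
import random
from collections import defaultdict


def balance_coefficient(W, D, pscore):
    """OLS coefficient on D from regressing W on D plus a dummy per p-score value."""
    cells = defaultdict(list)
    for j, x in enumerate(pscore):
        cells[x].append(j)
    num = den = 0.0
    for members in cells.values():
        d_bar = sum(D[j] for j in members) / len(members)
        w_bar = sum(W[j] for j in members) / len(members)
        for j in members:  # demean within each propensity-score cell
            num += (D[j] - d_bar) * (W[j] - w_bar)
            den += (D[j] - d_bar) ** 2
    return num / den


def simulate(n=30_000, seed=0):
    """Invented data: offers are random given the p-score; W depends only on the p-score."""
    rng = random.Random(seed)
    pscore = [rng.choice([0.2, 0.5, 0.8]) for _ in range(n)]
    D = [1 if rng.random() < x else 0 for x in pscore]
    W = [2.0 * x + rng.gauss(0.0, 1.0) for x in pscore]
    return W, D, pscore


W, D, pscore = simulate()
zeta = balance_coefficient(W, D, pscore)
offered = [w for w, d in zip(W, D) if d == 1]
not_offered = [w for w, d in zip(W, D) if d == 0]
naive_gap = sum(offered) / len(offered) - sum(not_offered) / len(not_offered)
print(round(zeta, 3), round(naive_gap, 3))
```

the raw offered-vs-not gap in $W$ is large here, because students with high p-scores are more likely to be offered, but the coefficient with p-score controls should sit near zero.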
why does the mdrd approximation work
at this point, we might ask a natural question: how do we know that the infinite-market approximation of propensity scores (i.e., equation $(\S)$) is good enough for a finite market? inspired by WIRED’s “5 levels” video series (my personal favorite is jacob collier’s), i’m going to answer this question a few times, each with a slightly more detailed explanation. i’d encourage you to read at least the first of these arguments, to get a sense of why this method works. but as the explanations get more complex, feel free to just skip to the next section.
explanation, level 1
we’re going to think about three different quantities
- in a real market of size $n$ (i.e., $n$ students), the true propensity scores (i.e., odds of assignment)
- using the observed cutoffs of real market of size $n$, what the MDRD formula $(\S)$ would claim the propensity scores are
- in an “infinite market” that’s just copies of the size-$n$ market, what the true propensity scores are (which also is the propensity score given by the MDRD formula $(\S)$).
as you let $n$ grow two things happen:
- the quantity in (2) approaches the quantity in (3)
- the quantity in (1) approaches the quantity in (3)
that means that when $n$ goes to infinity, all three need to be equal. so essentially, in a big enough market, using the MDRD estimates with observed cutoffs (i.e., (2)) will approach the real odds (1). because this happens when $n$ gets big, that means so long as we have enough students applying, the MDRD formula will be a good enough approximation for the true odds, since (2) will be close to (3) and (3) will be close to (1).
explanation, level 2
let’s give those three quantities names.
note: it’s for this section (explaining the approximation) that i depart from the notation used above, and return to the notation that is used in MDRD1. really sorry for the confusion, but i hope that it’s made what happened above just a tad clearer.
- let $p_{ns}(\theta)$ be the true propensity score for type $\theta$ attending school $s$ in a market of size $n$.
- let $\widehat{p}_{ns}(\theta)$ be the estimated propensity score for type $\theta$ attending school $s$ in a market of size $n$ using the estimated cutoffs and the MDRD formula $(\S)$
- let $\varphi_s(\theta)$ be the true propensity score for type $\theta$ attending school $s$ in a market of infinite size. (above, we called this propensity score $p_s(\theta)$.)
what we’re going to do is show two things:
- As $n \to \infty$, \(\widehat{p}_{ns}(\theta) \to \varphi_s(\theta)\) 100% of the time. this is going to be true because as $n \to \infty$, the observed cutoffs in the finite market will become the cutoffs in the infinite market. because \(\widehat{p}_{ns}(\theta)\) is just a function of observed cutoffs, that means it should go to the propensity scores of the infinite market, or \(\varphi_s(\theta)\).
- As $n \to \infty$, \({p}_{ns}(\theta) \to \varphi_s(\theta)\) 100% of the time. this has a similar, but distinct, argument. note that even in finite economies, the cutoff representation of DA holds; however, these cutoffs might change depending on the exact lottery draws. if we try out a lot of different lottery draws when there are $n$ people, the true propensity score of type $\theta$ at $s$ is the share of draws in which $\theta$ is rejected from every school they prefer to $s$ ($B_{\theta s}$) but clears $s$’s cutoff. these cutoffs might be different in each lottery draw, but if we let $n$ be big enough, then these cutoffs will get arbitrarily close to the infinite-market cutoffs, and so we can get as close to the infinite-market propensity scores as we want.
because these two sequences converge to the same limit, they also must converge to each other.
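a tiny simulation makes the convergence story tangible. the setup is my own simplification: one school, $n$ identical applicants, seats for half of them, and a single uniform lottery. by symmetry the true propensity score is exactly $\frac12$, while the formula’s estimate is the observed cutoff (the worst admitted draw).

```python
import random


def mean_gap(n, n_markets=1000, seed=0):
    """Average |formula p-score - true p-score| across simulated markets of size n."""
    rng = random.Random(seed)
    seats = n // 2
    true_pscore = seats / n  # by symmetry, every applicant has the same odds
    total = 0.0
    for _ in range(n_markets):
        draws = sorted(rng.random() for _ in range(n))
        cutoff = draws[seats - 1]  # worst admitted draw = formula's p-score
        total += abs(cutoff - true_pscore)
    return total / n_markets


gaps = {n: mean_gap(n) for n in (10, 100, 1000)}
print({n: round(g, 4) for n, g in gaps.items()})
```

the average gap between the formula and the truth shrinks steadily as the market grows, which is the practical content of the two convergence claims.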
explanation, level 3
let’s be a bit more precise about what we mean by markets getting “big”. suppose that types are drawn from some distribution $F$. consider a sequence of markets where types are distributed according to $f_n$ (market $n$ having $n$ students). we’ll make this sequence deterministic, deciding in advance what each market of size $n$ will look like. we’ll construct this sequence of markets so that as $n \to \infty$, we have that $f_n \to F$.
we’ve also been playing a bit loose with what an “infinite” market means. if we want to be technical, what i’ve been calling an infinite market is really a continuum market, where students are infinitesimally small (each one has zero mass), have scores distributed exactly as $U[0, 1]$, and come from the type distribution $F$. but that’s just a more technical note to help if you read the paper.
our two lemmas we want to prove, more precisely, are:
- As $n \to \infty$, \(\widehat{p}_{ns} (\theta) \xrightarrow{\text{a.s.}} \varphi_s(\theta)\), i.e., converges almost surely. the argument here first involves showing that the cutoff vector in the finite market approaches the infinite-market cutoffs, i.e., \(\widehat{\mathbf{c}}_{n} \xrightarrow{\text{a.s.}} \mathbf{c}\). the argument here is a little complicated, but it revolves around showing that the limit of \(\widehat{\mathbf{c}}_{ns}\) can’t be larger than \(\mathbf{c}_s\) because you can arrive at a contradiction with DA. we use an extra assumption here that we didn’t mention earlier, which is that for each school and priority group, someone ranks it first. then, because this holds, and the MDRD function (equation $(\S)$) is continuous in its arguments (aka, cutoffs), the extended continuous mapping theorem says that if we apply the MDRD function to both sides, the convergence will still hold. that gets us to the desired result.
- As $n \to \infty$, \({p}_{ns}(\theta) \xrightarrow{\text{a.s.}} \varphi_s(\theta)\). here, we show we can find a sufficiently large $N$ where when the market has size $n > N$, with probability 1, the drawn cutoffs will be within $\varepsilon$ of the infinite-market cutoffs. because of that, with probability 1, the true odds can also be made arbitrarily close to the infinite-market odds. that gives us the desired convergence.
because of these results, for each school $s$, we have that
\[\left| \widehat{p}_{ns}(\theta) - {p}_{ns}(\theta) \right| \xrightarrow{\text{a.s.}} \left| {\varphi}_{s}(\theta) - {\varphi}_{s}(\theta) \right| = 0\] meaning that the estimated and true propensity scores converge to each other almost surely. if we also assume finitely many schools and types, this also gives us uniform convergence.
explanation 3 will be my last one — if you want more details, refer to lemma 3 + lemma 4 in MDRD1, which are proved in appendix A.5.
shortcomings of mdrd
now — i also want to draw attention to a specific shortcoming of mdrd as it stands at the end of MDRD1+2.
the methodology that we figured out is a large-market approximation. and so it might make sense to think about how it performs when the number of students is small — the cutoffs might not be accurate, and the MDRD formulas won’t hold exactly. notably, MDRD1 and MDRD2 both do evaluation at the level of sets of schools, rather than individual schools, in part because of this finite sample problem.
as a simple example: consider one school with a single seat and two applicants, $i_1$ and $i_2$, who each draw a random number $r_i \sim U[0, 1]$ to determine who gets admitted. clearly, each student has a $\frac12$ probability of admission. the expected cutoff would be the expectation of the lower number; using results from order statistics, we know that this is $\frac13$. so applying the MDRD formula $(\S)$ with the observed cutoff, we’d say on average that each student has a propensity score of $\frac13$. clearly, something’s up here.
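to see where the $\frac13$ comes from, and how fast the gap closes, here’s the order-statistics calculation written out. the general setup (one school, $k$ seats, $n$ applicants with i.i.d. uniform draws) is my own extension of the two-applicant example, not something taken from MDRD1.

\[E[\min(r_1, r_2)] = \int_0^1 P(r_1 > t)\,P(r_2 > t)\,dt = \int_0^1 (1-t)^2\,dt = \frac13\]

more generally, the cutoff is the $k$-th smallest of $n$ uniform draws, which has expectation $\frac{k}{n+1}$, while the true odds of admission are $\frac{k}{n}$ by symmetry:

\[\frac{k}{n} - \frac{k}{n+1} = \frac{k}{n(n+1)}\]

so with a fixed share of seats $k/n$, the average gap shrinks at rate $1/(n+1)$: it is $\frac16$ in the two-applicant example, and already below $0.005$ with 50 seats and 100 applicants.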
however, the fact that propensity scores are off isn’t the worst thing ever. we’ll still control for some OVB, and aggregating across schools limits the magnitude of this problem. in fact, proposition 2 (p. 1398) outlines how the lottery numbers will still be “good enough” if lottery numbers are close to uniform conditional on cutoffs. some of this is also solved in the RC VAM paper (“Credible School Value-Added with Undersubscribed School Lotteries”, Angrist et al. 2024), which shows ways that we can use these MDRD-generated propensity scores to generate school-level value-added measures anyways. but we still can run into these issues, and MDRD might be a long ways off from being fully efficient.
there’s one other shortcoming that i’ll discuss, but that needs to wait until the MDRD2 post!
conclusion
and… we’re done! ish. about 13,000 words later, we’ve gone through MDRD1, the first in a two-part series about doing causal estimation using centralized school choice systems. for systems using deferred acceptance and where each school uses a lottery to admit students, we now know how to find each student’s propensity score at each school. using those propensity scores, we can run a series of regressions to find the causal effect of attendance at each school (or, a whole school sector).
MDRD2 exists to fill in a big gap that MDRD1 has: the ability to handle schools that don’t have lotteries, but still participate in the centralized mechanism. these admissions could be based on test scores, on interviews, on auditions and portfolios. tons of these schools exist in the world, and we have absolutely no way of dealing with them … for now! using the powers of regression discontinuity and all of the things we’ve learned, we’re not too far off from being able to do this too.
but with MDRD1 now explained, it’s time for this blog post to end. stay tuned for the future — i’m hoping (read: making a public promise) to write posts on:
- MDRD 2 (Breaking Ties: Regression Discontinuity Design Meets Market Design (Abdulkadiroğlu, Angrist, Narita, and Pathak, 2022))
- how does deferred acceptance work, and explaining some of its really cool properties (stability, side-optimality, opposing interests, strategy-proofness, rural hospitals)
- also maybe i’ll come back and add in some pictures for this post? sometimes i find they can really help with the explanation
- preference structural estimation things?
we’ll see how well these promises can be kept. i’m hopeful :)
-
i’m hoping to make a “deferred acceptance” mega-post one day, explaining how it works and its various properties, like stability, rural hospitals, side-optimality, opposing interests, etc. ↩
-
don’t feel obligated to memorize all of this notation now; the point of introducing (almost all of) the notation here is so that it can be a reference as you read the post. ↩
-
i.e., students never are indifferent between two schools ↩
-
students have ranks over every school. but, knowing that $s_0$ is the outside option with infinite capacity, a student will never be assigned to a school they like less than $s_0$ — so we can just ignore preferences past that point. ↩
-
we’ll allow for these non-lotto tiebreakers, also known as screens, when we get to MDRD2! ↩
-
note that we don’t need to know school capacities! those are incorporated into the cutoffs themselves ↩
-
note this applies to both schools with excess capacity and schools that fill their capacity exactly without ever denying admission to anyone; the latter group is harder to see in the data at a glance. the easiest way to check is to look at the set of students who ranked school $s$; if all of them are assigned to either $s$ or a school they like better than $s$, then $c_s = 1$. ↩
-
not casual! ↩
-
which i only namedrop if you want more information about it. also, Rosenbaum + Rubin (1983)! ↩
-
to be clear, we will estimate this equation one school at a time. ↩
-
a propensity score is a fancy way of saying “what’s the likelihood that person $i$ was assigned to some ‘treatment’” (in this case, attending school $s$). ↩
-
doing so means that for some fixed propensity score $x$, the coefficient $a_2(x)$ will be the mean of outcomes of students with propensity score $x$ who didn’t get assigned to $s$. then, $\beta_s$ is taking a weighted average of the effect for each of the different propensity scores we’re bucketing together. ↩
-
this is the LATE (local average treatment effect) interpretation of instrumental variables, introduced by Angrist (of Blueprint Labs!) and Imbens in 1991 (see here). ↩
-
for more formality on this, you can refer to MDRD1 or take a look at Azevedo + Leshno (2016) or the many other papers that look at large-market approximations. ↩
-
at least, the third line of it, and with $\tau$ for $c$ — we’ll get to the differences when we talk about priorities. ↩
-
we didn’t actually need to include an $s_0$ in any of the calculations above; it’s just convenient to have for describing DA as cutoffs. but we could run the whole logic of the sections without $s_0$, in which case a student’s propensity scores will not sum to 1. ↩
-
for those of you math-y folks out there, this proof is in the same vein as the probabilistic method. ↩
-
my coding up of MDRD actually runs in $O(n(VSK))$ right now, largely because quicker is difficult in Stata and it’s much easier to check intermediate calculations this way. usually, $V$ is quite small, especially in systems with a common lottery draw. we don’t usually care about runtime, but my inner algorithms nerd definitely feels bad for not over-engineering it. alas, i’m an economist and not a SWE, and so we don’t care about runtime. usually. it gets a little bad when every school uses a different lottery number, but that’s for future code refactoring. ↩
-
in MDRD1, these types are given as sets $\Theta_s^a, \Theta_s^c, \Theta_s^n$; i’m following the MDRD2 notation here. ↩
-
the phrasing “isn’t always accepted” is important here; i didn’t use “is assigned with probability strictly between $0$ and $1$” for a reason: even if $t_{is} = c$, $i$ might never be accepted by $s$. this can happen for two reasons. (a) at some school $s'$ that $i$ likes more than $s$ (i.e., $s' \succ_i s$), $t_{is'} = a$ (i.e., they’re better than the marginal priority at $s'$). if this is the case, then the worst that $i$ can do is be assigned to $s'$, since $s'$ would never reject them. that means they never get assigned to $s$. (b) at some school $s'$ that $i$ likes more than $s$ (i.e., $s' \succ_i s$), $t_{is'} = c$ (i.e., they’re in the marginal priority at $s'$), $v(s) = v(s')$ (i.e., both schools use the same score), and $c_{s'} > c_s$ (i.e., $s'$ has a more lenient cutoff). if this is the case, then if $i$ gets rejected from $s'$, we know that their random draw $r_i^{v(s)}$ must be worse than $c_{s'}$, and so $i$ can never meet the cutoff $c_s$. an alternative way to say this is that under this circumstance, we know that $\mathrm{MID}_{\theta s}^{v(s)} > c_s$. ↩
-
in fact, the second method is what i use to code up mdrd. ↩
-
i’m actually going to explain this in a slightly different way than MDRD1 does. refer to p. 1392 and the table at the top of p. 1393 for the original explanation. ↩
-
if we have priority groups, and we’re using way #1 to deal with them (where we have the $a$, $c$, and $n$ assignment groups), then we shouldn’t compare all students who have the same set of $\mathrm{MID}$s; we should only compare students with the same $\mathrm{MID}$s who also belong to the same assignment group (either $a$, $c$, or $n$). we get around this issue using way #2 of dealing with priority groups because we don’t use these $a$/$c$/$n$ designations :) ↩
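a small aside on the footnote about spotting schools with $c_s = 1$: the check described there (every student who ranked $s$ ends up at $s$ or somewhere they like better) is easy to sketch in code. this is a minimal Python sketch, not how my actual code works; the function name `never_rejects`, the dict-based data layout, and the toy students are all made up for illustration.

```python
# hypothetical data layout (an assumption, not from the post):
#   prefs[student]      -> list of ranked schools, most preferred first
#   assignment[student] -> the school the student was finally assigned to

def never_rejects(school, prefs, assignment):
    """Return True if no student who ranked `school` ended up somewhere worse.

    If this holds, the school never turned anyone away, so its cutoff
    can be treated as c_s = 1.
    """
    for student, ranking in prefs.items():
        if school not in ranking:
            continue  # this student never applied to `school`
        # schools the student likes at least as much as `school`
        weakly_preferred = ranking[: ranking.index(school) + 1]
        if assignment[student] not in weakly_preferred:
            return False  # assigned below `school`, so `school` rejected them
    return True


# toy example: cy ranked s1 first but landed at s2, so s1 rejected someone
prefs = {
    "ana": ["s1", "s2"],
    "bo": ["s2", "s1"],
    "cy": ["s1", "s2"],
}
assignment = {"ana": "s1", "bo": "s2", "cy": "s2"}

print(never_rejects("s1", prefs, assignment))  # False
print(never_rejects("s2", prefs, assignment))  # True
```

a student assigned to the outside option (or to anything not above `school` in their ranking) counts as a rejection here, which matches the idea that such a school's cutoff must be below 1.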