Inference

A gap is only evidence if it would be unusual to find one this large by chance. The placebo test asks exactly that.

At a Glance

The results phase produced two estimated effects: a large post-1993 gap for Black men and a smaller but real one for white men. An estimate is not yet a result. The question this phase answers is whether gaps of those sizes are unusual, or whether the method would manufacture a gap that large for any state if you asked it to.

Synthetic control answers this with a placebo permutation. The procedure refits the entire method pretending each donor state in turn is the treated unit, reassigning the 1993 treatment to a state that never underwent the expansion. Each placebo state gets its own synthetic control and its own post-1993 gap. The collection of placebo gaps is the distribution of effects the method produces under no real treatment, and the real Texas estimate is meaningful only if it sits at the extreme of that distribution.
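
The permutation loop can be sketched in a few lines of Python. This is an illustration under loud assumptions, not the analysis code, which is fit in R with tidysynth: the names `placebo_gaps` and `donor_mean_fit` are hypothetical, and the equal-weighted donor average is only a stand-in for a real synthetic control fit. The sketch also assumes every other unit, including the real treated one, serves as a donor for each placebo, which is a design choice the text does not settle.

```python
from typing import Callable, Dict, List

Series = List[float]
Panel = Dict[str, Series]  # unit name -> outcome by year, all the same length


def donor_mean_fit(target: Series, donors: Panel) -> Series:
    """Toy stand-in for a synthetic control fit: the equal-weighted donor average.

    A real fit would choose non-negative weights that match the target's
    pre-period; this placeholder only keeps the sketch self-contained.
    """
    n_years = len(target)
    return [sum(d[t] for d in donors.values()) / len(donors) for t in range(n_years)]


def placebo_gaps(panel: Panel,
                 fit: Callable[[Series, Panel], Series] = donor_mean_fit) -> Dict[str, Series]:
    """Treat every unit in turn as the treated one and return its gap series."""
    gaps = {}
    for unit, observed in panel.items():
        # Assumption: all remaining units form the donor pool for this unit.
        donors = {name: series for name, series in panel.items() if name != unit}
        synthetic = fit(observed, donors)
        gaps[unit] = [y - s for y, s in zip(observed, synthetic)]
    return gaps
```

Passing a real fitting routine as `fit` would turn the same loop into the procedure described above: one gap series for Texas and one for each placebo state.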

The Placebo Permutation

The figure below shows the Black-male prisoner gap for Texas in heavy color against the same gap computed for every donor state in grey, each treated as a placebo. If the Texas line is buried in the grey, the estimated effect is indistinguishable from the gaps the method produces by chance. If it stands clear of the grey, the effect is unusual.

Placebo Inference: Texas Against Every Donor State

Each grey line reassigns the 1993 treatment to a donor state, on the raw prisoner gap used for the permutation test. Texas posts the largest post-to-pre MSPE ratio of all 49 units (rank 1).

The Texas line sits at the edge of the placebo distribution after 1993, which is the visual form of the test. The numeric form is the mean squared prediction error ratio. For any unit, define the pre- and post-treatment mean squared prediction error as the average squared gap in each window:

\[ \text{MSPE}_{\text{pre}} = \frac{1}{T_{\text{pre}}} \sum_{t \le 1993} \big( Y_t - \hat{Y}_t \big)^2, \qquad \text{MSPE}_{\text{post}} = \frac{1}{T_{\text{post}}} \sum_{t > 1993} \big( Y_t - \hat{Y}_t \big)^2 \]

where:

MSPE is mean squared prediction error: the average squared gap between a unit and its synthetic control
\(\text{MSPE}_{\text{pre}}\) measures fit before 1993; \(\text{MSPE}_{\text{post}}\) measures it after
\(Y_t\) is the observed count in year \(t\) and \(\hat{Y}_t\) is the synthetic count
\(T_{\text{pre}}\) and \(T_{\text{post}}\) are the number of years in each window, so each term is an average

The test statistic is the ratio of the two:

\[ r = \dfrac{\text{MSPE}_{\text{post}}}{\text{MSPE}_{\text{pre}}} \]

where:

\(r\) is the test statistic: how much worse a unit fits after 1993 than before
A large \(r\) means a tight pre-period fit followed by a big post-period departure, the signature of a real effect
Dividing by the pre-period error is what keeps the statistic fair: a unit that never fit well cannot post a high ratio just by being noisy

Ranking every unit by \(r\) then places the real treated state within the placebo distribution.
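
A minimal Python sketch of the two MSPE definitions and their ratio follows. The function names and the toy series are assumptions made for illustration; none of the numbers are the Texas data. The two toy units share the same post-period departure and differ only in pre-period fit, which shows how the denominator penalizes a loose match.

```python
from typing import Sequence


def mspe(gaps: Sequence[float]) -> float:
    """Mean squared prediction error: the average squared gap over a window."""
    return sum(g * g for g in gaps) / len(gaps)


def mspe_ratio(years: Sequence[int], observed: Sequence[float],
               synthetic: Sequence[float], last_pre_year: int = 1993) -> float:
    """Post-period MSPE divided by pre-period MSPE for one unit."""
    gaps = [y - s for y, s in zip(observed, synthetic)]
    pre = [g for t, g in zip(years, gaps) if t <= last_pre_year]
    post = [g for t, g in zip(years, gaps) if t > last_pre_year]
    # A perfect pre-period fit would make the denominator zero; not handled here.
    return mspe(post) / mspe(pre)


# Toy series, not the Texas data: same post-period gap, different pre-period fit.
years = [1992, 1993, 1994, 1995]
synthetic = [10.0, 10.0, 10.0, 10.0]
tight_fit = [11.0, 9.0, 16.0, 16.0]   # pre gaps of size 1, post gaps of size 6
loose_fit = [13.0, 7.0, 16.0, 16.0]   # pre gaps of size 3, post gaps of size 6

print(mspe_ratio(years, tight_fit, synthetic))  # 36.0
print(mspe_ratio(years, loose_fit, synthetic))  # 4.0
```

The loosely fitting unit posts a ratio nine times smaller despite an identical post-period departure, because its pre-period MSPE is nine times larger.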

What the Ranking Says

Texas posts the largest MSPE ratio of all forty-nine units for both outcomes: the Black-male ratio and the white-male ratio each rank first of forty-nine.

The permutation p-value formalizes what “rank first” buys. With \(J+1\) units in the pool (the treated unit plus \(J\) donors), the p-value is the share of units whose MSPE ratio is at least as large as the treated unit’s:

\[ p = \dfrac{1}{J+1} \sum_{j=1}^{J+1} \mathbb{1}\big( r_j \ge r_{\text{TX}} \big) \]

where:

\(p\) is the permutation p-value: the share of all units whose MSPE ratio is at least as large as Texas’s
\(J+1\) is the total number of units (Texas plus \(J\) donors), here 49
\(r_j\) is unit \(j\)’s MSPE ratio and \(r_{\text{TX}}\) is Texas’s
\(\mathbb{1}(\cdot)\) counts 1 when the condition holds and 0 otherwise; when Texas ranks first the sum is 1, giving \(p = 1/49\)

When Texas has the single largest ratio, only Texas satisfies the inequality, so the sum equals one and \(p = 1/(J+1) = 1/49 \approx 0.02\). That is the smallest p-value the placebo distribution can produce at this pool size: with forty-nine units, no result can be more extreme than rank one, and both outcomes reach it.
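
The p-value formula translates directly into code. The sketch below is a hedged illustration: `permutation_p_value` is a hypothetical name, and the ratios are placeholders chosen only so that the treated unit ranks first among forty-nine, not the fitted values from this analysis.

```python
from typing import Mapping


def permutation_p_value(ratios: Mapping[str, float], treated: str) -> float:
    """Share of all units whose MSPE ratio is at least the treated unit's."""
    r_treated = ratios[treated]
    # The treated unit counts itself, so the smallest possible value is 1 / (J + 1).
    at_least_as_large = sum(1 for r in ratios.values() if r >= r_treated)
    return at_least_as_large / len(ratios)


# Placeholder ratios: 48 hypothetical donors plus a treated unit with the largest value.
ratios = {f"donor_{j}": float(j) for j in range(1, 49)}
ratios["TX"] = 100.0

p = permutation_p_value(ratios, "TX")
print(round(p, 4))  # 0.0204
```

Because the treated unit is always included in the count, the statistic can never reach zero; its floor is set by the size of the pool.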

This is the result the framing has to respect. On the question of whether an effect exists, there is no asymmetry: both the Black-male and the white-male effects are at the extreme of their placebo distributions, both as unlikely to be chance as the test can register. The asymmetry is entirely in magnitude, where the Black-male effect runs about a fifth larger proportionally and about half again larger in people. The data supports “both effects are real, and the Black-male effect is larger,” and it does not support “the effect is real for Black men and absent for white men.”
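
The proportional comparison can be checked against the chosen specification's gaps, 66 percent for Black men and 55 percent for white men, as reported in the specification table:

\[ \frac{\text{Black-male gap}}{\text{white-male gap}} = \frac{0.66}{0.55} = 1.2 \]

so the Black-male proportional gap is a fifth larger than the white-male one. The comparison in people depends on the underlying prisoner counts, which this section does not restate.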

How Much the Specification Matters
#

A fair objection to the specification ladder is that choosing “the richest spec that solves” could, in principle, be a way to land on a preferred answer. The defense is to show the answer barely moves as the specification changes. Refitting the post-1993 proportional gap on every rung of the ladder, from the full seven-covariate fit down to a bare specification matching only on outcome lags, gives this:

Rung	Specification	Black-Male Gap	White-Male Gap
1	7 Covariates, 3 Lags (Chosen)	66%	55%
2	7 Covariates, 1 Lag	67%	56%
3	4 Covariates, 2 Lags	66%	55%
4	3 Covariates, 1 Lag	67%	55%
5	Lags Only, 3 Lags	66%	59%
6	Lags Only, 2 Lags	65%	51%

The Black-male gap holds between 65 and 67 percent across every rung. The white-male gap is slightly more sensitive, ranging from 51 to 59 percent, but it never collapses toward zero and never approaches the Black-male figure. The chosen specification is not a lucky draw; it sits in the middle of a tight band. The unequal-burden finding survives the choice of predictors.
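
The band claims can be verified mechanically from the table. The Python sketch below copies the six rows as printed; the variable names are illustrative, and the check is arithmetic on the reported percentages, not a refit.

```python
# Proportional post-1993 gaps in percent, copied from the table above:
# (rung, specification, Black-male gap, white-male gap)
ladder = [
    (1, "7 covariates, 3 lags (chosen)", 66, 55),
    (2, "7 covariates, 1 lag", 67, 56),
    (3, "4 covariates, 2 lags", 66, 55),
    (4, "3 covariates, 1 lag", 67, 55),
    (5, "lags only, 3 lags", 66, 59),
    (6, "lags only, 2 lags", 65, 51),
]

black = [row[2] for row in ladder]
white = [row[3] for row in ladder]

print(min(black), max(black))  # 65 67
print(min(white), max(white))  # 51 59

# Smallest Black-minus-white difference on any rung, in percentage points.
print(min(b - w for b, w in zip(black, white)))  # 7
```

The narrowest separation is seven percentage points, on the three-lag lags-only rung, so the ordering of the two gaps holds on every specification.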

What This Design Can and Cannot Establish
#

Every analytical claim has limits worth naming, and synthetic control has specific ones.

The pre-period is short, but the pre-fit is tight. The weights are fit on eight years, 1986 to 1993. A skeptic might worry that Texas ranks first only because its post-period error is large, not because its pre-fit is good: a synthetic that never tracked Texas would leave any post-1993 gap uninterpretable. The pre-period MSPE rules that out. Texas’s pre-treatment fit error sits at the 4th percentile of the donor pool for Black men and the 19th for white men, so the synthetic tracks Texas more tightly before 1993 than nearly every placebo does for its own state in the Black-male fit, and more tightly than about four in five in the white-male fit. The rank-1 result is driven by a genuinely close pre-fit followed by a real post-1993 departure, not by a post-period gap measured against a loose match. The white-male pre-fit is the looser of the two, but the test design absorbs this: because the ratio divides post-period error by pre-period error, a looser pre-fit raises the denominator and works against a high ratio, not for it. That the white-male ratio still ranks first of forty-nine despite the looser pre-fit is the test confirming the effect rather than an artifact of the fit. Eight years is still a short pre-period, and a longer one would sharpen the counterfactual further, but the fit it produces is among the tightest in the pool.
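
One way to read a percentile like those quoted for the pre-period fit is as the share of units whose pre-period MSPE is at or below the unit in question. The sketch below assumes that definition, which the text does not specify, and uses placeholder values, not the fitted errors; `percentile_rank` is a hypothetical helper.

```python
from typing import Sequence


def percentile_rank(value: float, pool: Sequence[float]) -> float:
    """Percent of pool values at or below `value` (one of several conventions)."""
    return 100.0 * sum(1 for v in pool if v <= value) / len(pool)


# Placeholder pre-period MSPEs for ten hypothetical units, including the unit of interest.
pre_mspe_pool = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0]

print(percentile_rank(1.0, pre_mspe_pool))  # 20.0
```

Under this convention a low percentile means a small pre-period error, that is, a tight fit relative to the rest of the pool.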

This is an association with a credible counterfactual, not a randomized experiment. Synthetic control constructs the most defensible available comparison and tests it against placebos, which is far stronger than a raw before-and-after. It is not assignment by lottery. The honest verb throughout is that the expansion is “associated with” the estimated increase, and the placebo ranking is the evidence that the association is unlikely to be noise, not proof of the mechanism that produced it.

The estimate captures the expansion together with anything else that hit Texas, and only Texas, in 1993. Synthetic control attributes the post-1993 gap to the treatment, but it cannot separate the capacity expansion from any other Texas-specific shock that arrived in the same window. The expansion is the largest and best-documented candidate, which is why it carries the attribution, but the design measures the net departure from the counterfactual, not the expansion in isolation.

The mechanism behind the unequal burden is outside the data. The estimates establish that the increase fell more heavily on Black men. They do not explain why a race-neutral capacity increase produced a race-uneven result. That question, how supply interacts with the charging, sentencing, and parole decisions that actually fill cells, is real and important and lives beyond what these counts can answer. Naming it is part of the result; resolving it is not something this design can do.

The racial categories are the source data’s, with its limitations. “Black” and “white” here are the classifications as recorded in the underlying Bureau of Justice Statistics prisoner data, not categories this analysis defined. They carry that data’s limitations. The counts do not resolve Hispanic ethnicity, which BJS tracked separately and inconsistently across this period, so a prisoner counted as Black or white may also be Hispanic; and “male” reflects the sex recorded in the administrative data. The analysis can only measure the categories the source actually recorded, and the unequal-burden finding is a statement about those recorded categories, not about race as a fuller social reality.

The full fit, including the predictor set, the donor pool, and the figure generation, is reproducible: every number in this case study comes from fitting the documented specification against the texas panel, and anyone fitting the same specification gets the same estimates, or the case study fails its own reproducibility standard.

Sources

The method originates with Abadie, Diamond, and Hainmueller (2010), “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program,” Journal of the American Statistical Association 105 (490): 493–505. The estimator and its placebo-permutation inference both follow that paper.

The Texas prison-capacity data is the texas panel from Scott Cunningham’s Causal Inference: The Mixtape (Yale University Press, 2021), available online at mixtape.scunning.com.¹ The dataset is distributed in R through the causaldata package, which is how this analysis loads it.

The fit uses the tidysynth package (Dunford), a tidyverse-style interface to the synthetic control estimator.

¹ mixtape.scunning.com, Scott Cunningham.
