Estimating Treatment Effects in Panel Data Without Parallel Trends
Abstract: This paper proposes a novel approach for estimating treatment effects in panel data settings, addressing key limitations of the standard difference-in-differences (DID) approach. The standard approach relies on the parallel trends assumption, implicitly requiring that unobservable factors correlated with treatment assignment be unidimensional, time-invariant, and affect untreated potential outcomes in an additively separable manner. This paper introduces a more flexible framework that allows for multidimensional unobservables and non-additive separability, and provides sufficient conditions for identifying the average treatment effect on the treated. An empirical application to job displacement reveals substantially smaller long-run earnings losses compared to the standard DID approach, demonstrating the framework's ability to account for unobserved heterogeneity that manifests as differential outcome trajectories between treated and control groups.
Explain it Like I'm 14
Overview
This paper is about a better way to measure the effect of a “treatment” (like losing a job, starting a program, or a policy change) when we have data that follow the same people over time. The usual method, called difference-in-differences (DID), assumes the treated and untreated groups would have followed the same path over time if nobody had been treated (this is called “parallel trends”). The paper shows how to estimate treatment effects even when that assumption isn’t true.
Main Questions
- How can we estimate what would have happened to treated people if they had not been treated, without assuming parallel trends?
- Can we allow for more realistic, complicated “hidden differences” between people that change over time?
- Does this new method change what we conclude in a real example: the effect of job loss on earnings?
How the Method Works (Everyday Explanation)
Think of each person as having hidden traits that affect their outcome (like earnings). There can be several of these traits at once, and they can affect outcomes in ways that are more complicated than simply adding a fixed amount.
The common DID method assumes something very simple about these hidden traits: they act like a fixed personal “level” plus a general time effect, and that’s enough to make trends parallel. But in real life, people’s hidden traits can do more than that, so the parallel trends assumption can be wrong.
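In symbols, the simple structure that DID relies on can be sketched as follows. The notation ($\alpha_i$ for the fixed personal level, $\lambda_t$ for the general time effect, $\varepsilon_{it}$ for a shock unrelated to treatment status $D_i$) is an assumption made here for illustration and is not taken from the paper.

```latex
Y_{it}(0) = \alpha_i + \lambda_t + \varepsilon_{it},
\qquad
E[\varepsilon_{it} \mid D_i] = 0 \ \text{for all } t
\;\Longrightarrow\;
E[Y_{it}(0) - Y_{is}(0) \mid D_i = 1]
= E[Y_{it}(0) - Y_{is}(0) \mid D_i = 0]
= \lambda_t - \lambda_s .
```

The personal level $\alpha_i$ cancels in the difference over time, so both groups share the same expected untreated change. Once a hidden trait enters in any other way (for example, by multiplying a time effect), this cancellation no longer happens.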
Here’s the key idea of the paper, explained with analogies:
- Multiple “noisy snapshots” of hidden traits:
- Before treatment happens, we often observe the untreated outcome for several periods (for example, several years of pre-treatment earnings). Each of these is like a blurry photo of a person’s hidden traits. One blurry photo isn’t enough to see the true face, but several different blurry photos can be combined to reconstruct it.
- Using enough “angles” to recover what’s hidden:
- With enough pre- and/or post-treatment periods, and some mild technical conditions, the method can recover how hidden traits are distributed and how they relate to outcomes, even though we never observe the traits directly. This is similar to using many camera angles to build a 3D picture.
- Fair comparisons need overlap:
- For any type of person (in terms of hidden traits), there must be some who are treated and some who are not. Otherwise, we can’t compare like with like.
In practice, the method:
- Uses the repeated untreated outcomes (pre-treatment periods and untreated post-treatment outcomes for the control group) as multiple “noisy measurements” of each person’s hidden traits.
- Learns how outcomes depend on these hidden traits from the control group (who have not received the treatment at that time).
- Figures out how the hidden traits are distributed in both the treated and control groups.
- Combines this information to estimate what the treated group’s outcomes would have been without treatment, and then compares that to what actually happened.
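The last step of this list can be illustrated with a toy simulation. This is only a sketch under a strong simplifying assumption: the hidden trait `u` is treated as directly observed and takes three values, whereas the paper never observes it and recovers its distribution from repeated outcomes. The function names, the treatment probabilities, and the outcome model are all hypothetical.

```python
import numpy as np

def oracle_att(y_post, treated, u):
    """Reweight control-group means by the treated group's mix of hidden types.

    ASSUMPTION: u is observed here. In the paper it is latent.
    """
    treated = np.asarray(treated, dtype=bool)
    counterfactual = 0.0
    for k in np.unique(u[treated]):
        share_among_treated = np.mean(u[treated] == k)
        controls_of_type = (~treated) & (u == k)
        if not controls_of_type.any():
            # Overlap fails: no untreated units to compare with this type.
            raise ValueError(f"no untreated units of type {k}")
        counterfactual += share_among_treated * y_post[controls_of_type].mean()
    return y_post[treated].mean() - counterfactual

def simulate(n=30000, effect=-1.0, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.integers(0, 3, size=n)                 # hidden type: 0, 1 or 2
    p_treat = np.array([0.2, 0.5, 0.8])[u]         # higher types are treated more often
    treated = rng.random(n) < p_treat
    y_untreated = u.astype(float) ** 2 + rng.normal(0.0, 0.5, size=n)
    y_post = y_untreated + effect * treated        # true effect on the treated is `effect`
    return y_post, treated, u

y_post, treated, u = simulate()
naive = y_post[treated].mean() - y_post[~treated].mean()
reweighted = oracle_att(y_post, treated, u)
print(f"naive gap: {naive:.2f}, reweighted ATT: {reweighted:.2f}, true ATT: -1.00")
```

The naive treated-versus-control gap mixes the treatment effect with the difference in hidden types, while the reweighted comparison lines up like with like. The error raised when a type has no untreated units is the overlap requirement in code form.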
Important requirement: the number of hidden traits we try to learn can’t be bigger than the number of pre-treatment periods or the number of post-treatment periods, whichever is smaller. So this works best when you have several time periods of data both before and after treatment.
What Did They Find?
The paper applies the method to a classic question: What happens to people’s earnings after they lose their jobs?
- Using standard DID (which assumes parallel trends), long-run earnings losses after job loss look large.
- Using the new method (which does not require parallel trends and better accounts for hidden differences and their changing effects), the long-run losses are much smaller.
- In fact, about nine years after job loss, the estimated earnings reduction is roughly half of what standard DID suggests.
Why this matters: If treated and control groups were on different paths even before treatment (different “growth trajectories”), standard DID can mistake that difference for a treatment effect. The new method corrects for that by using the extra time periods to better understand hidden differences.
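A small simulation makes this concrete. It is a minimal sketch with made-up numbers, not the paper's data or estimator: each unit has a hidden growth rate, slower-growing units are more likely to be treated, and the true effect is fixed at -1.

```python
import numpy as np

def did_with_diverging_paths(n=20000, effect=-1.0, seed=0):
    rng = np.random.default_rng(seed)
    level = rng.normal(0.0, 1.0, size=n)     # hidden fixed level
    growth = rng.normal(0.0, 1.0, size=n)    # hidden growth rate (breaks parallel trends)
    # Units with lower growth are more likely to be treated.
    treated = rng.random(n) < 1.0 / (1.0 + np.exp(growth))

    y_pre = level - growth + rng.normal(0.0, 0.5, size=n)             # period t = -1
    y_post_untreated = level + growth + rng.normal(0.0, 0.5, size=n)  # period t = +1
    y_post = y_post_untreated + effect * treated

    # Standard two-period DID estimate.
    change = y_post - y_pre
    did = change[treated].mean() - change[~treated].mean()

    # Gap in untreated trends between groups (visible only in a simulation).
    untreated_change = y_post_untreated - y_pre
    trend_gap = untreated_change[treated].mean() - untreated_change[~treated].mean()
    return did, trend_gap

did, trend_gap = did_with_diverging_paths()
print(f"DID estimate: {did:.2f}, untreated trend gap: {trend_gap:.2f}, true effect: -1.00")
```

The DID estimate equals the true effect plus the untreated trend gap, so in this setup DID reports a loss well beyond the true -1. In real data the trend gap is not observable, which is why extra pre-treatment periods are needed to learn about it.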
Extra Capabilities
- Beyond averages: The method can estimate how the whole distribution of outcomes changes (for example, effects at the 10th, 50th, or 90th percentile), not just the average effect.
- Heterogeneity: It can explore how treatment effects vary across different types of people (based on their hidden traits).
- Staggered timing: It can be adapted to settings where different groups get treated at different times, with some care in how groups are chosen.
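For the "beyond averages" point, the effect at a given percentile among the treated is the gap between the corresponding quantiles of their observed outcomes and their counterfactual untreated outcomes. The sketch below assumes that draws from the counterfactual distribution have already been produced by some estimator. The function name `qtt` and the synthetic data are hypothetical.

```python
import numpy as np

def qtt(y_observed_treated, y_counterfactual_treated, quantiles=(0.1, 0.5, 0.9)):
    """Quantile treatment effects on the treated at the requested quantiles."""
    q = np.asarray(quantiles, dtype=float)
    return np.quantile(y_observed_treated, q) - np.quantile(y_counterfactual_treated, q)

# Synthetic check: treatment lowers every treated unit's outcome by exactly 2.
rng = np.random.default_rng(0)
y0 = rng.normal(10.0, 1.0, size=5000)   # stand-in for counterfactual untreated outcomes
y1 = y0 - 2.0                           # observed treated outcomes
effects = qtt(y1, y0)
print(effects)
```

With a constant shift the effect is the same at every quantile. Differences across quantiles in real data would indicate that the treatment changes the shape of the outcome distribution, not just its mean.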
Why It Matters
- More realistic: It allows for complex, multi-dimensional hidden differences and doesn’t force the “parallel trends” story.
- More accurate: It can prevent over- or under-estimating treatment effects when groups were already moving differently before treatment.
- Practical guidance:
- You need rich data with several time periods (especially pre-treatment) to get reliable results.
- The method is more complex to estimate than standard DID, but it can greatly improve credibility.
- Policy impact: Decisions based on more accurate effect sizes (like the true long-run cost of job loss) can lead to better policy design and targeting.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of concrete gaps and open questions that remain unresolved and could guide future research:
- Strength of conditional independence across blocks: The core assumption that Ypre(0), (Y0(0), D), and Ypost(0) are mutually independent given U is restrictive; practical diagnostics, falsification tests, or sensitivity analyses tailored to this block-independence structure are not developed.
- Allowing limited cross-block dependence: Identification under weaker dependence structures (e.g., allowing small residual serial correlation across blocks beyond U) is not characterized.
- Completeness conditions in practice: The paper relies on abstract completeness conditions (pre and post) that are hard to verify; primitive sufficient conditions linked to observable features (e.g., support, tail behavior, factor-loading variation) and data-driven tests for completeness are missing.
- Dimensionality of unobservables: There is no procedure to select or estimate the dimension K of U from data; consequences of K mis-specification (over/under) for bias and identification are not analyzed.
- Partial identification when completeness fails: No bounds or sensitivity sets are provided for ATT/QTT when completeness does not hold or when K > min(T_pre, T_post).
- Mixed and discrete outcomes: Identification and estimation rely on densities and boundedness; extensions to mixed discrete-continuous outcomes (e.g., mass points at zero earnings, as in the application) and to censoring/top-coding are not developed.
- Censoring/top-coding in SIAB: The empirical setting has top-coded earnings; the identification strategy does not incorporate censoring mechanisms or propose corrections compatible with the Hu–Schennach approach.
- Explicit role of observed covariates X: Although assumptions are “conditioned on covariates,” the identification and estimation algorithms do not explicitly incorporate high-dimensional X; methods for orthogonalization/double robustness or ML-based nuisance adjustment are not provided.
- Inference for multi-step/sieve estimators: Asymptotic distribution theory, variance estimation, and valid standard errors (e.g., bootstrap schemes robust to ill-posed inversion) for the proposed estimators are left undeveloped.
- Finite-sample performance: There is no Monte Carlo evidence quantifying finite-sample bias/variance and robustness under realistic DGPs (factor models, HMM, heteroskedastic shocks) or under mild violations of key assumptions.
- Construction of the normalization functional M: The paper assumes a known functional M with M[f_{Ypre(0)|U}(·|u)] = u, but provides no constructive guidance or examples for implementing M in practice across model classes.
- Overlap diagnostics in U: Beyond Assumption 2, practical diagnostics for overlap in the recovered distribution of U across D=1 and D=0 (and trimming rules when overlap is weak) are not provided.
- Anticipation and dynamic selection: The framework allows correlation between Y0(0) and D but not between Y_{t<0}(0) and D given U; methods to handle anticipation effects that contaminate multiple pre-periods are not explored.
- Staggered adoption with time-varying unobservables: The paper notes limitations but does not offer identification strategies when cohort definitions induce selection on evolving shocks (e.g., using augmented state variables, instruments, or alternative conditioning sets).
- Multi-valued/continuous and dynamic treatments: Extensions beyond binary one-time treatment (e.g., intensity, dosage, repeated spells) are not analyzed.
- Robustness to nonlinear outcome transformations: While claiming robustness to transformations, the paper does not formalize transformation-invariance conditions or provide guidance on which transformations preserve identification.
- Common shocks and cross-sectional dependence: The i.i.d. across units assumption and large-N fixed-T framework ignore clustered shocks/spatial dependence; identification and inference under correlated unit-level shocks are not addressed.
- Attrition and observation process: Assigning zero earnings when not observed conflates non-coverage with true zero; implications for identification (especially the density/boundedness requirements and block independence) are not analyzed.
- SUTVA and spillovers: Potential spillovers (e.g., displacement affecting labor market conditions for controls) are not discussed; conditions and corrections for interference are not provided.
- Scalable estimation under high K: Practical templates for parsimonious parametric/semiparametric specifications that remain faithful to identification while scaling to larger K (including regularization/penalization strategies) are not offered.
- Choosing between Assumption 4 (Hu–Schennach path) and Assumption 5 (direct model-based path): Empirically implementable criteria, overidentification tests, or model selection procedures are not proposed.
- Sensitivity analysis for key assumptions: Bias formulas or sensitivity parameters for violations of (i) block-independence, (ii) completeness, (iii) overlap, and (iv) normalization are not developed.
- Unbalanced panels and missing periods: The framework assumes balanced panels; extensions to intermittent missingness (beyond assigning zeros) with identification-preserving conditions are not treated.
- Treatment misclassification: The consequences of mismeasured treatment status D and corrections compatible with the measurement-error identification scheme are not explored.
- Guidance on data design: Practical advice on the minimum number and spacing of pre- and post-periods (relative to plausible K) and on how to design panels to satisfy completeness and independence conditions is absent.
- External validity of the application: The empirical illustration uses a highly selected subsample (male, 30–39, West Germany, specific education); how results generalize across demographics, sectors, and macro environments is not evaluated.
- Distribution of individual treatment effects: While E[Y(1)−Y(0)|U] is shown identifiable under stronger assumptions, identification of the full distribution of Y(1)−Y(0) without rank invariance (e.g., via additional measurements/instruments) remains open.
Practical Applications
Immediate Applications
Below are actionable use cases that can be deployed now, leveraging the paper’s identification and estimation framework for treatment effects in panel data without the parallel trends assumption.
- Re-evaluate policy impacts where parallel trends are suspect (policy; labor, education, health)
- What: Re-estimate ATTs (and QTTs) for programs like job training, education reforms, hospital process changes, minimum wage changes, and tax credits using multi-period panels.
- Why: The method corrects for selection on multidimensional, time-varying unobservables that produce differential pre-trends.
- Tools/products/workflows:
- Implement maximum-likelihood (ML) or sieve-likelihood estimation of f(Ypre|U), f(Ypost|U), f(Y0,D|U), then compute ATT via the paper’s bias-correction equation.
- Provide an analyst-facing function in R/Python/Stata (e.g., “did_noPT()”) that outputs ATT, QTT, and contrasts with standard DID.
- Include a workflow: (i) pre/post window selection, (ii) check K ≤ min(Tpre, Tpost), (iii) estimate, (iv) diagnostic contrasts with standard DID, (v) placebo checks.
- Assumptions/dependencies:
- Rich panel (multiple pre and/or post periods), large N.
- Nondeterministic treatment given U; support overlap across D=0/1.
- Conditional independence of pre, ref-period (t=0), and post untreated outcomes given U.
- Completeness conditions and/or a correctly specified (semi)parametric model to identify f(·|U).
- Computation is feasible when K is small.
- More credible firm- or agency-level impact evaluations when units have different growth paths (industry; HR/operations, marketing, product)
- What: Measure causal effects of layoffs, scheduling rules, pricing policies, or process changes on KPIs when treated and control units show diverging pre-trends.
- Why: The approach treats pre-treatment outcomes as repeated noisy measurements of latent heterogeneity U, avoiding ad hoc unit-specific linear trends.
- Tools/products/workflows:
- Analytics module embedded in BI stacks to estimate causal impact with repeated pre-periods (e.g., in marketing mix or HR analytics).
- Side-by-side dashboards showing DID vs “no-parallel-trends DID” with confidence intervals and QTT.
- Assumptions/dependencies:
- Stable measurement over time; sufficient pre-periods relative to latent dimensionality.
- Independence of idiosyncratic shocks across pre/ref/post blocks given U.
- Robust A/B experiment analysis in settings with panel outcomes and drift (software/tech platforms, e-commerce)
- What: Estimate short- and medium-run effects of product features or algorithm changes when cohorts exhibit nonparallel trajectories.
- Why: Reduces bias from cohort drift and latent user heterogeneity not captured by additive models.
- Tools/products/workflows:
- Experiment-service add-on supporting panel-based inference with repeated pre-periods, providing ATT/QTT estimates and heterogeneity summaries by inferred U bins.
- Assumptions/dependencies:
- Randomization can coexist with time-varying adoption or exposure; need multiple pre-outcome measurements per unit.
- K chosen small; completeness or a suitable factor/latent model is identified.
- Event-study style applications with heterogeneous trends (finance; corporate finance, risk)
- What: Evaluate impacts of events (e.g., regulatory shocks, product recalls) on firm outcomes where treated firms have different pre-trends from controls.
- Tools/products/workflows:
- Estimation templates that replace unit-trend DID with latent-heterogeneity correction, including QTT for tail-risk implications.
- Assumptions/dependencies:
- Sufficient pre/post windows; specification of latent factor structure or hidden Markov-style dynamics if applicable.
- Distributional impact reporting via QTT under the same assumptions as ATT (policy, education, health, marketing)
- What: Report how effects vary across the outcome distribution (e.g., which patients/students/consumers benefit most).
- Tools/products/workflows:
- Automated QTT computation alongside ATT; standardized plots of counterfactual distributions F(Yt(0)|D=1) and realized F(Yt(1)|D=1).
- Assumptions/dependencies:
- Same as ATT identification; no extra distributional restrictions needed beyond those already required.
- Better practice for using pre-treatment outcomes as covariates (academia, industry analytics)
- What: Replace ad hoc inclusion of pre-outcomes as regressors (which are noisy) with the paper’s measurement-error-based framework.
- Tools/products/workflows:
- “Pre-outcomes-as-measurements” estimator templates; documentation contrasting with naive conditioning on Ypre.
- Assumptions/dependencies:
- Identification relies on completeness or correctly specified latent structure; enough pre-periods.
- Workforce policy planning using revised displacement effects (policy; labor agencies)
- What: Update long-run earnings loss estimates from job displacement to adjust UI reserves, retraining budgets, and counseling intensity.
- Tools/products/workflows:
- Forecast modules that plug the paper’s revised ATTs into budget simulations and cost–benefit analyses.
- Assumptions/dependencies:
- Administrative panels with long horizons; careful definition of control groups consistent with potential outcomes.
- Teaching and replication in empirical courses and labs (academia)
- What: Incorporate replication exercises showing when standard DID fails and how the proposed framework changes conclusions.
- Tools/products/workflows:
- Course notebooks with ready-to-run estimation and diagnostics; side-by-side DID vs no-PT-DID comparisons.
- Assumptions/dependencies:
- Sample datasets with adequate Tpre/Tpost and documented assumption checks.
Long-Term Applications
The following use cases likely require further methodological development, scaling, or software maturation before broad deployment.
- Turnkey software suite with diagnostics for completeness and model selection (software; research/enterprise analytics)
- What: A robust package that:
- Automates choice among identification paths (Hu-Schennach, nonlinear factor, hidden Markov) based on data features.
- Offers diagnostics for completeness, support overlap, and conditional independence of blocks.
- Provides uncertainty quantification under misspecification and finite-sample corrections.
- Dependencies/assumptions:
- Research on practical diagnostics for completeness and block-independence.
- Efficient high-dimensional density or factor-model estimation.
- Scalable, high-dimensional latent structure with limited pre/post periods (cross-sector)
- What: Methods that relax K ≤ min(Tpre, Tpost) via structural shrinkage, Bayesian priors, or representation learning for U.
- Potential tools:
- Neural/sieve hybrids that regularize f(Y|U) and recover U under weaker conditions.
- Dependencies/assumptions:
- Theory for identification with learned representations; guarantees under finite T.
- Staggered adoption with endogenous timing and dynamic unobservables (policy, industry rollouts)
- What: Identification and estimation workflows robust to selection induced by excluding cohorts treated mid-window (as flagged in the paper).
- Potential tools:
- Algorithms that jointly model timing and outcomes using richer latent states (e.g., time-varying U via state-space models).
- Dependencies/assumptions:
- New identification results handling time-varying unobservables and event exclusion.
- Handling attrition, missingness, and sample selection in administrative panels (policy, health, labor)
- What: Extensions that integrate selective missingness into the measurement framework (e.g., earnings dropping to zero due to sector transitions).
- Potential tools:
- Joint models of outcomes and observation processes; sensitivity analysis modules.
- Dependencies/assumptions:
- Nonignorable missingness identification strategies compatible with the measurement approach.
- Real-time or streaming panel analytics with rolling windows (tech platforms, energy, IoT)
- What: Near-real-time causal monitoring when pre/post windows evolve and distributions shift.
- Potential tools:
- Online EM/variational methods for latent-U models; drift detection that preserves identification.
- Dependencies/assumptions:
- Stable measurement channel for Ypre; adaptive window selection preserving K ≤ min(Tpre, Tpost).
- Sector-specific structural integrations
- Healthcare: Integrate with hospital quality dashboards to evaluate staggered clinical pathway rollouts under heterogeneous trends.
- Education: Combine with student growth models to estimate curriculum impacts where pre-test distributions drift.
- Energy: Evaluate time-of-use pricing or DER incentive pilots with customer-level heterogeneity and nonparallel load trends.
- Finance: Regulatory impact toolkits that account for firm-specific trajectories (beyond additive trends).
- Tools/products/workflows:
- Domain-tuned latent models (e.g., hidden Markov demand/load processes; nonlinear factor achievement models).
- Dependencies/assumptions:
- Domain-specific measurement models; validation datasets to justify conditional independence structure.
- Heterogeneity mapping and targeting via conditional treatment effects given U, E[Y(1)−Y(0)|U]
- What: Use identified conditional treatment effects to target subgroups for whom interventions are most cost-effective.
- Potential tools:
- U-score calculators and policy targeting simulators that respect identification limits.
- Dependencies/assumptions:
- Extended assumptions for treated outcomes (as in the paper’s Assumptions 6–7); validated mapping from U to observable proxies for implementation.
- Practitioner playbooks and standards for “no-parallel-trends DID” (standards bodies, journals)
- What: Best-practice checklists for assumption justification, window choice, model path selection, and reporting ATT/QTT/heterogeneity.
- Dependencies/assumptions:
- Consensus on diagnostics and transparency norms; simulation libraries for stress-testing.
In all applications, feasibility hinges on:
- Data richness: multiple pre/post periods and large N.
- Overlap and nondeterministic treatment given U.
- Credible conditional independence across pre/ref/post blocks given U.
- Either completeness conditions or a correctly specified (semi)parametric latent structure.
- Computational tractability as K grows, often necessitating structural constraints or regularization.
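Of these conditions, only the data-richness one can be checked from the panel's dimensions alone. A minimal pre-flight check for the K ≤ min(Tpre, Tpost) requirement used in the workflows above might look like the sketch below. The function name is hypothetical, and passing this check says nothing about overlap, conditional independence, or completeness.

```python
def latent_dimension_ok(k, t_pre, t_post):
    """Return True when K <= min(T_pre, T_post).

    k      : assumed number of latent dimensions in U
    t_pre  : number of pre-treatment periods
    t_post : number of post-treatment periods
    """
    if min(k, t_pre, t_post) < 1:
        raise ValueError("k, t_pre and t_post must all be at least 1")
    return k <= min(t_pre, t_post)

print(latent_dimension_ok(2, 3, 4))   # two latent dimensions, 3 pre and 4 post periods
print(latent_dimension_ok(3, 2, 5))   # three latent dimensions, only 2 pre periods
```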
Glossary
- Additively separable model: A specification where components affecting an outcome enter as a simple sum of unit, time, and error terms. "additively separable model of untreated potential outcomes:"
- Ashenfelter's dip: A documented pre-treatment decline in earnings for individuals who later receive training or experience displacement. "“Ashenfelter’s dip.”"
- Average treatment effect (ATE): The population-wide mean effect of a treatment on outcomes. "It enables the recovery of the population average treatment effect (ATE)"
- Average treatment effect on the treated (ATT): The mean effect of a treatment among those who actually receive the treatment. "I focus on identifying the average treatment effect on the treated (ATT),"
- Average treatment effect on the untreated (ATU): The mean effect the treatment would have had on those who did not receive it. "it allows identification of the average treatment effect on the untreated (ATU):"
- Balanced panel: A panel dataset where all units are observed in all time periods. "I consider a nonstaggered DID setting with a balanced panel."
- Changes-in-changes: An identification framework allowing for nonseparable models by tracking changes in the distribution over time. "relax additivity by allowing nonseparable models under a changes-in-changes framework"
- Conditional average treatment effect: The mean treatment effect conditional on covariates or latent variables. "identifies the conditional average treatment effect (of displacement on earnings periods later)."
- Conditional heteroskedasticity: Variance of the error term depends on covariates or latent variables. "must exhibit conditional heteroskedasticity"
- Conditional Independence: An assumption that variables become independent once conditioning on latent variables. "Conditional Independence"
- Conditional parallel trends: A version of parallel trends that holds within strata defined by covariates. "Under a standard conditional parallel trends assumption"
- Completeness: A property of conditional distributions ensuring unique inversion in integral equations for identification. "requires completeness of $f_{\boldsymbol{Y}^{\text{pre}}(0)|U}$."
- Difference-in-differences (DID): An empirical method comparing changes over time between treated and control groups to estimate causal effects. "The difference-in-differences (DID) approach is one of the most widely used methods"
- Doubly-robust method: An estimation approach that remains consistent if either the outcome model or treatment model is correctly specified. "I estimate the parameter using a doubly-robust method"
- Factor loading: Coefficients linking latent factors to observed outcomes in factor models. " is a vector of factor loading"
- Hidden Markov model: A model where observed outcomes depend on a latent state that evolves according to a Markov process. "such as hidden Markov models."
- Idiosyncratic error: A random shock specific to an individual and time period, unrelated to treatment. "is an idiosyncratic error uncorrelated with treatment."
- i.i.d. (independent and identically distributed): Random variables that are statistically independent and share the same distribution. "i.i.d. across units"
- Large-N, fixed-T framework: Asymptotic setting with many units (N large) but a limited number of time periods (T fixed). "adhering to a large-N, fixed-T framework."
- Latent variable methods: Techniques that model unobserved variables influencing observed outcomes. "This paper also relates to the literature on latent variable methods in econometrics."
- Linear factor model: An outcome model where observed data are linear functions of latent factors and loadings. "A linear factor model is typically specified as:"
- Nonclassical measurement error: Measurement error that may be correlated with true values or other variables, violating classical assumptions. "nonclassical measurement error models"
- Nonparametric identification: Establishing causal parameters without imposing specific functional forms. "under nonparametric conditions without requiring parallel trends to hold"
- Nonseparable model: A model where the effects of latent variables and shocks interact non-additively. "allowing nonseparable models under a changes-in-changes framework"
- Nonstaggered DID setting: A design where all treated units begin treatment at the same time rather than at varying times. "I consider a nonstaggered DID setting"
- Normalization: Fixing the scale or mapping of a latent variable for identification purposes. "admits a normalization using a known functional M"
- Partial identification: Bounding, rather than point-estimating, causal effects when full identification is not possible. "Another recent approach involves partial identification techniques"
- Potential outcomes: Conceptual outcomes that would be observed under different treatment states for the same unit. "Let represent a potential outcome"
- Quantile treatment effect on the treated (QTT): The effect of treatment on specific quantiles of the outcome distribution among treated units. "The quantile treatment effect on the treated (QTT) is useful for understanding"
- Selection on unobservables: Differences in outcome due to factors not observed by the researcher that are correlated with treatment. "The second term captures selection on unobservables"
- Sieve approximations: Flexible, series-based approximations used to estimate complex functions nonparametrically. "represented using sieve approximations."
- Single-index specification: A model where latent variables affect outcomes through a single scalar index. "single-index specifications"
- Stacked estimation approach: Pooling multiple cohorts or time windows into a single dataset for uniform estimation. "I employ a stacked estimation approach"
- Staggered adoption settings: Designs where units adopt treatment at different times, allowing group-time specific effects. "staggered adoption settings"
- Support (of a distribution): The set of values where a random variable has positive probability density. "distributed on the support "
- Top-coding: Censoring high values of a variable at a maximum threshold in administrative or survey data. "Annual earnings are subject to top-coding at the social security contribution ceiling."
- Unit-specific linear trend specification: A model allowing each unit to have its own linear time trend. "A special case of this model is the unit-specific linear trend specification"