# Set identification

In statistics and econometrics, set identification (or partial identification) extends the concept of identifiability (or "point identification") in statistical models to situations where the distribution of observable variables does not determine the exact value of a parameter, but instead constrains the parameter to lie in a strict subset of the parameter space. Set-identified statistical models arise in a variety of settings in economics, including game theory and the Rubin causal model. Although the use of set identification dates to a 1934 article by Ragnar Frisch, the methods were significantly developed and promoted by Charles Manski starting in the 1990s.[1] Manski developed a method of worst-case bounds to account for selection bias. Unlike methods that impose additional statistical assumptions, such as the Heckman correction, the worst-case bounds rely only on the data to generate a range of supported parameter values.[2]

## Definition

Let $\displaystyle{ \mathcal{P}=\{P_\theta:\theta\in\Theta\} }$ be a statistical model where the parameter space $\displaystyle{ \Theta }$ is either finite- or infinite-dimensional. Suppose $\displaystyle{ \theta_0 }$ is the true parameter value. We say that $\displaystyle{ \theta_0 }$ is set identified if there exists $\displaystyle{ \theta \in \Theta }$ such that $\displaystyle{ P_\theta \neq P_{\theta_0} }$; that is, if some parameter values in $\displaystyle{ \Theta }$ are not observationally equivalent to $\displaystyle{ \theta_0 }$. In that case, the identified set is the set of parameter values that are observationally equivalent to $\displaystyle{ \theta_0 }$.[1]
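The definition can be illustrated on a finite parameter space, where observational equivalence can be checked by enumeration. The following is a minimal sketch under stated assumptions: the toy model (a Bernoulli observable whose success probability is the product of two parameters), the grid, and the names `identified_set` and `distribution` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import product


def identified_set(theta0, parameter_space, distribution):
    """Return every theta that is observationally equivalent to theta0,
    i.e. every theta with distribution(theta) == distribution(theta0)."""
    target = distribution(theta0)
    return [theta for theta in parameter_space if distribution(theta) == target]


# Hypothetical toy model: theta = (a, b) and the observable X is
# Bernoulli with success probability a * b, so only the product matters.
grid = [Fraction(k, 4) for k in range(1, 5)]  # 1/4, 1/2, 3/4, 1
parameter_space = list(product(grid, grid))


def distribution(theta):
    a, b = theta
    p = a * b
    return (1 - p, p)  # (P(X = 0), P(X = 1))


theta0 = (Fraction(1, 2), Fraction(1, 2))
theta_I = identified_set(theta0, parameter_space, distribution)

# Set identified in the sense of the definition: at least one parameter
# value generates a different distribution from theta0.
is_set_identified = any(
    distribution(theta) != distribution(theta0) for theta in parameter_space
)
# Point identification is the special case of a singleton identified set.
is_point_identified = len(theta_I) == 1

print(theta_I, is_set_identified, is_point_identified)
```

In this toy model the data rule out most of the grid but cannot distinguish between pairs with the same product, so the identified set contains more than one element. Exact rational arithmetic (`Fraction`) is used so that equality of distributions is not affected by floating-point rounding.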

## Example: missing data

This example is due to Tamer (2010). Suppose there are two binary random variables, Y and Z. The econometrician is interested in $\displaystyle{ \mathrm P(Y = 1) }$. There is a missing data problem, however: Y is observed only if $\displaystyle{ Z = 1 }$.

By the law of total probability,

$\displaystyle{ \mathrm P(Y = 1) = \mathrm P(Y = 1 \mid Z = 1) \mathrm P(Z = 1) + \mathrm P(Y = 1 \mid Z = 0) \mathrm P(Z = 0). }$

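The only unknown object is $\displaystyle{ \mathrm P(Y = 1 \mid Z = 0) }$, because the other three terms are determined by the distribution of the observed data. The data place no restriction on it beyond lying between 0 and 1. Therefore, the identified set is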

$\displaystyle{ \Theta_I = \{ p \in [0, 1] : p = \mathrm P(Y = 1 \mid Z = 1) \mathrm P(Z = 1) + q \mathrm P(Z = 0), \text{ for some } q \in [0,1]\}. }$

Given the missing data constraint, the econometrician can only say that $\displaystyle{ \mathrm P(Y = 1) \in \Theta_I }$. This set makes use of all available information, so it cannot be narrowed without additional assumptions.
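Because the expression defining $\displaystyle{ \Theta_I }$ is increasing in q, the identified set is the closed interval $\displaystyle{ [\mathrm P(Y = 1 \mid Z = 1) \mathrm P(Z = 1),\ \mathrm P(Y = 1 \mid Z = 1) \mathrm P(Z = 1) + \mathrm P(Z = 0)] }$, whose width is $\displaystyle{ \mathrm P(Z = 0) }$. A minimal Python sketch of the sample analogue is given below. The function name `worst_case_bounds`, the encoding of an unobserved Y as `None`, and the toy data are illustrative assumptions and do not come from Tamer (2010).

```python
from typing import Optional, Sequence, Tuple


def worst_case_bounds(
    y: Sequence[Optional[int]], z: Sequence[int]
) -> Tuple[float, float]:
    """Sample analogue of the worst-case bounds on P(Y = 1).

    y[i] is 0 or 1 when z[i] == 1 and is ignored (e.g. None) when z[i] == 0.
    """
    if len(y) != len(z) or len(z) == 0:
        raise ValueError("y and z must be non-empty and of equal length")

    n = len(z)
    # Count of observations with Y observed and equal to 1; dividing by n
    # estimates P(Y = 1 | Z = 1) * P(Z = 1) = P(Y = 1, Z = 1).
    n_y1_observed = sum(1 for yi, zi in zip(y, z) if zi == 1 and yi == 1)
    # Count of observations with Y missing; dividing by n estimates P(Z = 0).
    n_missing = sum(1 for zi in z if zi == 0)

    lower = n_y1_observed / n                # q = 0: every missing Y equals 0
    upper = (n_y1_observed + n_missing) / n  # q = 1: every missing Y equals 1
    return lower, upper


# Toy data (made up for illustration): 10 units, 3 with Y missing.
y = [1, 0, 1, 1, None, None, 0, 1, None, 0]
z = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1]

lower, upper = worst_case_bounds(y, z)
print(f"P(Y = 1) lies in [{lower}, {upper}]")
```

With these toy data, 4 of the 10 units have an observed Y equal to 1 and 3 have Y missing, so the estimated interval runs from 0.4 to 0.7. Its width equals the share of missing observations, which shows why the bounds become uninformative as missingness grows.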

## Statistical inference

Set estimation cannot rely on the usual tools for statistical inference developed for point estimation. A literature in statistics and econometrics studies methods for statistical inference in the context of set-identified models, focusing on constructing confidence intervals or confidence regions with appropriate properties. For example, a method developed by (Chernozhukov Hong), which Lewbel (2019) describes as complicated, constructs confidence regions that cover the identified set with a given probability.
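To make the idea of covering the identified set concrete, the sketch below builds a simple conservative interval for the missing-data example. It is not the method cited above. It is a much cruder construction that assumes i.i.d. sampling, uses a normal approximation for each endpoint, and splits the error probability between the two endpoints with a Bonferroni correction. Both endpoints of the identified set are means of binary variables (the lower endpoint is $\displaystyle{ \mathrm P(Y = 1, Z = 1) }$ and the upper endpoint adds $\displaystyle{ \mathrm P(Z = 0) }$), so binomial standard errors apply. The function name `conservative_region` and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def conservative_region(
    n_y1_observed: int, n_missing: int, n: int, alpha: float = 0.05
) -> Tuple[float, float]:
    """Interval intended to cover the whole identified set [L, U] with
    asymptotic probability at least 1 - alpha (Bonferroni, normal approx.).

    n_y1_observed: units with Y observed and equal to 1
    n_missing:     units with Y unobserved (Z = 0)
    n:             total sample size
    """
    lower_hat = n_y1_observed / n                # estimates L = P(Y=1, Z=1)
    upper_hat = (n_y1_observed + n_missing) / n  # estimates U = L + P(Z=0)

    # One-sided critical value at level alpha / 2 for each endpoint.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    # Each endpoint is a sample proportion, so use the binomial standard error.
    se_lower = sqrt(lower_hat * (1 - lower_hat) / n)
    se_upper = sqrt(upper_hat * (1 - upper_hat) / n)

    # Widen outward and clip to the unit interval.
    lo = max(0.0, lower_hat - z_crit * se_lower)
    hi = min(1.0, upper_hat + z_crit * se_upper)
    return lo, hi


# Hypothetical counts: 1000 units, 400 with observed Y = 1, 300 with Y missing.
lo, hi = conservative_region(n_y1_observed=400, n_missing=300, n=1000)
print(f"Estimated identified set: [0.4, 0.7]; region: [{lo:.4f}, {hi:.4f}]")
```

The resulting interval is wider than the estimated identified set because it also accounts for sampling error in the endpoints. The Bonferroni split is generally conservative, and the normal approximation can be poor when an endpoint is close to 0 or 1; more refined procedures in the literature address these issues.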