Knockoffs (statistics)

In statistics, the knockoff filter, or simply knockoffs, is a framework for variable selection. It was originally introduced for linear regression by Rina Barber and Emmanuel Candès,[1] and later generalized to other regression models in the random design setting.[2] The knockoff framework has found application in many practical areas, notably in genome-wide association studies.[2][3]

Fixed-X knockoffs

Consider a linear regression model with response vector [math]\displaystyle{ \mathbf y }[/math] and feature matrix [math]\displaystyle{ \mathbf X }[/math], which is treated as deterministic. A matrix [math]\displaystyle{ \tilde{\mathbf X} }[/math] is said to be a knockoff of [math]\displaystyle{ \mathbf X }[/math] if it does not depend on [math]\displaystyle{ \mathbf y }[/math] and satisfies [math]\displaystyle{ \mathbf X_i^\top\mathbf X_j=\mathbf X_i^\top\tilde{\mathbf X}_j=\tilde{\mathbf X}_i^\top\mathbf X_j=\tilde{\mathbf X}_i^\top\tilde{\mathbf X}_j }[/math] for [math]\displaystyle{ i\ne j }[/math], together with [math]\displaystyle{ \tilde{\mathbf X}_i^\top\tilde{\mathbf X}_i=\mathbf X_i^\top\mathbf X_i }[/math] for every [math]\displaystyle{ i }[/math]; the remaining inner products [math]\displaystyle{ \mathbf X_i^\top\tilde{\mathbf X}_i }[/math] are free, and are typically made as small as possible so that the knockoffs are distinguishable from the original features. Barber and Candès showed that, equipped with a suitable feature importance statistic, fixed-X knockoffs can be used for variable selection while controlling the false discovery rate (FDR).
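
As an illustration, one way to construct such a matrix is the equicorrelated construction of Barber and Candès. The following minimal sketch assumes the columns of [math]\displaystyle{ \mathbf X }[/math] have unit norm and that [math]\displaystyle{ n\ge 2p }[/math]; the function name and NumPy implementation are illustrative rather than a reference implementation.

```python
import numpy as np

def fixed_x_knockoffs(X, rng=None):
    """Equicorrelated fixed-X knockoffs (sketch). Assumes unit-norm columns and n >= 2p,
    so that a p-dimensional subspace orthogonal to col(X) exists."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, p = X.shape
    Sigma = X.T @ X                                    # Gram matrix (p x p)
    lam_min = np.linalg.eigvalsh(Sigma)[0]             # smallest eigenvalue
    s = np.full(p, min(1.0, 2.0 * lam_min))            # equicorrelated choice of s
    Sigma_inv = np.linalg.inv(Sigma)

    # Orthonormal basis U of a p-dimensional subspace orthogonal to col(X).
    Q, _ = np.linalg.qr(X)
    R = rng.standard_normal((n, p))
    U, _ = np.linalg.qr(R - Q @ (Q.T @ R))

    # Symmetric square root C with C.T @ C = 2*diag(s) - diag(s) Sigma^{-1} diag(s).
    A = 2.0 * np.diag(s) - np.diag(s) @ Sigma_inv @ np.diag(s)
    evals, evecs = np.linalg.eigh(A)
    C = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

    return X @ (np.eye(p) - Sigma_inv @ np.diag(s)) + U @ C

# The defining identities can be checked numerically:
#   X_tilde.T @ X_tilde  equals  X.T @ X
#   X.T @ X_tilde        equals  X.T @ X - diag(s)  (same off-diagonal, smaller diagonal)
```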

Model-X knockoffs

Consider a general regression model with response vector [math]\displaystyle{ \mathbf y }[/math] and random feature matrix [math]\displaystyle{ \mathbf X }[/math]. A matrix [math]\displaystyle{ \tilde{\mathbf X} }[/math] is said to be a knockoff of [math]\displaystyle{ \mathbf X }[/math] if it is conditionally independent of [math]\displaystyle{ \mathbf y }[/math] given [math]\displaystyle{ \mathbf X }[/math] and satisfies a pairwise exchangeability condition: for any [math]\displaystyle{ j }[/math], the joint distribution of the random matrix [math]\displaystyle{ [\mathbf X,\tilde{\mathbf X}] }[/math] does not change if its [math]\displaystyle{ j }[/math]th and [math]\displaystyle{ (j+p) }[/math]th columns are swapped, where [math]\displaystyle{ p }[/math] is the number of features. Although constructing model-X knockoffs is less straightforward than constructing their fixed-X counterpart, various algorithms have been proposed.[2][3][4][5] Once constructed, model-X knockoffs can be used for variable selection with the same procedure as fixed-X knockoffs, again controlling the FDR.
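
As a concrete example, when the rows of [math]\displaystyle{ \mathbf X }[/math] are modeled as independent draws from a multivariate normal distribution with known covariance [math]\displaystyle{ \boldsymbol\Sigma }[/math], exact model-X knockoffs can be sampled from a Gaussian conditional distribution.[2] The following minimal sketch assumes this Gaussian model and a nonnegative vector [math]\displaystyle{ \mathbf s }[/math] such that [math]\displaystyle{ 2\boldsymbol\Sigma-\operatorname{diag}(\mathbf s) }[/math] is positive semidefinite; the function name is illustrative.

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, s, rng=None):
    """Sample model-X knockoffs for rows X_i ~ N(0, Sigma) (sketch).
    Requires 2*Sigma - diag(s) to be positive semidefinite."""
    rng = np.random.default_rng(0) if rng is None else rng
    p = Sigma.shape[0]
    Sinv_ds = np.linalg.solve(Sigma, np.diag(s))        # Sigma^{-1} diag(s)
    cond_mean = X - X @ Sinv_ds                         # E[X_tilde | X], row by row
    cond_cov = 2.0 * np.diag(s) - np.diag(s) @ Sinv_ds  # Cov[X_tilde | X]
    L = np.linalg.cholesky(cond_cov + 1e-12 * np.eye(p))
    return cond_mean + rng.standard_normal(X.shape) @ L.T
```

With this construction, each row of [math]\displaystyle{ [\mathbf X,\tilde{\mathbf X}] }[/math] has covariance with equal diagonal blocks [math]\displaystyle{ \boldsymbol\Sigma }[/math] and off-diagonal blocks [math]\displaystyle{ \boldsymbol\Sigma-\operatorname{diag}(\mathbf s) }[/math], so swapping the [math]\displaystyle{ j }[/math]th and [math]\displaystyle{ (j+p) }[/math]th columns preserves the joint distribution.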

Properties

The knockoffs [math]\displaystyle{ \tilde{\mathbf X} }[/math] can be understood as negative controls. Informally speaking, knockoffs are constructed so that no method can statistically distinguish the original matrix from its knockoffs without looking at [math]\displaystyle{ \mathbf y }[/math]. Mathematically, the exchangeability conditions translate into a symmetry that allows the relevant type I error to be estimated (for instance, when the FDR is the chosen type I error rate, the false discovery proportion is estimated), which in turn yields exact type I error control.
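
Concretely, the symmetry is exploited through antisymmetric importance statistics [math]\displaystyle{ W_j }[/math], whose signs flip when a feature is swapped with its knockoff, together with a data-dependent threshold whose numerator counts large negative statistics as an estimate of the number of false discoveries. The following sketch uses a deliberately simple marginal-correlation statistic (a lasso-based statistic is more common in practice); the function name is illustrative.

```python
import numpy as np

def knockoff_select(X, X_tilde, y, q=0.1):
    """Knockoff filter (sketch): antisymmetric statistics plus a data-dependent threshold."""
    # Antisymmetric importance statistic: swapping X_j with its knockoff flips the sign of W_j.
    W = np.abs(X.T @ y) - np.abs(X_tilde.T @ y)

    # Knockoff+ threshold: smallest t whose estimated false discovery proportion is <= q.
    tau = np.inf
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            tau = t
            break

    return np.where(W >= tau)[0]    # indices of selected features
```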

Model-X knockoffs provide valid type I error control regardless of the unknown conditional distribution of [math]\displaystyle{ \mathbf y }[/math] given [math]\displaystyle{ \mathbf X }[/math], and they work with black-box variable importance statistics, including those derived from complicated machine learning methods. The most significant challenge in implementing model-X knockoffs is that they require nontrivial knowledge of the distribution of [math]\displaystyle{ \mathbf X }[/math], which is usually high-dimensional. This knowledge can be gained with the help of unlabeled data.[2]
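
For example, under the Gaussian model sketched above, the feature covariance can be estimated from labeled and unlabeled rows pooled together before knockoffs are sampled. The snippet below is a hypothetical illustration of this idea: the data and variable names are invented, and it reuses the gaussian_knockoffs sketch from the previous section.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 20
# Hypothetical data: few labeled rows, many unlabeled rows (features only, no response).
X_labeled = rng.standard_normal((100, p))
X_unlabeled = rng.standard_normal((5000, p))

# Unlabeled rows require no response values, yet sharpen the estimate of the feature
# distribution; here, a simple pooled covariance estimate under a Gaussian model.
Sigma_hat = np.cov(np.vstack([X_labeled, X_unlabeled]), rowvar=False)

# Equicorrelated choice of s (assumes roughly standardized features), then sample
# knockoffs for the labeled rows with the gaussian_knockoffs sketch given above.
s = np.full(p, min(1.0, 2.0 * np.linalg.eigvalsh(Sigma_hat)[0]))
X_tilde = gaussian_knockoffs(X_labeled, Sigma_hat, s)
```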

References

  1. Barber, Rina Foygel; Candès, Emmanuel J. (2015). "Controlling the false discovery rate via knockoffs". Annals of Statistics 43 (5): 2055–2085.
  2. Candès, Emmanuel; Fan, Yingying; Janson, Lucas; Lv, Jinchi (2018). "Panning for gold: model-X knockoffs for high dimensional controlled variable selection". Journal of the Royal Statistical Society, Series B (Methodological) 80 (3): 551–577.
  3. Sesia, Matteo; Sabatti, Chiara; Candès, Emmanuel (2019). "Gene hunting with hidden Markov model knockoffs". Biometrika 106 (1): 1–18.
  4. Bates, Stephen; Candès, Emmanuel; Janson, Lucas; Wang, Wenshuo (2020). "Metropolized knockoff sampling". Journal of the American Statistical Association.
  5. Huang, Dongming; Janson, Lucas (2020). "Relaxing the assumptions of knockoffs by conditioning". Annals of Statistics.
