Dataset shift
Dataset shift is a phenomenon in machine learning and statistics in which the joint distribution of input variables and target labels differs between the training phase and the deployment or test phase (i.e., $P_{\text{train}}(x, y) \neq P_{\text{test}}(x, y)$).[1][2][3] It occurs when the statistical properties of the data used to train a model are no longer representative of the data encountered in real-world use, and it often degrades predictive performance and generalization ability.[4][5]
Dataset shift is an umbrella term for several specific types of distributional change. Covariate shift occurs when the distribution of the input features changes but the conditional relationship between inputs and outputs remains constant ($P_{\text{train}}(y \mid x) = P_{\text{test}}(y \mid x)$).[6][7] Prior probability shift (or label shift) occurs when the distribution of target labels changes but the conditional distribution of inputs given labels stays the same.[8][9] Concept shift (also known as concept drift) is a change in the conditional relationship between inputs and outputs, which can invalidate previously learned patterns over time.[10]
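The three types can be told apart by checking which factor of the joint distribution stays fixed. The following Python sketch is illustrative only: it assumes small discrete distributions stored as tables `{(x, y): probability}`, and the helper names (`marginal`, `conditional`, `describe_shift`) and the example numbers are hypothetical, not taken from the cited sources. Any change that preserves neither conditional is reported as concept shift, which is a simplification.

```python
from collections import defaultdict


def marginal(joint, axis):
    """Sum a joint table {(x, y): p} down to P(x) (axis=0) or P(y) (axis=1)."""
    out = defaultdict(float)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)


def conditional(joint, given_axis):
    """given_axis=0 yields P(y | x); given_axis=1 yields P(x | y)."""
    m = marginal(joint, given_axis)
    return {pair: p / m[pair[given_axis]]
            for pair, p in joint.items() if m[pair[given_axis]] > 0}


def same(a, b, tol=1e-9):
    """Compare two probability tables entry by entry, up to a tolerance."""
    return all(abs(a.get(k, 0.0) - b.get(k, 0.0)) <= tol for k in set(a) | set(b))


def describe_shift(train, test):
    """Name the kind of shift between two joint tables (simplified)."""
    if same(train, test):
        return "no shift"
    if same(conditional(train, 0), conditional(test, 0)):
        return "covariate shift"          # P(y | x) fixed, so P(x) changed
    if same(conditional(train, 1), conditional(test, 1)):
        return "prior probability shift"  # P(x | y) fixed, so P(y) changed
    return "concept shift"                # P(y | x) itself changed


# Training joint: P(x) = {a: 0.5, b: 0.5}, P(y=1 | a) = 0.2, P(y=1 | b) = 0.8.
train = {("a", 0): 0.40, ("a", 1): 0.10, ("b", 0): 0.10, ("b", 1): 0.40}
# Same P(y | x), but inputs now favour "a": P(x) = {a: 0.8, b: 0.2}.
covariate = {("a", 0): 0.64, ("a", 1): 0.16, ("b", 0): 0.04, ("b", 1): 0.16}
# Same P(x | y), but labels now favour 0: P(y) = {0: 0.9, 1: 0.1}.
prior = {("a", 0): 0.72, ("a", 1): 0.02, ("b", 0): 0.18, ("b", 1): 0.08}
# Same P(x), but the input-output relationship is flipped.
concept = {("a", 0): 0.10, ("a", 1): 0.40, ("b", 0): 0.40, ("b", 1): 0.10}

for name, test in [("covariate", covariate), ("prior", prior), ("concept", concept)]:
    print(name, "->", describe_shift(train, test))
```

In practice the true distributions are unknown and must be estimated from samples, so exact equality checks like these are replaced by statistical tests.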
Dataset shift is a key challenge in deploying machine learning systems, particularly in dynamic environments where data distributions change over time. Detecting and mitigating such shifts is an active area of research that includes drift detection, domain adaptation, and continual learning.[11]
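As one simple illustration of drift detection, a single numeric feature can be monitored by comparing a reference sample (from training) against a recent sample (from deployment) with a two-sample Kolmogorov–Smirnov test. The sketch below uses only the Python standard library. The function names, the synthetic Gaussian data, and the choice of this particular test are assumptions made for the example, not a method prescribed by the cited sources. The threshold uses the standard large-sample approximation for the critical value.

```python
import math
import random


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def drift_detected(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the asymptotic critical value."""
    n, m = len(reference), len(current)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


rng = random.Random(0)  # fixed seed so the synthetic example is repeatable
reference = [rng.gauss(0.0, 1.0) for _ in range(500)]  # stand-in for training data
shifted = [rng.gauss(1.0, 1.0) for _ in range(500)]    # feature mean moved by 1.0

print("drift on shifted data:", drift_detected(reference, shifted))
```

A univariate test of this kind only sees changes in the input distribution of one feature; it cannot by itself reveal concept shift, where the inputs may look unchanged while the input-output relationship moves.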
References
- Moreno-Torres, José G. (2012). "A unifying view on dataset shift in classification". Pattern Recognition 45 (1): 521–530. doi:10.1016/j.patcog.2011.06.019.
- Quiñonero-Candela, Joaquin, ed. (2010). Dataset shift in machine learning. Neural information processing series. Cambridge, Mass.: MIT Press. ISBN 978-0-262-17005-5.
- "Dataset shift". https://neuralnetworklexicon.com/data-and-distribution/dataset-shift/.
- Kumar, Rajesh. "What is dataset shift?". https://aiopsschool.com/blog/dataset-shift/.
- Bayram, Firas; Ahmed, Bestoun S.; Kassler, Andreas (2022-06-07). "From concept drift to model degradation: An overview on performance-aware drift detectors". Knowledge-Based Systems 245. doi:10.1016/j.knosys.2022.108632. ISSN 0950-7051. https://www.sciencedirect.com/science/article/pii/S0950705122002854.
- Shimodaira, Hidetoshi (2000-10-01). "Improving predictive inference under covariate shift by weighting the log-likelihood function". Journal of Statistical Planning and Inference 90 (2): 227–244. doi:10.1016/S0378-3758(00)00115-4. ISSN 0378-3758. https://www.sciencedirect.com/science/article/pii/S0378375800001154.
- Raitoharju, Jenni (2022-01-01). "Convolutional neural networks". Deep Learning for Robot Perception and Cognition. Academic Press. pp. 35–69. doi:10.1016/B978-0-32-385787-1.00008-7. ISBN 978-0-323-85787-1. https://www.sciencedirect.com/science/chapter/edited-volume/abs/pii/B9780323857871000087. Retrieved 2026-04-28.
- "Dataset shift explanation". https://insightful-data-lab.com/2025/08/20/dataset-shift/.
- Huyen, Chip (2022). Designing machine learning systems: an iterative process for production-ready applications (1st ed.). Sebastopol, CA: O'Reilly Media, Inc. ISBN 978-1-0981-0796-3.
- Silva, Gabriel Ferreira dos Santos; Barcellos Filho, Fabiano Novaes; Wichmann, Roberta Moreira; da Silva Junior, Francisco Costa; Chiavegatto Filho, Alexandre Dias Porto (2025-10-01). "Strategies for detecting and mitigating dataset shift in machine learning for health predictions: A systematic review". Journal of Biomedical Informatics 170. doi:10.1016/j.jbi.2025.104902. ISSN 1532-0464. PMID 40876698. https://www.sciencedirect.com/science/article/pii/S1532046425001315.
- "Drift in machine learning". https://dataeval.readthedocs.io/en/v0.86.8/concepts/Drift.html.
