Drift (data science)

From HandWiki
Short description: Gradual changes in data

In data science and related fields, drift is an evolution of data that invalidates the data model. Identifying data drift is important in machine learning and data mining, as well as in the maintenance of large software systems. Drift detection and drift adaptation are of paramount importance in fields that involve dynamically changing data and data models.

Predictive model decay

Main page: Concept drift

In machine learning and predictive analytics, this drift phenomenon is called concept drift. In machine learning, a common element of a data model is the set of statistical properties of the actual data, such as its probability distribution. If these deviate from the statistical properties of the training data set and the drift is not addressed, the learned predictions may become invalid.[1][2][3][4]
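As an illustration, one common way to check whether live data has drifted away from the training distribution is a two-sample Kolmogorov–Smirnov test on a single numeric feature. The sketch below is not taken from the cited sources; the function names `ks_statistic` and `drift_detected` and the default significance level are assumptions chosen for this example, and the critical value uses the standard large-sample approximation.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples."""
    ref = sorted(reference)
    cur = sorted(current)
    largest_gap = 0.0
    # The gap between two step functions can only peak at an observed value.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_cur))
    return largest_gap


def drift_detected(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the large-sample critical value.

    alpha is an assumed significance level, not a value from the article.
    """
    n, m = len(reference), len(current)
    c_alpha = math.sqrt(-math.log(alpha / 2) / 2)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


if __name__ == "__main__":
    training_feature = list(range(100))                 # stand-in for training data
    live_feature = [x + 50 for x in training_feature]   # same shape, shifted values
    print(drift_detected(training_feature, live_feature))
```

A check of this kind only covers one feature at a time and says nothing about how to adapt the model once drift is found.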

Data configuration decay

Another important area is software engineering, where three types of data drift affecting data fidelity may be recognized. Changes in the software environment ("infrastructure drift") may invalidate software infrastructure configuration. "Structural drift" happens when the data schema changes, which may invalidate databases. "Semantic drift" is a change in the meaning of data while the structure does not change. In many cases this may happen in complicated applications when many independent developers introduce changes without proper awareness of the effects of their changes on other areas of the software system.[5][6]

For many application systems, the nature of the data on which they operate is subject to change for various reasons, e.g., changes in the business model, system updates, or a switch of the platform on which the system operates.[6]

In cloud computing, infrastructure drift that affects applications running in the cloud may be caused by updates to the cloud software.[5]

There are several types of detrimental effects of data drift on data fidelity. Data corrosion is the passing of drifted data into the system undetected. Data loss happens when valid data are ignored because they do not conform to the applied schema. Squandering occurs when new data fields are introduced upstream in the data processing pipeline but are absent somewhere downstream.[6]
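A minimal sketch of how structural drift might be surfaced before it causes these effects is to compare each incoming record against the schema the downstream system expects. The schema, its field names, and the function `check_record` are hypothetical and chosen only for illustration.

```python
# Hypothetical schema that a downstream consumer is assumed to expect.
EXPECTED_SCHEMA = {"customer_id": int, "email": str, "balance": float}


def check_record(record, schema=EXPECTED_SCHEMA):
    """Report how a record deviates from the expected schema."""
    missing = sorted(name for name in schema if name not in record)
    # Fields added upstream that downstream code would silently drop.
    unexpected = sorted(name for name in record if name not in schema)
    wrong_type = sorted(
        name
        for name, expected_type in schema.items()
        if name in record and not isinstance(record[name], expected_type)
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


if __name__ == "__main__":
    drifted = {
        "customer_id": 7,
        "email": "a@example.com",
        "balance": "12.50",       # upstream started sending a string
        "loyalty_tier": "gold",   # field introduced upstream
    }
    print(check_record(drifted))
```

Reporting unexpected fields instead of discarding them is one way to notice squandering, and reporting mismatches instead of rejecting the record outright avoids silent data loss. A structural check like this cannot detect semantic drift, where the schema stays the same but the meaning of a field changes.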

Inconsistent data

"Data drift" may refer to the phenomenon in which database records fail to match the real-world data because the latter change over time. This is a common problem with databases involving people, such as customers, employees, citizens, or residents. Human data drift may be caused by unrecorded changes in personal data, such as place of residence or name, as well as by errors during data input.[7]

"Data drift" may also refer to inconsistency of data elements across several replicas of a database. The reasons can be difficult to identify. A simple way to detect such drift is to run checksums regularly. The remedy, however, may not be so easy.[8]
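The checksum approach can be sketched as follows, assuming the rows of a table have already been fetched from each replica as lists of tuples. The helper names `table_checksum` and `replicas_in_sync` are hypothetical, and the code does not reproduce any particular database tool.

```python
import hashlib


def table_checksum(rows):
    """SHA-256 digest over all rows, independent of the order they were fetched in.

    Assumes rows are tuples of mutually comparable values (no NULL/None mixed
    with other types), so that they can be sorted.
    """
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def replicas_in_sync(primary_rows, replica_rows):
    """True when both replicas hold exactly the same rows."""
    return table_checksum(primary_rows) == table_checksum(replica_rows)


if __name__ == "__main__":
    primary = [(1, "alice"), (2, "bob")]
    replica = [(2, "bob"), (1, "alice")]      # same rows, different fetch order
    drifted = [(1, "alice"), (2, "robert")]   # one value has diverged
    print(replicas_in_sync(primary, replica), replicas_in_sync(primary, drifted))
```

A mismatch only shows that the replicas differ. Locating the diverging rows, for example by checksumming smaller key ranges, and deciding which copy is correct are separate steps.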

See also

  • Snyk, a company whose portfolio includes drift detection in software applications

References

  1. Koggalahewa, Darshika; Xu, Yue; Foo, Ernest (2021). "A Drift Aware Hierarchical Test Based Approach for Combating Social Spammers in Online Social Networks". Data Mining. Communications in Computer and Information Science. 1504. pp. 47–61. doi:10.1007/978-981-16-8531-6_4. ISBN 978-981-16-8530-9. 
  2. Widmer, Gerhard; Kubat, Miroslav (1996). "Learning in the presence of concept drift and hidden contexts". Machine Learning 23: 69–101. doi:10.1007/BF00116900. 
  3. Xia, Yuan; Zhao, Yunlong (2020). "A Drift Detection Method Based on Diversity Measure and McDiarmid's Inequality in Data Streams". Green, Pervasive, and Cloud Computing. Lecture Notes in Computer Science. 12398. pp. 115–122. doi:10.1007/978-3-030-64243-3_9. ISBN 978-3-030-64242-6. 
  4. Lu, Jie; Liu, Anjin; Dong, Fan; Gu, Feng; Gama, Joao; Zhang, Guangquan (2018). "Learning under Concept Drift: A Review". IEEE Transactions on Knowledge and Data Engineering: 1. doi:10.1109/TKDE.2018.2876857. 
  5. "Driftctl and Terraform, they're two of a kind!"
  6. Girish Pancha, "Big Data's Hidden Scourge: Data Drift", CMSWire, April 8, 2016
  7. Matthew Magne, "Data Drift Happens: 7 Pesky Problems with People Data", InformationWeek, July 19, 2017
  8. Daniel Nichter, Efficient MySQL Performance, 2021, ISBN 1098105060, p. 299