Unicity (data analysis)

From HandWiki

Unicity ([math]\displaystyle{ \varepsilon_p }[/math]) is a risk metric for measuring the re-identifiability of high-dimensional anonymous data. First introduced in 2013,[1] unicity is measured by the number of points p needed to uniquely identify an individual in a data set. The fewer points needed, the more unique the traces are and the easier they would be to re-identify using outside information.

In a high-dimensional human behavioural data set, such as mobile phone metadata, there are potentially thousands of different records for each person. In the case of mobile phone metadata, credit card transaction histories and many other types of personal data, these records include the time and location of an individual.

In research, unicity is widely used to illustrate the re-identifiability of anonymous data sets. In 2013,[1] researchers from the MIT Media Lab showed that only 4 points were needed to uniquely identify 95% of individual trajectories in a de-identified data set of 1.5 million mobility trajectories. These points were location-time pairs recorded at a temporal resolution of 1 hour and a spatial resolution of 0.15 km² to 15 km². These results were shown to hold for credit card transaction data as well,[2] with 4 points being enough to re-identify 90% of trajectories. Further research studied the unicity of the apps installed by people on their smartphones,[3] the trajectories of vehicles,[4] mobile phone data from Boston and Singapore,[5] and public transport data in Singapore obtained from smartcards.[6]

Measuring unicity

Unicity ([math]\displaystyle{ \varepsilon_p }[/math]) is formally defined as the expected value of the fraction of uniquely identifiable trajectories, given p points selected from those trajectories uniformly at random. A full computation of [math]\displaystyle{ \varepsilon_p }[/math] for a data set [math]\displaystyle{ D }[/math] requires, for each trajectory [math]\displaystyle{ T_i \in D }[/math], considering every possible set of p points drawn from [math]\displaystyle{ T_i }[/math] and checking whether any other trajectory also contains those p points. Averaging over all such sets for every trajectory gives the value of [math]\displaystyle{ \varepsilon_p }[/math]. This is usually prohibitively expensive,[3] because the number of possible sets of p points grows combinatorially with the length of a trajectory, and trajectories sometimes contain thousands of points.[1][2]
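As an illustration of the full computation described above, the following Python sketch enumerates every set of p points for every trajectory in a small toy data set. It assumes that each trajectory is represented as a set of (location, hour) tuples. The function name `exact_unicity`, the toy data and the choice to skip trajectories with fewer than p points are assumptions made for this example and are not taken from the cited papers.

```python
from itertools import combinations


def exact_unicity(dataset, p):
    """Brute-force unicity of a list of trajectories (sets of points).

    For each trajectory, compute the fraction of its p-point subsets that
    are contained in no other trajectory, then average over trajectories.
    """
    per_trajectory = []
    for i, trajectory in enumerate(dataset):
        if len(trajectory) < p:
            continue  # assumption: trajectories shorter than p are skipped
        n_subsets = 0
        n_unique = 0
        # The number of subsets is C(len(trajectory), p), which is why this
        # approach does not scale to trajectories with thousands of points.
        for subset in combinations(trajectory, p):
            n_subsets += 1
            points = set(subset)
            matched_elsewhere = any(
                points <= other
                for j, other in enumerate(dataset)
                if j != i
            )
            if not matched_elsewhere:
                n_unique += 1
        per_trajectory.append(n_unique / n_subsets)
    if not per_trajectory:
        raise ValueError("no trajectory has at least p points")
    return sum(per_trajectory) / len(per_trajectory)


# Hypothetical toy data: three trajectories of (location, hour) points.
toy = [
    frozenset({("A", 8), ("B", 12), ("C", 18)}),
    frozenset({("A", 8), ("B", 12), ("D", 20)}),
    frozenset({("E", 9), ("F", 13), ("G", 19)}),
]

for p in (1, 2, 3):
    print(p, round(exact_unicity(toy, p), 3))
# 1 0.556
# 2 0.778
# 3 1.0
```

In the toy data the first two trajectories share two of their three points, so a single known point identifies them only one time in three, while all three points always do. The value rises with p, which matches the intuition that more outside information makes re-identification easier.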

Instead, unicity is usually estimated using sampling techniques. Specifically, given a data set [math]\displaystyle{ D }[/math], the estimated unicity is computed by sampling a subset [math]\displaystyle{ S }[/math] of the trajectories in [math]\displaystyle{ D }[/math] and then checking whether each trajectory [math]\displaystyle{ T_j \in S }[/math] is unique in [math]\displaystyle{ D }[/math] given p randomly selected points from [math]\displaystyle{ T_j }[/math]. The fraction of [math]\displaystyle{ S }[/math] that is uniquely identifiable is then the unicity estimate.
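The sampling procedure above can be sketched in Python as follows. This is a minimal illustration and not the implementation used in the cited studies: the function name `estimate_unicity`, the `sample_fraction` and `seed` parameters, the representation of a trajectory as a set of (location, hour) tuples, and the exclusion of trajectories with fewer than p points are all assumptions made for this example.

```python
import random


def estimate_unicity(dataset, p, sample_fraction=0.1, seed=None):
    """Estimate unicity by sampling trajectories and one p-point draw each.

    dataset is a list of trajectories, each a set of sortable points.
    """
    rng = random.Random(seed)
    # Assumption: only trajectories with at least p points can be tested.
    eligible = [i for i, t in enumerate(dataset) if len(t) >= p]
    if not eligible:
        raise ValueError("no trajectory has at least p points")
    sample_size = max(1, round(sample_fraction * len(eligible)))
    sample = rng.sample(eligible, sample_size)

    n_unique = 0
    for i in sample:
        # random.sample needs a sequence, so the set is sorted first;
        # sorting also keeps results reproducible for a fixed seed.
        points = set(rng.sample(sorted(dataset[i]), p))
        # The trajectory is unique if no other trajectory in the full
        # data set contains all p drawn points.
        matched_elsewhere = any(
            points <= other
            for j, other in enumerate(dataset)
            if j != i
        )
        if not matched_elsewhere:
            n_unique += 1
    return n_unique / sample_size


# Hypothetical toy data: three trajectories of (location, hour) points.
toy = [
    {("A", 8), ("B", 12), ("C", 18)},
    {("A", 8), ("B", 12), ("D", 20)},
    {("E", 9), ("F", 13), ("G", 19)},
]
print(estimate_unicity(toy, p=3, sample_fraction=1.0, seed=0))  # 1.0
```

Each sampled trajectory is still compared against the whole data set, so the cost is roughly the sample size times the number of trajectories, rather than the combinatorial number of point sets needed for the full computation. Because only one random set of p points is drawn per sampled trajectory, the result is an estimate that varies with the seed when p is smaller than the trajectory length.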

See also

  • Quasi-identifier
  • Personally Identifiable Information

References

  1. de Montjoye, Yves-Alexandre; Hidalgo, César A.; Verleysen, Michel; Blondel, Vincent D. (2013). "Unique in the Crowd: The privacy bounds of human mobility". Scientific Reports 3: 1376. doi:10.1038/srep01376. PMID 23524645. Bibcode: 2013NatSR...3E1376D.
  2. de Montjoye, Yves-Alexandre; Radaelli, Laura; Singh, Vivek Kumar; Pentland, Alex "Sandy" (2015). "Unique in the shopping mall: On the reidentifiability of credit card metadata". Science 347 (6221): 536–539. doi:10.1126/science.1256297. PMID 25635097. Bibcode: 2015Sci...347..536D.
  3. Achara, Jagdish Prasad; Acs, Gergely; Castelluccia, Claude (2015). "On the Unicity of Smartphone Applications". Proceedings of the 14th ACM Workshop on Privacy in the Electronic Society. ACM. pp. 27–36. doi:10.1145/2808138.2808146. ISBN 9781450338202. Bibcode: 2015arXiv150707851P. https://dl.acm.org/citation.cfm?id=2808146. Retrieved 2018-11-22.
  4. Pellungrini, Roberto; Pappalardo, Luca; Pratesi, Francesca; Monreale, Anna (2018). "A Data Mining Approach to Assess Privacy Risk in Human Mobility Data". ACM Transactions on Intelligent Systems and Technology (ACM) 9 (3): 1–27. doi:10.1145/3106774.
  5. Xu, Yang; Belyi, Alexander; Bojic, Iva; Ratti, Carlo (2018). "Human mobility and socioeconomic status: Analysis of Singapore and Boston". Computers, Environment and Urban Systems (Elsevier) 72 (November 2018): 51–67. doi:10.1016/j.compenvurbsys.2018.04.001. https://doi.org/10.1016/j.compenvurbsys.2018.04.001. Retrieved 2018-11-22. 
  6. Kondor, Daniel; Hashemian, Behrooz; de Montjoye, Yves-Alexandre (2018). "Towards matching user mobility traces in large-scale datasets". IEEE Transactions on Big Data (IEEE) 6 (4): 714–726. doi:10.1109/TBDATA.2018.2871693.