Can the items you buy be used to identify you?
This article first appeared on privacyassociation.org.
A Science magazine article analyzes credit card transaction data on 1.1 million individuals from an unnamed OECD country. One of the authors’ key conclusions is that, using only four transactions, 90% of the people are unique. More precisely, they state that 90% of the people can be re-identified, which ignores the distinction between uniqueness and re-identification.
This conclusion has since been repeated uncritically in the scientific and general media.
Khaled El Emam wrote a critique of the article on BMJ.com.
A credit card transaction consists of a date and a shop. For example, Sally may have gone to Pharmacy-R-Us on December 26th and Butcher Joe on December 27th. This would be an example of a two-transaction trace for Sally. If Sally’s transaction trace is unique, it means that she is the only person who has that particular trace (i.e., she is the only person who shopped at these two locations on these two days).
If an adversary wants to re-identify individuals, she needs to have background information about the data subject being re-identified. The authors appear to assume that the adversary would know when and where the transaction occurred, as well as the price.
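The idea of a unique trace can be sketched in a few lines of Python. This is a toy illustration only: the names other than Sally, the second shop, and the helper `unique_people` are hypothetical, and it simplifies matters by treating a person's whole trace as the key rather than sampling a few transactions from a longer history.

```python
from collections import Counter

# Hypothetical toy data: each person maps to a set of (date, shop) transactions.
transactions = {
    "Sally": {("12-26", "Pharmacy-R-Us"), ("12-27", "Butcher Joe")},
    "Tom": {("12-26", "Pharmacy-R-Us"), ("12-27", "Corner Bakery")},
    "Ann": {("12-26", "Pharmacy-R-Us"), ("12-27", "Corner Bakery")},
}

def unique_people(traces):
    """Return the names whose trace is shared with nobody else in the data."""
    # Count how many people hold each distinct trace.
    counts = Counter(frozenset(t) for t in traces.values())
    return sorted(name for name, t in traces.items()
                  if counts[frozenset(t)] == 1)

print(unique_people(transactions))
```

Here only Sally is unique; Tom and Ann share the same two-transaction trace, so neither can be singled out from the trace alone.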
There is reason to believe that the 1.1 million people are a sample from a national population of approximately 22 million adults who could have credit cards. This means that the 1.1 million individuals in the data that was analyzed represent only five percent of the population. A basic principle in measuring the risk of re-identification is that risk must be measured on the population, not on the sample. If 90% of the sample is unique, that does not necessarily mean that 90% of the population is unique on a trace of four credit card transactions. In fact, the number of unique individuals in the population could be much smaller and you could still have 90% unique individuals in the 1.1 million sample.
The best way to illustrate the implications of this is with a simulation of a 5% sample containing 1.1 million people. It is very unlikely that 90% of the people in the 1.1-million-person sample would be unique if the population were also 90% unique. In fact, the population needs less than 1% uniqueness to produce 90% uniqueness in the Science magazine study. A much more likely conclusion from the data is that less than 1% of the population is unique on four credit card transactions. The key point here is that having 90% unique individuals in the sample data does not translate directly to 90% unique individuals in the population, and the discrepancy can be huge.
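A minimal sketch of such a simulation follows. It is not the simulation from the critique and does not use the Science data: the function `sample_uniqueness`, the scaled-down population of 300,000, and the assumption that every person shares a trace with exactly two others are all hypothetical choices made to keep the example small. Under that assumption nobody in the population is unique, yet a sampled person is unique in a 5% sample whenever neither of the two matching people is drawn, which happens with probability of roughly 0.95 × 0.95 ≈ 0.90.

```python
import random
from collections import Counter

def sample_uniqueness(population_size, class_size, fraction, seed=0):
    """Fraction of sampled people who are unique within the sample.

    Assumption (hypothetical): person i has the same trace as everyone else
    with the same value of i // class_size, so with class_size > 1 nobody
    is unique in the population.
    """
    rng = random.Random(seed)
    sample_n = int(population_size * fraction)
    # Simple random sample without replacement from the population.
    sampled = rng.sample(range(population_size), sample_n)
    # Count how many sampled people fall in each trace class.
    counts = Counter(i // class_size for i in sampled)
    unique = sum(1 for i in sampled if counts[i // class_size] == 1)
    return unique / sample_n

# 0% population uniqueness (classes of three), 5% sampling fraction.
print(f"sample uniqueness: {sample_uniqueness(300_000, 3, 0.05):.3f}")
```

The printed value should land close to 0.90 even though population uniqueness is zero by construction, which is the gap between sample and population that the argument above turns on.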
The authors of the study drew conclusions based on uniqueness in the sample, which inflates the re-identification risks, especially when the sample is as small as 5%. This is a basic disclosure control principle. The actual risk value needs to be computed from the population.
The analysis in that article was incorrect and the estimates very likely exaggerated the re-identification risk.
https://privacyassociation.org/news/a/on-re-identification-not-really-unique-in-the-shopping-mall/