Differential Privacy: Not a complete disaster, I guess
In January of this year I wrote the article Dear Differential Privacy, Put Up or Shut Up. In that story I examined the release of a Facebook URL sharing dataset using Differential Privacy. The data quality was so poor that researchers could not carry out their studies.
The poor data quality led to a scathing editorial from the co-chairs and European Advisory Committee of Social Science One (SS1) demanding that Facebook (FB) release better data.
In February FB released a second dataset protected by Differential Privacy (DP). Joshua Tucker, a professor of politics and Russian studies at New York University, called the dataset “a huge step forward,” according to this article.
Indeed, two of the researchers I reached out to confirmed that they were much happier with this dataset and could in fact draw some research conclusions.
How has the data improved?
The new dataset (Full URLs) includes the demographics age, gender, location, and political ideology, which were not in the first DP dataset (URLs Light). It tracks monthly counts of URL events (views, shares, etc.) rather than only stating when the URL was first posted. It tracks this information per country, rather than simply stating which country had the most shares. Finally, it has less noise overall.
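To make that breakdown concrete, here is a rough sketch of what a single aggregated row might look like. The field names and values are my own illustration, not FB’s actual schema.

```python
# Hypothetical illustration of one aggregated row in the Full URLs
# dataset. Field names and values are my own, not FB's actual schema.
example_row = {
    "url_id": "u_000123",        # scrubbed, pseudonymous URL identifier
    "country": "US",             # per-country only, no per-state breakdown
    "year_month": "2019-03",     # monthly buckets, not weekly
    "age_bracket": "25-34",
    "gender": "female",
    "political_ideology": -1,    # e.g. a coarse left/right affinity score
    "views": 10482,              # counts of URL events, with DP noise added
    "shares": 1391,
    "clicks": 2204,
}
```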
The Full URLs dataset, however, is still worse than the originally proposed URL dataset from July 2018, which was protected with traditional anonymization mechanisms rather than with DP. The original dataset had weekly counts rather than monthly, per-state tracking rather than per-country, and no noise.
Has Differential Privacy put up? Should I shut up?
DP researchers tend to be glass-half-full kinds of folks. They like to focus on what DP can do, not on what it can’t. They set a low bar, declaring success when a DP mechanism can do anything useful at all. My fear is that the FB Full URLs dataset will be held up as a success for DP. It is anything but.
Even the originally proposed dataset was poor. The quality was worse than what census bureaus the world over release safely to the public, and far worse than medical data routinely released to scientists under HIPAA guidelines. The original dataset was essentially an MVP (minimum viable product), and it’s gone downhill from there.
Rather than focus on what the Full URL dataset can do, we should have a look at what it cannot do.
A researcher studying how FB affected an election in the USA would not even be able to distinguish red states from blue states in this data, much less urban versus rural areas, rich counties versus poor counties, predominantly white versus predominantly black areas, and so on.
The data removes any information about how sharing frequency differs from user to user. A researcher cannot tell whether a given URL is shared heavily by a small number of users or a little by a large number of users. This is a fundamentally important question for understanding how sharing takes place.
Important information about timing is also lost. News cycles play out over a matter of days. From this data, a researcher would not be able to tell, for instance, whether URLs are predominantly shared during the news cycle or whether sharing continues for some days afterwards.
How did FB manage to improve on the URL Light dataset?
Before I go on, I want to make two things clear. First, the data released by FB is very private — excessively so. Second, of the practical DP data releases I’ve seen, this one stays closest to the spirit of DP.
DP is a curious thing. Its value lies in the fact that it assigns a numeric privacy loss parameter to a data release, and that this parameter is derived through a mathematical proof. In an attempt to get better data quality, however, DP researchers have designed a number of approximations: privacy models that weaken the guarantees in one sense or another. The Full URLs dataset uses a slightly weaker guarantee than the URLs Light dataset.
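To give a sense of what that privacy loss parameter does in practice, here is a minimal sketch of the textbook Laplace mechanism applied to a single count query. This is generic DP machinery, not FB’s actual mechanism, and the epsilon values below are arbitrary illustrations, not the parameters FB used.

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float) -> float:
    """Return an epsilon-DP estimate of a counting query.

    A counting query has sensitivity 1 (adding or removing one user
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon satisfies epsilon-differential privacy.
    """
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon means stronger privacy but noisier answers.
print(noisy_count(10_000, epsilon=0.1))  # noise scale 10
print(noisy_count(10_000, epsilon=2.0))  # noise scale 0.5
```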
A common way for data practitioners to improve data quality with DP is to fudge a bit on the math: simply assume that certain aspects of the mechanism are safe and then leave them out of the calculation. This is why different researchers can come up with different privacy loss parameters for the same mechanism, as happened between Apple and the researchers who reverse engineered Apple’s mechanism.
FB does this here in two ways. First, the data released about URLs is not protected by DP. Rather, it is the data about users (sharing events and demographics) that is DP protected. There is a possibility that the URL itself reveals something about a user, and FB addresses this by carefully scrubbing the URLs before releasing them. If FB tried to protect the URLs themselves with DP, the data quality would be even worse.
Second, in its analysis of the privacy loss measure of the Full URLs dataset, FB ignores the fact that the URLs Light dataset had already been released. This means that the privacy computations in the Full URLs dataset description are, in some pure mathematical sense, wrong. The reason is that the privacy loss measure for DP is additive: the more data released, the higher the privacy loss. Since FB already released the URLs Light dataset, to be strictly correct the privacy loss measures published for the Full URLs dataset should include the privacy loss from the first release.
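For completeness, the composition rule being invoked here is the standard one, stated below for the pure epsilon variant of DP. The labels on the epsilons are mine, and FB’s actual guarantee is a slightly relaxed variant.

```latex
% Sequential composition: if one release satisfies eps_1-DP and a second
% release satisfies eps_2-DP, then the two releases taken together
% satisfy (eps_1 + eps_2)-DP.
\[
  \varepsilon_{\text{total}}
    = \varepsilon_{\text{URLs Light}} + \varepsilon_{\text{Full URLs}}
\]
```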
To be clear again, this does not mean that the combined releases are insecure. But it does seem to me that if one is going to spend four pages on math, one ought to at least get it right (or, failing that, state one’s assumptions).
But if not DP, then what?
I don’t want to be one of those guys who whines about somebody else’s work without offering an alternative. I have two:
- Release the data to researchers with protections similar to what HIPAA prescribes for medical data.
- Use Diffix.
Medical data is routinely released for research under HIPAA guidelines with very little loss of data quality. This data is not anonymous by GDPR standards, but when it is released to a limited number of researchers under strict guidelines, the release is quite safe in practice. It is really unfortunate that the kind of scrutiny FB is under prevents them from doing this.
Diffix is my own work, developed jointly by my research group at the Max Planck Institute for Software Systems and Aircloak GmbH. Diffix provides strong anonymity while still allowing for quite good data quality. It certainly would support all of the analyses listed above and more. Diffix does not have mathematical guarantees, but it is transparent and is even open to bounty programs.
Although Diffix is in use by several large German companies for internal use cases, it is clearly not at the level of maturity where FB would be comfortable using it. Nevertheless, I believe that efforts like Diffix, where the focus is on private but practical analytics instead of mathematical guarantees, are the best way forward.