Differential privacy

[Figure: informal definition of differential privacy]

Differential privacy (DP) is a mathematically rigorous framework for releasing statistical information about datasets while protecting the privacy of individual data subjects. It enables a data holder to share aggregate patterns of the group while limiting information that is leaked about specific individuals.{{cite web| title=Differential Privacy: A Historical Survey| author1=Hilton, M| author2=Cal| url=https://www.semanticscholar.org/paper/Differential-Privacy-%3A-A-Historical-Survey-Hilton-Cal/4c99097af05e8de39370dd287c74653b715c8f6a| publisher=Semantic Scholar| date=2012| access-date=31 December 2023| s2cid=16861132}}{{Cite book |title=Theory and Applications of Models of Computation |volume=4978 |last=Dwork |first=Cynthia |date=2008-04-25 |publisher=Springer Berlin Heidelberg |isbn=978-3-540-79227-7 |editor-last=Agrawal |editor-first=Manindra |series=Lecture Notes in Computer Science |pages=1–19 |language=en |chapter=Differential Privacy: A Survey of Results |doi=10.1007/978-3-540-79228-4_1 |s2cid=2887752 |editor-last2=Du |editor-first2=Dingzhu |editor-last3=Duan |editor-first3=Zhenhua |editor-last4=Li |editor-first4=Angsheng |chapter-url=https://www.microsoft.com/en-us/research/publication/differential-privacy-a-survey-of-results/}} This is done by injecting carefully calibrated noise into statistical computations such that the utility of the statistic is preserved while provably limiting what can be inferred about any individual in the dataset.

Another way to describe differential privacy is as a constraint on the algorithms used to publish aggregate information about a statistical database which limits the disclosure of private information of records in the database. For example, differentially private algorithms are used by some government agencies to publish demographic information or other statistical aggregates while ensuring confidentiality of survey responses, and by companies to collect information about user behavior while controlling what is visible even to internal analysts.

Roughly, an algorithm is differentially private if an observer seeing its output cannot tell whether a particular individual's information was used in the computation. Differential privacy is often discussed in the context of identifying individuals whose information may be in a database. Although it does not directly refer to identification and reidentification attacks, differentially private algorithms provably resist such attacks.[https://link.springer.com/chapter/10.1007%2F11681878_14 Calibrating Noise to Sensitivity in Private Data Analysis] by Cynthia Dwork, Frank McSherry, Kobbi Nissim, Adam Smith. In Theory of Cryptography Conference (TCC), Springer, 2006. {{doi|10.1007/11681878_14}}. The [https://journalprivacyconfidentiality.org/index.php/jpc/article/view/405 full version] appears in Journal of Privacy and Confidentiality, 7 (3), 17-51. {{doi|10.29012/jpc.v7i3.405}}

ε-differential privacy

[Figure: formal definition of differential privacy]

The 2006 Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith article introduced the concept of ε-differential privacy, a mathematical definition for the privacy loss associated with any data release drawn from a statistical database.{{Cite journal |last=HILTON |first=MICHAEL |title=Differential Privacy: A Historical Survey |url=https://pdfs.semanticscholar.org/4c99/097af05e8de39370dd287c74653b715c8f6a.pdf |url-status=dead |s2cid=16861132 |archive-url=https://web.archive.org/web/20170301180826/https://pdfs.semanticscholar.org/4c99/097af05e8de39370dd287c74653b715c8f6a.pdf |archive-date=2017-03-01}} (Here, the term statistical database means a set of data that are collected under the pledge of confidentiality for the purpose of producing statistics that, by their production, do not compromise the privacy of those individuals who provided the data.)

The definition of ε-differential privacy requires that a change to one entry in a database only creates a small change in the probability distribution of the outputs of measurements, as seen by the attacker. The intuition for the definition of ε-differential privacy is that a person's privacy cannot be compromised by a statistical release if their data are not in the database.{{Cite book |last=Dwork |first=Cynthia |chapter=Differential Privacy: A Survey of Results |series=Lecture Notes in Computer Science |date=2008 |volume=4978 |editor-last=Agrawal |editor-first=Manindra |editor2-last=Du |editor2-first=Dingzhu |editor3-last=Duan |editor3-first=Zhenhua |editor4-last=Li |editor4-first=Angsheng |title=Theory and Applications of Models of Computation |chapter-url=https://link.springer.com/chapter/10.1007/978-3-540-79228-4_1 |language=en |location=Berlin, Heidelberg |publisher=Springer |pages=1–19 |doi=10.1007/978-3-540-79228-4_1 |isbn=978-3-540-79228-4}} In differential privacy, each individual is given roughly the same privacy that would result from having their data removed. That is, the statistical functions run on the database should not be substantially affected by the removal, addition, or change of any individual in the data.

How much any individual contributes to the result of a database query depends in part on how many people's data are involved in the query. If the database contains data from a single person, that person's data contributes 100%. If the database contains data from a hundred people, each person's data contributes just 1%. The key insight of differential privacy is that as the query is made on the data of fewer and fewer people, more noise needs to be added to the query result to produce the same amount of privacy. Hence the name of the 2006 paper, "Calibrating noise to sensitivity in private data analysis."{{Citation needed|date=July 2024}}

= Definition =

Let ε be a positive real number, let δ be a non-negative real number, and let $\mathcal{A}$ be a randomized algorithm that takes a dataset as input (representing the actions of the trusted party holding the data). Let $\textrm{im}\ \mathcal{A}$ denote the image of $\mathcal{A}$.

The algorithm $\mathcal{A}$ is said to provide (ε, δ)-differential privacy if, for all datasets $D_1$ and $D_2$ that differ on a single element (i.e., the data of one person), and all subsets $S$ of $\textrm{im}\ \mathcal{A}$ :

{{center|1=

$\Pr[\mathcal{A}(D_1) \in S] \leq e^\varepsilon \Pr[\mathcal{A}(D_2) \in S] + \delta.$

}}

where the probability is taken over the randomness used by the algorithm.[http://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf The Algorithmic Foundations of Differential Privacy] by Cynthia Dwork and Aaron Roth. Foundations and Trends in Theoretical Computer Science. Vol. 9, no. 3–4, pp. 211‐407, Aug. 2014. {{doi|10.1561/0400000042}} This definition is sometimes called "approximate differential privacy", with "pure differential privacy" being a special case when $\delta = 0$ . In the latter case, the algorithm is commonly said to satisfy ε-differential privacy (i.e., omitting $\delta = 0$ ).{{Citation needed|date=July 2024}}
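For a mechanism with a finite set of possible outputs, the inequality above can be checked directly. The following is a minimal sketch, not a general verifier: the function name `smallest_delta` and the two example output distributions are assumptions made for illustration. It relies on the fact that the worst-case subset $S$ consists of exactly those outputs whose probability under one dataset exceeds $e^\varepsilon$ times its probability under the other.

```python
import math

def smallest_delta(p, q, epsilon):
    """Smallest delta such that Pr_p[S] <= e^eps * Pr_q[S] + delta for every
    subset S, and likewise with p and q swapped.

    p and q map each possible output to its probability under two
    neighboring datasets (hypothetical inputs; finite output space only).
    """
    outcomes = set(p) | set(q)
    scale = math.exp(epsilon)

    def one_sided(a, b):
        # The worst-case S collects every outcome where a exceeds e^eps * b.
        return sum(max(0.0, a.get(o, 0.0) - scale * b.get(o, 0.0))
                   for o in outcomes)

    return max(one_sided(p, q), one_sided(q, p))

# Illustrative output distributions of some mechanism on two neighboring datasets.
p = {"yes": 0.75, "no": 0.25}
q = {"yes": 0.25, "no": 0.75}

for eps in (0.0, 0.5, math.log(3)):
    print(f"epsilon = {eps:.3f}: needs delta >= {smallest_delta(p, q, eps):.4f}")
```

A result of (approximately) zero means that the pair of distributions is consistent with pure ε-differential privacy at that ε. A real mechanism would have to pass this check for every pair of neighboring datasets, not just one.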

Differential privacy offers strong and robust guarantees that facilitate modular design and analysis of differentially private mechanisms due to its composability, robustness to post-processing, and graceful degradation in the presence of correlated data.{{Citation needed|date=July 2024}}

= Example =

According to this definition, differential privacy is a condition on the release mechanism (i.e., the trusted party releasing information about the dataset) and not on the dataset itself. Intuitively, this means that for any two datasets that are similar, a given differentially private algorithm will behave approximately the same on both datasets. The definition gives a strong guarantee that presence or absence of an individual will not affect the final output of the algorithm significantly.

For example, assume we have a database of medical records $D_1$ where each record is a pair (Name, X), where $X$ is a Boolean denoting whether a person has diabetes or not. For example:

{| class="wikitable" style="margin-left: auto; margin-right: auto; border: none;"
! Name !! Has Diabetes (X)
|-
| Ross || 1
|-
| Monica || 1
|-
| Joey || 0
|-
| Phoebe || 0
|-
| Chandler || 1
|-
| Rachel || 0
|}

Now suppose a malicious user (often termed an adversary) wants to find whether Chandler has diabetes or not. Suppose the adversary also knows in which row of the database Chandler resides, and is only allowed to use a particular form of query $Q_i$ that returns the partial sum of the first $i$ rows of column $X$ in the database. In order to find Chandler's diabetes status, the adversary executes $Q_5(D_1)$ and $Q_4(D_1)$ , then computes their difference. In this example, $Q_5(D_1) = 3$ and $Q_4(D_1) = 2$ , so their difference is 1. This indicates that the "Has Diabetes" field in Chandler's row must be 1. This example highlights how individual information can be compromised even without explicitly querying for the information of a specific individual.

Continuing this example, if we construct $D_2$ by replacing (Chandler, 1) with (Chandler, 0) then this malicious adversary will be able to distinguish $D_2$ from $D_1$ by computing $Q_5 - Q_4$ for each dataset. If the adversary were required to receive the values $Q_i$ via an $\varepsilon$ -differentially private algorithm, for a sufficiently small $\varepsilon$ , then the adversary would be unable to distinguish between the two datasets.
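The differencing attack in this example can be sketched in a few lines of Python. The helper `partial_sum` is a hypothetical stand-in for the query $Q_i$, answered exactly and without any noise; it is not part of any real query interface.

```python
# Toy database from the example: (Name, X), where X = 1 means "has diabetes".
D1 = [("Ross", 1), ("Monica", 1), ("Joey", 0),
      ("Phoebe", 0), ("Chandler", 1), ("Rachel", 0)]

def partial_sum(db, i):
    """Q_i: exact sum of column X over the first i rows (hypothetical helper)."""
    return sum(x for _, x in db[:i])

# The adversary knows Chandler is in row 5 and differences two exact answers.
q5 = partial_sum(D1, 5)
q4 = partial_sum(D1, 4)
chandler_bit = q5 - q4

print(q5, q4, chandler_bit)  # 3 2 1

# Replacing (Chandler, 1) with (Chandler, 0) gives the neighboring dataset D2,
# which the same pair of queries tells apart from D1.
D2 = [(name, 0 if name == "Chandler" else x) for name, x in D1]
print(partial_sum(D2, 5) - partial_sum(D2, 4))  # 0
```

Neither query names Chandler, yet the difference of the two exact answers reveals the protected bit.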

= Composability and robustness to post processing =

Composability refers to the fact that the joint distribution of the outputs of (possibly adaptively chosen) differentially private mechanisms satisfies differential privacy.

Sequential composition. If we query an ε-differential privacy mechanism $t$ times, and the randomization of the mechanism is independent for each query, then the result would be $\varepsilon t$ -differentially private. In the more general case, if there are $n$ independent mechanisms: $\mathcal{M}_1,\dots,\mathcal{M}_n$ , whose privacy guarantees are $\varepsilon_1,\dots,\varepsilon_n$ differential privacy, respectively, then any function $g$ of them: $g(\mathcal{M}_1,\dots,\mathcal{M}_n)$ is $\left(\sum\limits_{i=1}^{n} \varepsilon_i\right)$ -differentially private.[http://research.microsoft.com/pubs/80218/sigmod115-mcsherry.pdf Privacy integrated queries: an extensible platform for privacy-preserving data analysis] by Frank D. McSherry. In Proceedings of the 35th SIGMOD International Conference on Management of Data (SIGMOD), 2009. {{doi|10.1145/1559845.1559850}}

Parallel composition. If the previous mechanisms are computed on disjoint subsets of the private database then the function $g$ would be $(\max_i \varepsilon_i)$ -differentially private instead.

The other important property for modular use of differential privacy is robustness to post processing. This is defined to mean that for any deterministic or randomized function $F$ defined over the image of the mechanism $\mathcal{A}$ , if $\mathcal{A}$ satisfies ε-differential privacy, so does $F(\mathcal{A})$ .

The property of composition permits modular construction and analysis of differentially private mechanisms and motivates the concept of the privacy loss budget.{{Citation needed|date=July 2024}} If all elements of a complex mechanism that access sensitive data are separately differentially private, so is their combination, followed by arbitrary post-processing.
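The idea of a privacy loss budget can be illustrated with a small accountant that applies the two composition rules above. This is a minimal sketch for pure ε-differential privacy only; the class `PrivacyBudget` and its method names are hypothetical and do not correspond to any particular library.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon spent against a fixed total (illustrative only)."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def _charge(self, cost):
        # Refuse any release that would push the total past the budget.
        if self.spent + cost > self.total_epsilon:
            raise ValueError("privacy budget exceeded")
        self.spent += cost
        return cost

    def charge_sequential(self, epsilons):
        """Mechanisms run on the same data: the epsilons add up."""
        return self._charge(sum(epsilons))

    def charge_parallel(self, epsilons):
        """Mechanisms run on disjoint subsets: only the largest epsilon counts."""
        return self._charge(max(epsilons))


budget = PrivacyBudget(total_epsilon=1.0)
budget.charge_sequential([0.25, 0.25])   # two queries on the full database
budget.charge_parallel([0.125, 0.25])    # one query per disjoint partition
print(budget.spent)                      # 0.75
```

Post-processing of already released outputs is not charged, since it does not access the sensitive data again.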

= Group privacy =

In general, ε-differential privacy is designed to protect the privacy between neighboring databases which differ only in one row. This means that no adversary with arbitrary auxiliary information can know if one particular participant submitted their information. However, this is also extendable. We may want to protect databases differing in $c$ rows, which amounts to preventing an adversary with arbitrary auxiliary information from knowing whether $c$ particular participants submitted their information. This can be achieved because if $c$ items change, the probability dilation is bounded by $\exp ( \varepsilon c )$ instead of $\exp ( \varepsilon )$ ,[http://research.microsoft.com/pubs/64346/dwork.pdf Differential Privacy] by Cynthia Dwork, International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1–12. {{doi|10.1007/11787006_1}} i.e., for $D_1$ and $D_2$ differing on $c$ items: $\Pr[\mathcal{A}(D_{1})\in S]\leq \exp(\varepsilon c)\cdot\Pr[\mathcal{A}(D_{2})\in S].$ Thus setting ε instead to $\varepsilon/c$ achieves the desired result (protection of $c$ items). In other words, instead of having each item ε-differentially private protected, now every group of $c$ items is ε-differentially private protected (and each item is $(\varepsilon/c)$ -differentially private protected).
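The bound $\exp(\varepsilon c)$ follows from applying the one-row guarantee $c$ times. As a sketch for the pure case ($\delta = 0$), introduce intermediate datasets $D^{(0)}, D^{(1)}, \dots, D^{(c)}$ (notation assumed here for illustration) with $D^{(0)} = D_1$ and $D^{(c)} = D_2$, where each consecutive pair differs in a single element. Then, for every subset $S$:

$\Pr[\mathcal{A}(D^{(0)}) \in S] \leq e^{\varepsilon} \Pr[\mathcal{A}(D^{(1)}) \in S] \leq e^{2\varepsilon} \Pr[\mathcal{A}(D^{(2)}) \in S] \leq \dots \leq e^{c\varepsilon} \Pr[\mathcal{A}(D^{(c)}) \in S].$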

= Hypothesis testing interpretation =

One can think of differential privacy as bounding the error rates in a hypothesis test. Consider two hypotheses:

$H_0$ : The individual's data is not in the dataset.

$H_1$ : The individual's data is in the dataset.

Then, there are two error rates:

False Positive Rate (FPR): $P_\text{FP} = \Pr[\text{Adversary guesses } H_1 \mid H_0 \text{ is true}].$

False Negative Rate (FNR): $P_\text{FN} = \Pr[\text{Adversary guesses } H_0 \mid H_1 \text{ is true}].$

Ideal protection would imply that both error rates are equal, but for a fixed (ε, δ) setting, an attacker can achieve the following rates:Kairouz, Peter, Sewoong Oh, and Pramod Viswanath. "The composition theorem for differential privacy." International conference on machine learning. PMLR, 2015.[https://proceedings.mlr.press/v37/kairouz15.pdf link]

$\{(P_\text{FP}, P_\text{FN}) \mid P_\text{FP} + e^\varepsilon P_\text{FN} \geq 1 - \delta, \ e^\varepsilon P_\text{FP} + P_\text{FN} \geq 1 - \delta \}$
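The two inequalities above can be evaluated numerically to see which attacker error rates are compatible with a given (ε, δ). The following is a minimal sketch; the function names `rates_allowed` and `min_symmetric_error` are hypothetical, and the second function simply solves $x + e^\varepsilon x \geq 1 - \delta$ for the case where both error rates equal $x$.

```python
import math

def rates_allowed(fpr, fnr, epsilon, delta=0.0):
    """True if the pair (FPR, FNR) lies in the region defined by the two inequalities."""
    e_eps = math.exp(epsilon)
    bound = 1.0 - delta
    return (fpr + e_eps * fnr >= bound) and (e_eps * fpr + fnr >= bound)

def min_symmetric_error(epsilon, delta=0.0):
    """Smallest x such that (x, x) lies in the region, i.e. x * (1 + e^eps) >= 1 - delta."""
    return (1.0 - delta) / (1.0 + math.exp(epsilon))

# With epsilon = 1, an attacker cannot have both error rates as low as 5%.
print(rates_allowed(0.05, 0.05, epsilon=1.0))   # False
print(rates_allowed(0.50, 0.50, epsilon=1.0))   # True
print(min_symmetric_error(epsilon=0.0))         # 0.5
```

At ε = 0 and δ = 0 the attacker can do no better than a coin flip, and the admissible error rates shrink toward zero as ε grows.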

ε-differentially private mechanisms

Since differential privacy is a probabilistic concept, any differentially private mechanism is necessarily randomized. Some of these, like the Laplace mechanism, described below, rely on adding controlled noise to the function that we want to compute. Others, like the exponential mechanism[http://research.microsoft.com/pubs/65075/mdviadp.pdf F. McSherry and K. Talwar. Mechanism Design via Differential Privacy. Proceedings of the 48th Annual Symposium of Foundations of Computer Science, 2007.] and posterior sampling[https://arxiv.org/abs/1306.1066 Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, Benjamin Rubinstein. Robust and Private Bayesian Inference. Algorithmic Learning Theory 2014] sample from a problem-dependent family of distributions instead.

An important definition with respect to ε-differentially private mechanisms is sensitivity. Let $d$ be a positive integer, $\mathcal{D}$ be a collection of datasets, and $f \colon \mathcal{D} \rightarrow \mathbb{R}^d$ be a function. One definition of the sensitivity of a function, denoted $\Delta f$ , is: $\Delta f=\max \lVert f(D_1)-f(D_2) \rVert_1,$ where the maximum is over all pairs of datasets $D_1$ and $D_2$ in $\mathcal{D}$ differing in at most one element and $\lVert \cdot \rVert_1$ denotes the L1 norm. In the example of the medical database above, if we consider $f$ to be the function $Q_i$ , then the sensitivity of the function is one, since changing any one of the entries in the database causes the output of the function to change by either zero or one. This can be generalized to other metric spaces (measures of distance), and must be to make certain differentially private algorithms work, including adding noise from the Gaussian distribution (which requires the L2 norm) instead of the Laplace distribution.
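For small finite domains, the sensitivity can be confirmed by brute force. The sketch below enumerates every Boolean database with six rows and every single-entry change, and evaluates the partial-sum query $Q_5$ from the medical example. The function names `prefix_count` and `l1_sensitivity` are hypothetical, and exhaustive enumeration is only feasible for toy sizes.

```python
from itertools import product

def prefix_count(bits, i):
    """Q_i on a database reduced to its Boolean column X."""
    return sum(bits[:i])

def l1_sensitivity(query, n_rows):
    """Max |query(D1) - query(D2)| over all Boolean databases differing in one entry."""
    worst = 0
    for db in product((0, 1), repeat=n_rows):
        for row in range(n_rows):
            neighbor = list(db)
            neighbor[row] = 1 - neighbor[row]   # change exactly one entry
            worst = max(worst, abs(query(db) - query(neighbor)))
    return worst

sensitivity_q5 = l1_sensitivity(lambda db: prefix_count(db, 5), n_rows=6)
print(sensitivity_q5)  # 1
```

Because the query returns a single number, the L1 norm reduces to an absolute value here.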

The techniques described below create a differentially private algorithm for a function, with parameters that depend on the function's sensitivity.

= Laplace mechanism =

{{See also|Additive noise mechanisms}}

[Figure: Laplace mechanism]

The Laplace mechanism adds Laplace noise, i.e. noise drawn from the Laplace distribution, whose probability density function is $\text{noise}(y)\propto \exp(-|y|/\lambda)$ and which has mean zero and standard deviation $\sqrt{2} \lambda$. Define the output of $\mathcal{A}$ as a real-valued function (called the transcript output by $\mathcal{A}$), $\mathcal{T}_{\mathcal{A}}(x)=f(x)+Y$, where $Y \sim \text{Lap}(\lambda)$ and $f$ is the original real-valued query/function that we planned to execute on the database. Then $\mathcal{T}_{\mathcal{A}}(x)$ can be considered to be a continuous random variable, where

: $\frac{\mathrm{pdf}(\mathcal{T}_{\mathcal{A},D_1}(x)=t)}{\mathrm{pdf}(\mathcal{T}_{\mathcal{A},D_2}(x)=t)}=\frac{\text{noise}(t-f(D_1))}{\text{noise}(t-f(D_2))}$

which is at most $e^{\frac{|f(D_{1})-f(D_{2})|}{\lambda}}\leq e^{\frac{\Delta(f)}{\lambda}}$. We can consider $\frac{\Delta(f)}{\lambda}$ to be the privacy factor $\varepsilon$. Thus $\mathcal{T}$ follows a differentially private mechanism (as can be seen from the definition above). Applying this to the diabetes example, where the sensitivity is one, it follows that in order for $\mathcal{A}$ to be an $\varepsilon$-differentially private algorithm we need $\lambda=1/\varepsilon$. Though we have used Laplace noise here, other forms of noise, such as Gaussian noise, can be employed, but they may require a slight relaxation of the definition of differential privacy.
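The mechanism can be sketched with the Python standard library alone. The function names below are hypothetical, and the sampler uses the fact that the difference of two independent exponential variables with mean $\lambda$ follows a Laplace distribution with scale $\lambda$. This is an illustration of the mathematics only: as the section on attacks in practice notes, naive floating-point samplers can leak information, so it should not be treated as a safe implementation.

```python
import random

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return true_value + Lap(lambda) with lambda = sensitivity / epsilon.

    Illustrative only: floating-point sampling like this is not side-channel safe.
    """
    scale = sensitivity / epsilon
    return true_value + laplace_noise(scale, rng)

# Noisy answer to Q_5 on the diabetes example (true answer 3, sensitivity 1).
rng = random.Random(2024)   # seeded only to make the sketch reproducible
noisy_q5 = laplace_mechanism(3.0, sensitivity=1.0, epsilon=0.5, rng=rng)
print(noisy_q5)
```

With ε = 0.5 the noise scale is λ = 2, so a single noisy answer to $Q_5$ or $Q_4$ says little about any one row, while averages over many independent rows remain informative.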

= Randomized response =

A simple example, especially developed in the social sciences,{{cite journal |last=Warner |first=S. L. |date=March 1965 |title=Randomised response: a survey technique for eliminating evasive answer bias |jstor=2283137 |journal=Journal of the American Statistical Association |publisher=Taylor & Francis |volume=60 |issue=309 |pages=63–69 |doi=10.1080/01621459.1965.10480775 |pmid=12261830 |s2cid=35435339}} is to ask a person to answer the question "Do you own the attribute A?", according to the following procedure:

Toss a coin.
If heads, then toss the coin again (ignoring the outcome), and answer the question honestly.
If tails, then toss the coin again and answer "Yes" if heads, "No" if tails.

(The seemingly redundant extra toss in the first case is needed in situations where just the act of tossing a coin may be observed by others, even if the actual result stays hidden.) The confidentiality then arises from the refutability of the individual responses.

Overall, however, data collected from many responses are still informative, since a person who does not have the attribute A answers "Yes" with probability one quarter, while a person who does have it answers "Yes" with probability three quarters. Thus, if p is the true proportion of people with A, then we expect a fraction (1/4)(1-p) + (3/4)p = (1/4) + p/2 of the responses to be positive. Hence it is possible to estimate p.

In particular, if the attribute A is synonymous with illegal behavior, then answering "Yes" is not incriminating, since every respondent has a probability of at least one quarter of giving a "Yes" response, whatever their true answer may be.
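The procedure and the resulting estimator can be simulated as follows. This is a minimal sketch with hypothetical function names and a synthetic population in which 30% of people have the attribute. The estimator inverts the expected "Yes" fraction 1/4 + p/2.

```python
import random

def randomized_response(truth, rng):
    """One respondent following the coin-toss procedure described above."""
    if rng.random() < 0.5:          # first toss is heads: answer honestly
        return truth
    return rng.random() < 0.5       # first toss is tails: second toss decides

def estimate_proportion(responses):
    """Invert E[yes fraction] = 1/4 + p/2 to recover an estimate of p."""
    yes_fraction = sum(responses) / len(responses)
    return 2.0 * (yes_fraction - 0.25)

rng = random.Random(7)              # seeded only for reproducibility
population = [i < 3000 for i in range(10000)]   # synthetic: 30% have attribute A
responses = [randomized_response(t, rng) for t in population]
print(estimate_proportion(responses))
```

Under these coin probabilities, any given answer is at most (3/4)/(1/4) = 3 times more likely for one true value than for the other, so each individual response satisfies ε-differential privacy with ε = ln 3.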

Although this example, inspired by randomized response, might be applicable to microdata (i.e., releasing datasets with each individual response), differential privacy by definition excludes microdata releases and applies only to queries (i.e., aggregating individual responses into one result), because a microdata release would violate the requirements, more specifically the plausible deniability of whether a subject participated or not.Dwork, Cynthia. "A firm foundation for private data analysis." Communications of the ACM 54.1 (2011): 86–95, supra note 19, page 91.Bambauer, Jane, Krishnamurty Muralidhar, and Rathindra Sarathy. "Fool's gold: an illustrated critique of differential privacy." Vand. J. Ent. & Tech. L. 16 (2013): 701.

= Stable transformations =

A transformation $T$ is $c$ -stable if the Hamming distance between $T(A)$ and $T(B)$ is at most $c$ -times the Hamming distance between $A$ and $B$ for any two databases $A,B$ .{{Citation needed|date=July 2024}} If there is a mechanism $M$ that is $\varepsilon$ -differentially private, then the composite mechanism $M\circ T$ is $(\varepsilon \times c)$ -differentially private.

This could be generalized to group privacy, as the group size could be thought of as the Hamming distance $h$ between $A$ and $B$ (where $A$ contains the group and $B$ does not). In this case $M\circ T$ is $(\varepsilon \times c \times h)$ -differentially private.{{Citation needed|date=July 2024}}

Research

= Early research leading to differential privacy =

In 1977, Tore Dalenius formalized the mathematics of cell suppression.{{cite journal |author=Tore Dalenius |year=1977 |title=Towards a methodology for statistical disclosure control |url=https://hdl.handle.net/1813/111303 |journal=Statistik Tidskrift |volume=15|hdl=1813/111303 }} Tore Dalenius was a Swedish statistician who contributed to statistical privacy through his 1977 paper that revealed a key point about statistical databases, which was that databases should not reveal information about an individual that is not otherwise accessible.{{Cite book |last=Dwork |first=Cynthia |date=2006 |editor-last=Bugliesi |editor-first=Michele |editor2-last=Preneel |editor2-first=Bart |editor3-last=Sassone |editor3-first=Vladimiro |editor4-last=Wegener |editor4-first=Ingo |chapter=Differential Privacy |chapter-url=https://link.springer.com/chapter/10.1007/11787006_1 |title=Automata, Languages and Programming |series=Lecture Notes in Computer Science |volume=4052 |language=en |location=Berlin, Heidelberg |publisher=Springer |pages=1–12 |doi=10.1007/11787006_1 |isbn=978-3-540-35908-1}} He also defined a typology for statistical disclosures.

In 1979, Dorothy Denning, Peter J. Denning and Mayer D. Schwartz formalized the concept of a Tracker, an adversary that could learn the confidential contents of a statistical database by creating a series of targeted queries and remembering the results.{{cite journal |author=Dorothy E. Denning |author2=Peter J. Denning |author3=Mayer D. Schwartz |date=March 1979 |title=The Tracker: A Threat to Statistical Database Security |url=https://dl.acm.org/doi/pdf/10.1145/320064.320069 |journal=ACM Transactions on Database Systems |volume=4 |pages=76–96 |doi=10.1145/320064.320069 |s2cid=207655625 |number=1|url-access=subscription }} This and future research showed that privacy properties in a database could only be preserved by considering each new query in light of (possibly all) previous queries. This line of work is sometimes called query privacy, with the final result being that tracking the impact of a query on the privacy of individuals in the database was NP-hard.{{Citation needed|date=July 2024}}

= 21st century =

In 2003, Kobbi Nissim and Irit Dinur demonstrated that it is impossible to publish arbitrary queries on a private statistical database without revealing some amount of private information, and that the entire information content of the database can be revealed by publishing the results of a surprisingly small number of random queries—far fewer than was implied by previous work.Irit Dinur and Kobbi Nissim. 2003. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS '03). ACM, New York, NY, USA, 202–210. {{doi|10.1145/773153.773173}} The general phenomenon is known as the Fundamental Law of Information Recovery, and its key insight, namely that in the most general case, privacy cannot be protected without injecting some amount of noise, led to development of differential privacy.{{Citation needed|date=July 2024|reason=Source needed to place it in this context}}

In 2006, Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam D. Smith published an article formalizing the amount of noise that needed to be added and proposing a generalized mechanism for doing so.{{Citation needed|date=July 2024|reason=Source needed to place it in this context}} This paper also created the first formal definition of differential privacy. Their work was a co-recipient of the 2016 TCC Test-of-Time Award{{cite web |title=TCC Test-of-Time Award |url=https://www.iacr.org/workshops/tcc/awards.html}} and the 2017 Gödel Prize.{{cite web |title=2017 Gödel Prize |work=EATCS |url=https://www.eatcs.org/index.php/component/content/article/1-news/2450-2017-godel-prize |last1=Chita |first1=Efi }}

Since then, subsequent research has shown that there are many ways to produce very accurate statistics from the database while still ensuring high levels of privacy.

Adoption in real-world applications

To date there are over 12 real-world deployments of differential privacy, the most noteworthy being:

2008: U.S. Census Bureau, for showing commuting patterns.Ashwin Machanavajjhala, Daniel Kifer, John M. Abowd, Johannes Gehrke, and Lars Vilhuber. "Privacy: Theory meets Practice on the Map". In Proceedings of the 24th International Conference on Data Engineering (ICDE), 2008.
2014: Google's RAPPOR, for telemetry such as learning statistics about unwanted software hijacking users' settings.{{cite book | chapter-url=https://dl.acm.org/doi/10.1145/2660267.2660348 | doi=10.1145/2660267.2660348 | chapter=RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response | title=Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security | date=2014 | last1=Erlingsson | first1=Úlfar | last2=Pihur | first2=Vasyl | last3=Korolova | first3=Aleksandra | pages=1054–1067 | arxiv=1407.6981 | isbn=978-1-4503-2957-6 }}{{Citation |title=google/rappor |date=2021-07-15 |url=https://github.com/google/rappor |publisher=GitHub}}
2015: Google, for sharing historical traffic statistics.[https://europe.googleblog.com/2015/11/tackling-urban-mobility-with-technology.html Tackling Urban Mobility with Technology] by Andrew Eland. Google Policy Europe Blog, Nov 18, 2015.
2016: Apple iOS 10, for use in intelligent personal assistant technology.{{cite web |title=Apple – Press Info – Apple Previews iOS 10, the Biggest iOS Release Ever |url=https://www.apple.com/newsroom/2016/06/apple-previews-ios-10-biggest-ios-release-ever/ |access-date=20 June 2023 |website=Apple}}
2017: Microsoft, for telemetry in Windows.[https://www.microsoft.com/en-us/research/publication/collecting-telemetry-data-privately/ Collecting telemetry data privately] by Bolin Ding, Jana Kulkarni, Sergey Yekhanin. NIPS 2017.
2020: [https://socialscience.one Social Science One] and Facebook, a 55 trillion cell dataset for researchers to learn about elections and democracy.{{Citation |last1=Messing |first1=Solomon |title=Facebook Privacy-Protected Full URLs Data Set |date=2020 |url=https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/TDOAPG |access-date=2023-02-08 |others=Zagreb Mukerjee |publisher=Harvard Dataverse |doi=10.7910/dvn/tdoapg |last2=DeGregorio |first2=Christina |last3=Hillenbrand |first3=Bennett |last4=King |first4=Gary |last5=Mahanti |first5=Saurav |last6=Mukerjee |first6=Zagreb |last7=Nayak |first7=Chaya |last8=Persily |first8=Nate |last9=State |first9=Bogdan|chapter=Social Sciences }}{{Cite journal |last1=Evans |first1=Georgina |last2=King |first2=Gary |date=January 2023 |title=Statistically Valid Inferences from Differentially Private Data Releases, with Application to the Facebook URLs Dataset |url=https://www.cambridge.org/core/product/identifier/S1047198722000018/type/journal_article |journal=Political Analysis |language=en |volume=31 |issue=1 |pages=1–21 |doi=10.1017/pan.2022.1 |issn=1047-1987 |s2cid=211137209|url-access=subscription }}
2021: The US Census Bureau uses differential privacy to release redistricting data from the 2020 Census.{{cite web |date=2 November 2021 |title=Disclosure Avoidance for the 2020 Census: An Introduction |url=https://www.census.gov/library/publications/2021/decennial/2020-census-disclosure-avoidance-handbook.html}}

Public purpose considerations

There are several public purpose considerations regarding differential privacy that are important to consider, especially for policymakers and policy-focused audiences interested in the social opportunities and risks of the technology:{{Cite web |title=Technology Factsheet: Differential Privacy |url=https://www.belfercenter.org/publication/technology-factsheet-differential-privacy |access-date=2021-04-12 |website=Belfer Center for Science and International Affairs |language=en}}

Data utility and accuracy. The main concern with differential privacy is the trade-off between data utility and individual privacy. If the privacy loss parameter is set to favor utility, the privacy benefits are lowered (less “noise” is injected into the system); if the privacy loss parameter is set to favor heavy privacy, the accuracy and utility of the dataset are lowered (more “noise” is injected into the system). It is important for policymakers to consider the trade-offs posed by differential privacy in order to help set appropriate best practices and standards around the use of this privacy preserving practice, especially considering the diversity in organizational use cases. It is worth noting, though, that decreased accuracy and utility is a common issue among all statistical disclosure limitation methods and is not unique to differential privacy. What is unique, however, is how policymakers, researchers, and implementers can consider mitigating against the risks presented through this trade-off.
Data privacy and security. Differential privacy provides a quantified measure of privacy loss and an upper bound and allows curators to choose the explicit trade-off between privacy and accuracy. It is robust to still unknown privacy attacks. However, it encourages greater data sharing, which if done poorly, increases privacy risk. Differential privacy implies that privacy is protected, but this depends very much on the privacy loss parameter chosen and may instead lead to a false sense of security. Finally, though it is robust against unforeseen future privacy attacks, a countermeasure may be devised that we cannot predict.

Attacks in practice

Because differential privacy techniques are implemented on real computers, they are vulnerable to various attacks not possible to compensate for solely in the mathematics of the techniques themselves. In addition to standard defects of software artifacts that can be identified using testing or fuzzing, implementations of differentially private mechanisms may suffer from the following vulnerabilities:

Subtle algorithmic or analytical mistakes.{{cite web |last1=McSherry |first1=Frank |date=25 February 2018 |title=Uber's differential privacy .. probably isn't |url=https://github.com/frankmcsherry/blog/blob/master/posts/2018-02-25.md |website=GitHub}}{{cite journal |last1=Lyu |first1=Min |last2=Su |first2=Dong |last3=Li |first3=Ninghui |date=1 February 2017 |title=Understanding the sparse vector technique for differential privacy |journal=Proceedings of the VLDB Endowment |volume=10 |issue=6 |pages=637–648 |arxiv=1603.01699 |doi=10.14778/3055330.3055331 |s2cid=5449336}}
Timing side-channel attacks.{{cite journal |last1=Haeberlen |first1=Andreas |last2=Pierce |first2=Benjamin C. |last3=Narayan |first3=Arjun |date=2011 |title=Differential Privacy Under Fire |journal=20th USENIX Security Symposium}} In contrast with timing attacks against implementations of cryptographic algorithms that typically have low leakage rate and must be followed with non-trivial cryptanalysis, a timing channel may lead to a catastrophic compromise of a differentially private system, since a targeted attack can be used to exfiltrate the very bit that the system is designed to hide.
Leakage through floating-point arithmetic.{{cite book |last1=Mironov |first1=Ilya |url=https://www.microsoft.com/en-us/research/wp-content/uploads/2012/10/lsbs.pdf |title=Proceedings of the 2012 ACM conference on Computer and communications security |date=October 2012 |publisher=ACM |isbn=9781450316514 |pages=650–661 |chapter=On significance of the least significant bits for differential privacy |doi=10.1145/2382196.2382264 |s2cid=3421585}} Differentially private algorithms are typically presented in the language of probability distributions, which most naturally lead to implementations using floating-point arithmetic. The abstraction of floating-point arithmetic is leaky, and without careful attention to details, a naive implementation may fail to provide differential privacy. (This is particularly the case for ε-differential privacy, which does not allow any probability of failure, even in the worst case.) For example, the support of a textbook sampler of the Laplace distribution (required, for instance, for the Laplace mechanism) is less than 80% of all double-precision floating point numbers; moreover, the support for distributions with different means are not identical. A single sample from a naïve implementation of the Laplace mechanism allows distinguishing between two adjacent datasets with probability more than 35%.
Timing channel through floating-point arithmetic.{{cite book |last1=Andrysco |first1=Marc |title=2015 IEEE Symposium on Security and Privacy |last2=Kohlbrenner |first2=David |last3=Mowery |first3=Keaton |last4=Jhala |first4=Ranjit |last5=Lerner |first5=Sorin |last6=Shacham |first6=Hovav |date=May 2015 |isbn=978-1-4673-6949-7 |pages=623–639 |chapter=On Subnormal Floating Point and Abnormal Timing |doi=10.1109/SP.2015.44 |s2cid=1903469}} Unlike operations over integers that are typically constant-time on modern CPUs, floating-point arithmetic exhibits significant input-dependent timing variability.{{cite journal |last1=Kohlbrenner |first1=David |last2=Shacham |first2=Hovav |date=August 2017 |title=On the Effectiveness of Mitigations Against Floating-point Timing Channels |journal=Proceedings of the 26th USENIX Conference on Security Symposium |publisher=USENIX Association |pages=69–81}} Handling of subnormals can be particularly slow, by as much as a factor of 100 compared with the typical case.{{cite journal |last1=Dooley |first1=Isaac |last2=Kale |first2=Laxmikant |date=September 2006 |title=Quantifying the interference caused by subnormal floating-point values |url=https://charm.cs.illinois.edu/newPapers/06-13/paper.pdf |journal=Proceedings of the Workshop on Operating System Interference in High Performance Applications}}
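The floating-point leakage item above can be illustrated with a toy sketch. It is not the attack from the cited paper, which targets double precision. To keep the output set small enough to enumerate, it assumes half-precision arithmetic and a uniform generator that returns multiples of 2^-10. The helper names `to_half` and `textbook_laplace_support` are hypothetical and chosen only for this illustration. The sketch lists every value an inverse-CDF ("textbook") Laplace sampler can emit for two different means and compares the two sets.

```python
import math
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def textbook_laplace_support(mean: float, scale: float = 1.0, bits: int = 10) -> set:
    """Enumerate every output of a toy inverse-CDF Laplace sampler.

    Assumption: the uniform input lies on the grid k / 2**bits, loosely
    mimicking generators that return multiples of 2**-53 in double precision.
    """
    outputs = set()
    for k in range(1, 2 ** bits):
        u = k / 2 ** bits
        # Noise magnitude -scale * ln(u), rounded to half precision.
        magnitude = to_half(-scale * math.log(u))
        for sign in (1.0, -1.0):
            # The released value is the true answer plus noise, rounded again.
            outputs.add(to_half(mean + sign * magnitude))
    return outputs


support_0 = textbook_laplace_support(mean=0.0)
support_1 = textbook_laplace_support(mean=1.0)

# Values reachable under one true answer but not under the other.
only_0 = support_0 - support_1
only_1 = support_1 - support_0

print("outputs possible only when the true answer is 0:", len(only_0))
print("outputs possible only when the true answer is 1:", len(only_1))
```

In exact arithmetic both distributions are supported on the whole real line. Here each set contains values the other cannot produce, so observing such a value identifies the true answer with certainty, which pure ε-differential privacy does not permit. The counts depend on the assumed grid and precision and should not be read as estimates for double-precision implementations.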

References

= Publications =

[https://people.csail.mit.edu/asmith/PS/sensitivity-tcc-final.pdf Calibrating noise to sensitivity in private data analysis], Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. In Proceedings of the Third conference on Theory of Cryptography (TCC'06). Springer-Verlag, Berlin, Heidelberg, 265–284. https://doi.org/10.1007/11681878_14 (This is the original publication of Differential Privacy, and not the eponymous article by Dwork that was published the same year.)
[http://research.microsoft.com/apps/pubs/default.aspx?id=74339 Differential Privacy: A Survey of Results] by Cynthia Dwork, Microsoft Research, April 2008 (Presents what was discovered during the first two years of research on differential privacy.)
[https://scholarship.law.vanderbilt.edu/cgi/viewcontent.cgi?article=1058&context=jetlaw Differential Privacy: A Primer for a Non-Technical Audience], Alexandra Wood, Micah Altman, Aaron Bembenek, Mark Bun, Marco Gaboardi, et al., Vanderbilt Journal of Entertainment & Technology Law, Volume 21, Issue 1, Fall 2018. (A good introductory document, but definitely *not* for non-technical audiences!)
[https://www.belfercenter.org/publication/technology-factsheet-differential-privacy Technology Factsheet: Differential Privacy] by Raina Gandhi and Amritha Jayanti, Belfer Center for Science and International Affairs, Fall 2020
[https://mit-serc.pubpub.org/pub/differential-privacy-2020-us-census/release/1 Differential Privacy and the 2020 US Census], MIT Case Studies in Social and Ethical Responsibilities of Computing, no. Winter 2022 (January). https://doi.org/10.21428/2c646de5.7ec6ab93.
{{cite book|title=Differential Privacy|last1=Garfinkel|first1=Simson|publisher=MIT Press|series=MIT Press Essential Knowledge|year=2025|url=https://mitpress.mit.edu/9780262551656/differential-privacy/|isbn=9780262551656|doi=10.7551/mitpress/15354.001.0001}} {{open access}}
Bowen, Claire McKay and Simson Garfinkel, [https://www.ams.org/journals/notices/202110/rnoti-p1727.pdf The Philosophy of Differential Privacy], AMS Notices, November 2021.

= Tutorials =

[http://www.cerias.purdue.edu/news_and_events/events/security_seminar/details/index/j9cvs3as2h1qds1jrdqfdc3hu8 A Practical Beginner's Guide To Differential Privacy] by Christine Task, Purdue University, April 2012
