Data anonymization
{{Short description|Type of information sanitization}}
{{Redirect|Anonymization|anonymity on the Internet|Anonymity#Anonymity on the Internet}}
{{Distinguish|Data cleansing}}
Data anonymization is a type of information sanitization whose intent is privacy protection. It is the process of removing personally identifiable information from data sets, so that the people whom the data describe remain anonymous.
Overview
Data anonymization has been defined as a "process by which personal data is altered in such a way that a data subject can no longer be identified directly or indirectly, either by the data controller alone or in collaboration with any other party."{{cite book|title=ISO 25237:2017 Health informatics -- Pseudonymization|year=2017|publisher=ISO|page=7|url=https://www.iso.org/standard/63553.html}} Data anonymization may enable the transfer of information across a boundary, such as between two departments within an agency or between two agencies, while reducing the risk of unintended disclosure. In certain environments it can do so in a manner that still permits evaluation and analytics after anonymization.
In the context of medical data, anonymized data refers to data from which the patient cannot be identified by the recipient of the information. The name, address, and full postcode must be removed, together with any other information which, in conjunction with other data held by or disclosed to the recipient, could identify the patient.{{cite web|title=Data anonymization|url=http://medical-dictionary.thefreedictionary.com/Anonymized+Data|work=The Free Medical Dictionary|access-date=17 January 2014}}
There is always a risk that anonymized data will not stay anonymous over time. Pairing the anonymized data set with other data, clever techniques, and raw computing power are some of the ways in which previously anonymous data sets have been de-anonymized, so that the data subjects are no longer anonymous.
De-anonymization is the reverse process in which anonymous data is cross-referenced with other data sources to re-identify the anonymous data source.{{cite web|title=De-anonymization|url=http://whatis.techtarget.com/definition/de-anonymization-deanonymization|publisher=Whatis.com|access-date=17 January 2014}}
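The cross-referencing described above can be illustrated with a minimal sketch. All records, names, and field choices below are invented for illustration, and the function <code>link_records</code> is a hypothetical helper rather than part of any standard tool. It joins a release that has had names removed to a second data set on shared attributes and reports only the matches that are unique.

```python
# Toy "anonymized" release: names removed, but other attributes retained.
# All values are invented for illustration.
anonymized_records = [
    {"zip": "02138", "birth_year": 1945, "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical second data source that still carries names.
public_register = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1945, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "zip": "02140", "birth_year": 1990, "sex": "F"},
]

# Attributes present in both data sets (an assumption of this sketch).
SHARED_KEYS = ("zip", "birth_year", "sex")


def link_records(released, register, keys=SHARED_KEYS):
    """Return {name: released record} for people matching exactly one record."""
    index = {}
    for row in released:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    reidentified = {}
    for person in register:
        matches = index.get(tuple(person[k] for k in keys), [])
        if len(matches) == 1:  # an ambiguous match does not single anyone out
            reidentified[person["name"]] = matches[0]
    return reidentified


reidentified = link_records(anonymized_records, public_register)
for name, record in reidentified.items():
    print(name, "->", record["diagnosis"])
```

In this toy data only one person is re-identified: the second register entry matches two released records and the third matches none. The more records share the same combination of attributes, the fewer unique matches such a join produces.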
Generalization and perturbation are two popular anonymization approaches for relational data.{{cite journal|last=Bin Zhou|author2=Jian Pei |author3=WoShun Luk |title=A brief survey on anonymization techniques for privacy preserving publishing of social network data|journal=Newsletter ACM SIGKDD Explorations Newsletter|date=December 2008|volume=10|issue=2|pages=12–22|doi=10.1145/1540276.1540279 |s2cid=609178 |url=https://www.cs.sfu.ca/~jpei/publications/SocialNetworkAnonymization_survey.pdf}} The process of obscuring data while retaining the ability to re-identify it later is called pseudonymization, and it is one way in which companies can store data in a HIPAA-compliant manner.
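As a rough illustration of pseudonymization, the following sketch replaces an identifier with a keyed hash (HMAC-SHA-256) and keeps a separate lookup table so that an authorized party can re-identify records later. The class name <code>Pseudonymizer</code>, its methods, and the sample identifiers are assumptions made for this example; it is not a compliance recipe, and a real deployment would need to protect the key and the lookup table separately from the released data.

```python
import hashlib
import hmac
import secrets


class Pseudonymizer:
    """Hypothetical helper: maps identifiers to stable pseudonyms."""

    def __init__(self, key: bytes):
        self._key = key      # secret; must be stored apart from released data
        self._lookup = {}    # pseudonym -> original identifier

    def pseudonymize(self, identifier: str) -> str:
        # Keyed hash: without the key, candidate identifiers cannot be
        # hashed and compared against the released pseudonyms.
        pseudonym = hmac.new(
            self._key, identifier.encode("utf-8"), hashlib.sha256
        ).hexdigest()
        self._lookup[pseudonym] = identifier
        return pseudonym

    def reidentify(self, pseudonym: str) -> str:
        # Only the holder of the lookup table can reverse the mapping.
        return self._lookup[pseudonym]


pseudonymizer = Pseudonymizer(secrets.token_bytes(32))
token = pseudonymizer.pseudonymize("patient-0001")
print(token)
print(pseudonymizer.reidentify(token))
```

Because the same identifier always yields the same pseudonym under a given key, records belonging to one person can still be linked for analysis, which is also why pseudonymized data is not the same as anonymized data.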
However, the Article 29 Data Protection Working Party has stated that the reference to anonymisation in Recital 26 of Directive 95/46/EC "signifies that to anonymise any data, the data must be stripped of sufficient elements such that the data subject can no longer be identified. More precisely, that data must be processed in such a way that it can no longer be used to identify a natural person by using “all the means likely reasonably to be used” by either the controller or a third party. An important factor is that the processing must be irreversible. The Directive does not clarify how such a de-identification process should or could be performed. The focus is on the outcome: that data should be such as not to allow the data subject to be identified via “all” “likely” and “reasonable” means. Reference is made to codes of conduct as a tool to set out possible anonymisation mechanisms as well as retention in a form in which identification of the data subject is “no longer possible”."{{cite web| title=Opinion 05/2014 on Anonymisation Techniques| url=https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf| publisher=EU Commission| date=10 April 2014| access-date=31 December 2023}}
There are five types of data anonymization operations: generalization, suppression, anatomization, permutation, and perturbation.{{Cite journal|last1=Eyupoglu|first1=Can|last2=Aydin|first2=Muhammed|last3=Zaim|first3=Abdul|last4=Sertbas|first4=Ahmet|date=2018-05-17|title=An Efficient Big Data Anonymization Algorithm Based on Chaos and Perturbation Techniques|journal=Entropy|volume=20|issue=5|pages=373|doi=10.3390/e20050373|pmid=33265463|pmc=7512893|bibcode=2018Entrp..20..373E|issn=1099-4300|doi-access=free}} Text was copied from this source, which is available under a [https://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International License].
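The five operations can be sketched on a toy table. The interpretations used here are common textbook readings and are assumptions of this example, not definitions taken from the cited source: suppression drops a field, generalization coarsens a value, perturbation adds random noise, permutation shuffles a sensitive value within small groups, and anatomization releases the sensitive attribute in a separate table linked only by a group number. All records, field names, and group sizes are invented.

```python
import random

records = [
    {"name": "Alice Example", "age": 34, "zip": "02138", "salary": 52000},
    {"name": "Bob Example", "age": 37, "zip": "02139", "salary": 61000},
    {"name": "Carol Example", "age": 52, "zip": "02141", "salary": 75000},
    {"name": "Dan Example", "age": 58, "zip": "02142", "salary": 48000},
]


def suppress(row, fields=("name",)):
    """Suppression: remove the listed fields entirely."""
    return {k: v for k, v in row.items() if k not in fields}


def generalize(row):
    """Generalization: age becomes a ten-year band, ZIP keeps three digits."""
    out = dict(row)
    low = (row["age"] // 10) * 10
    out["age"] = f"{low}-{low + 9}"
    out["zip"] = row["zip"][:3] + "**"
    return out


def perturb(row, rng, scale=1000):
    """Perturbation: add bounded random noise to the salary."""
    out = dict(row)
    out["salary"] = row["salary"] + rng.randint(-scale, scale)
    return out


def permute(rows, rng, field="salary", group_size=2):
    """Permutation: shuffle one field's values within consecutive groups."""
    out = [dict(r) for r in rows]
    for start in range(0, len(out), group_size):
        group = out[start:start + group_size]
        values = [r[field] for r in group]
        rng.shuffle(values)
        for r, v in zip(group, values):
            r[field] = v
    return out


def anatomize(rows, sensitive="salary", group_size=2):
    """Anatomization: split into two tables joined only by a group number."""
    qi_table, sensitive_table = [], []
    for i, row in enumerate(rows):
        gid = i // group_size
        qi = {k: v for k, v in row.items() if k != sensitive}
        qi["group"] = gid
        qi_table.append(qi)
        sensitive_table.append({"group": gid, sensitive: row[sensitive]})
    # Sort so that row order does not reveal which value belongs to whom.
    sensitive_table.sort(key=lambda r: (r["group"], r[sensitive]))
    return qi_table, sensitive_table


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
released = [generalize(suppress(r)) for r in records]
noisy = [perturb(r, rng) for r in released]
shuffled = permute(released, rng)
qi_table, sensitive_table = anatomize(released)
```

The operations are often combined, as in <code>released</code> above, where direct identifiers are suppressed before the remaining attributes are generalized.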
GDPR requirements
The European Union's General Data Protection Regulation (GDPR) requires that stored data on people in the EU undergo either anonymization or a pseudonymization process.{{Cite book |last=Skiera |first=Bernd |url=https://www.worldcat.org/oclc/1303894344 |title=The impact of the GDPR on the online advertising market |date=2022 |others=Klaus Miller, Yuxi Jin, Lennart Kraft, René Laub, Julia Schmitt |isbn=978-3-9824173-0-1 |location=Frankfurt am Main |oclc=1303894344}} GDPR Recital (26) establishes a very high bar for what constitutes anonymous data, thereby exempting the data from the requirements of the GDPR, namely “…information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” The European Data Protection Supervisor (EDPS) and the Spanish Agencia Española de Protección de Datos (AEPD) have issued joint guidance on the requirements for anonymity and exemption from GDPR requirements. According to the EDPS and AEPD, no one, including the data controller, should be able to re-identify data subjects in a properly anonymized dataset.{{cite web| title=Introduction to the Hash Function as a Personal Data Pseudonymisation Technique| url=https://edps.europa.eu/sites/edp/files/publication/19-10-30_aepd-edps_paper_hash_final_en.pdf| publisher=Spanish Data Protection Authority| date=October 2019| access-date=31 December 2023}}
Research by data scientists at Imperial College London and UCLouvain in Belgium,{{Cite news|url=https://www.nytimes.com/2019/07/23/health/data-privacy-protection.html?smid=nytcore-ios-share|title=Your Data Were 'Anonymized'? These Scientists Can Still Identify You|newspaper=The New York Times|date=23 July 2019|last1=Kolata|first1=Gina}} as well as a ruling by Judge Michal Agmon-Gonen of the Tel Aviv District Court,{{Cite web|url=https://www.law.co.il/computer-law/2019/07/01/nursing-companies-vs-ministry-of-defense/|title=Attm (TA) 28857-06-17 Nursing Companies Association v. Ministry of Defense| publisher=Pearl Cohen| date=2019| access-date=31 December 2023| language=Hebrew}} highlight the shortcomings of anonymisation in today's big data world. Anonymisation reflects an outdated approach to data protection that was developed when the processing of data was limited to isolated (siloed) applications, before big data processing made the widespread sharing and combining of data commonplace.{{Cite web|url=https://www.timesofisrael.com/data-is-up-for-grabs-under-outdated-israeli-privacy-law-think-tank-says| title=Data is up for grabs under outdated Israeli privacy law, think tank says| author=Solomon, S.| website=The Times of Israel| date=31 January 2019| access-date=31 December 2023}}
Anonymization of different types of data
Structured data:
Unstructured data:
- PDF files - Anonymization of text, tables, images, and scanned pages.
- DICOM - Anonymization of metadata, pixel data, overlay data, and encapsulated documents.{{Cite web|url=https://apicom.pro/dicom-de-identification/ |title=DICOM De-identification/Anonymization: Protecting Patient Privacy in Medical Imaging| date=2024| language=english}}
- Images
Removing identifying metadata from computer files is important for anonymizing them. Metadata removal tools are useful for achieving this.
See also
References
{{reflist}}
Further reading
- {{cite book|last=Raghunathan|first=Balaji|title=The Complete Book of Data Anonymization: From Planning to Implementation|date=June 2013|publisher=CRC Press|isbn=9781482218565}}
- {{cite book|last=Khaled El Emam, Luk Arbuckle|title=Anonymizing Health Data: Case Studies and Methods to Get You Started|date=August 2014|publisher=O'Reilly Media|isbn=978-1-4493-6307-9}}
- {{cite book|last=Rolf H. Weber, Ulrike I. Heinrich|title=Anonymization: SpringerBriefs in Cybersecurity|year=2012|publisher=Springer|isbn=9781447140665}}
- {{cite book|last=Aris Gkoulalas-Divanis, Grigorios Loukides|title=Anonymization of Electronic Medical Records to Support Clinical Analysis (SpringerBriefs in Electrical and Computer Engineering)|year=2012|publisher=Springer|isbn=9781461456674}}
- {{cite web|last=Pete Warden|title=Why you can't really anonymize your data|url=http://strata.oreilly.com/2011/05/anonymize-data-limits.html|publisher=O'Reilly Media, Inc.|access-date=17 January 2014|archive-url=https://web.archive.org/web/20140109052803/http://strata.oreilly.com/2011/05/anonymize-data-limits.html|archive-date=9 January 2014|url-status=dead}}
External links
- On the anonymization of Internet traffic: [http://www.caida.org/data/anonymization/index.xml Data Sharing and Anonymization Reading List]