Data curation
{{Short description|Organization of collected data}}
Data curation is the organization and integration of data collected from various sources. It involves annotating, publishing, and presenting the data so that its value is maintained over time and the data remains available for reuse and preservation. Data curation includes "all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data". Renée J. Miller, [http://comad.in/comad2014/Proceedings/Keynote2.pdf “Big Data Curation”] in 20th International Conference on Management of Data (COMAD) 2014, Hyderabad, India, December 17–19, 2014. In science, data curation may refer to the process by which experts extract important information from scientific texts, such as research articles, and convert it into an electronic format, such as an entry in a biological database. [http://biocreative.sourceforge.net/biocreative_glossary.html BioCreative Glossary]. Retrieved on 3 October 2016.
In the modern era of big data, the curation of data has become more prominent, particularly for software that processes high-volume, complex data systems.{{cite book |title=Handbook of Data Intensive Computing |last=Furht |first=Borko |author2=Armando Escalante |year=2011 |publisher=Springer Science & Business Media |isbn=9781461414155 |page=32 |url=https://play.google.com/store/books/details?id=gsk6XpZgGYwC |access-date=2 October 2016}} The term is also used within the humanities,{{cite book |title=Digital Curation in the Digital Humanities: Preserving and Promoting Archival and Special Collections |last=Sabharwal |first=Arjun |year=2015 |publisher=Chandos Publishing |isbn=9780081001783 |page=60 |url=https://play.google.com/store/books/details?id=GpiKBAAAQBAJ |access-date=2 October 2016}} where the growing volume of cultural and scholarly data from digital humanities projects requires the expertise and analytical practices of data curation. "An Introduction to Humanities Data Curation" by Julia Flanders and Trevor Muñoz http://guide.dhcuration.org/intro/. No longer available; archived at [https://web.archive.org/web/20170925151437/http://guide.dhcuration.org/contents/intro/ archive.org]. In broad terms, curation is a range of activities and processes carried out to create, manage, maintain, and validate a component. [http://www.pilin.net.au/Project_Documents/Glossary.htm Pilin Glossary]. No longer available; archived at [https://web.archive.org/web/20120330202110/http://www.pilin.net.au/Project_Documents/Glossary.htm archive.org]. Specifically, data curation is the attempt to determine what information is worth saving and for how long.{{Cite book|title=Big data, little data, no data: Scholarship in the networked world|last=Borgman|first=C|publisher=MIT Press|year=2015|isbn=978-0-262-02856-1|location=Cambridge, Massachusetts|pages=[https://archive.org/details/bigdatalittledat0000borg/page/13 13]|url=https://archive.org/details/bigdatalittledat0000borg/page/13}}
History and practice
The user, rather than the database itself, typically initiates data curation and maintains metadata. According to the University of Illinois' Graduate School of Library and Information Science, "Data curation is the active and on-going management of data through its lifecycle of interest and usefulness to scholarship, science, and education; curation activities enable data discovery and retrieval, maintain quality, add value, and provide for re-use over time."{{cite journal|last=Cragin|first=Melissa|author2=Heidorn, P. Bryan |author3=Palmer, Carole L. |author4= Smith, Linda C. |title=An Educational Program on Data Curation|journal=ALA Science & Technology Section Conference|year=2007|url=https://www.ideals.illinois.edu/handle/2142/3493|access-date=7 October 2013}} The data curation workflow is distinct from data quality management, data protection, lifecycle management, and data movement.{{cite book |title=Designing and Operating a Data Reservoir |last=Chessell |first=Mandy |author2=Nigel L Jones |author3=Jay Limburn |author4=David Radley |author5=Kevin Shank |year=2015 |publisher=IBM Redbooks |isbn=9780837440668 |pages=111–113 |url=https://play.google.com/store/books/details?id=-BWrCQAAQBAJ |access-date=2 October 2016}}
Census data has been available in tabulated punch card form since the early 20th century and has been electronic since the 1960s.{{Cite web|url=https://www.clir.org/wp-content/uploads/sites/6/2016/09/pub63watersgarrett.pdf|title=Preserving Digital Information (PDI) report|date=1996|access-date=2018-03-13}} The Inter-university Consortium for Political and Social Research (ICPSR) website marks 1962 as the date of their first Survey Data Archive.{{Cite web|url=https://www.icpsr.umich.edu/icpsrweb/content/about/history/|title=ICPSR: History|website=www.icpsr.umich.edu|language=en|access-date=2018-03-15}}
Detailed background on data libraries appeared in a 1982 issue of the Illinois journal ''Library Trends''.{{Cite journal|url=https://www.ideals.illinois.edu/handle/2142/7218|title=Library Trends 30 (3) Winter 1982: Data Libraries for the Social Sciences|first=Kathleen M.|last=Heim|date=November 29, 1982|via=www.ideals.illinois.edu|journal=Library Trends}} For historical background on the data archive movement, see "Social Scientific Information Needs for Numeric Data: The Evolution of the International Data Archive Infrastructure." Kathleen M. Heim, "Social Scientific Information Needs for Numeric Data: The Evolution of the International Data Archive Infrastructure." in Collection Management 9 (Spring 1987): 1-53. The exact curation process undertaken within any organisation depends on the volume of data, how much noise the data contains, and how the expected future use of the data affects its dissemination.
The crises in space data led to the 1999 creation of the Open Archival Information System (OAIS) model,{{Cite web|url=https://www.oclc.org/research/publications/library/2000/lavoie-oais.html|title=The OAIS reference model|language=en-US|access-date=2018-03-15|date=2015-12-09}} stewarded by the Consultative Committee for Space Data Systems (CCSDS), which was formed in 1982.{{Cite web|url=https://public.ccsds.org/default.aspx|title=CCSDS.org - The Consultative Committee for Space Data Systems (CCSDS)|website=public.ccsds.org|access-date=2018-03-14}}
The term data curation is sometimes used in the context of biological databases, where specific biological information is first obtained from a range of research articles and then stored within a specific category of a database. For instance, information about antidepressant drugs can be obtained from various sources and, after checking whether it is already available in a database, saved under the antidepressant category of a drug database. Enterprises also use data curation within their operational and strategic processes to ensure data quality and accuracy. E. Curry, A. Freitas, and S. O’Riáin, [http://3roundstones.com/led_book/led-curry-et-al.html “The Role of Community-Driven Data Curation for Enterprises,”] {{webarchive|url=https://web.archive.org/web/20120123161104/http://3roundstones.com/led_book/led-curry-et-al.html |date=2012-01-23 }} in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25-47. {{ISBN|978-1-4419-7664-2}} A. Freitas, E. Curry, [https://www.insight-centre.org/sites/default/files/publications/newhorizons_online.pdf “Big Data Curation,”] {{Webarchive|url=https://web.archive.org/web/20160913163450/https://www.insight-centre.org/sites/default/files/publications/newhorizons_online.pdf |date=2016-09-13 }} in New Horizons for a Data-Driven Economy, Springer (Open Access), 2015.
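The biological-database workflow described above (collect findings from several sources, check whether each is already recorded, then file new ones under a category) can be illustrated with a minimal sketch. The `Finding` and `CuratedDatabase` names, the record fields, and the placeholder values are all hypothetical and chosen only for illustration; real curated databases rely on far richer schemas, controlled vocabularies, and expert review.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Finding:
    """One fact extracted from a source article (hypothetical schema)."""
    drug: str
    statement: str
    source: str


@dataclass
class CuratedDatabase:
    """Toy in-memory database that groups findings by category name."""
    categories: dict[str, list[Finding]] = field(default_factory=dict)

    def contains(self, finding: Finding) -> bool:
        # Assumption: a finding is "already available" if any category holds
        # the same drug and statement, regardless of which source reported it.
        return any(
            f.drug == finding.drug and f.statement == finding.statement
            for entries in self.categories.values()
            for f in entries
        )

    def curate(self, findings: list[Finding], category: str) -> int:
        """File findings not yet present under `category`; return the count added."""
        added = 0
        for finding in findings:
            if not self.contains(finding):
                self.categories.setdefault(category, []).append(finding)
                added += 1
        return added


# Placeholder data: two sources report the same fact, a third reports a new one.
db = CuratedDatabase()
findings = [
    Finding("drug-A", "statement 1", "article-1"),
    Finding("drug-A", "statement 1", "article-2"),
    Finding("drug-B", "statement 2", "article-3"),
]
added = db.curate(findings, "antidepressant")
```

In this sketch the duplicate check happens before storage, so the second report of the same statement is skipped and only two entries are filed. A real curation pipeline might instead merge the extra source into the existing entry as additional evidence.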
Projects and studies
The Dissemination Information Packages (DIPs) for Information Reuse (DIPIR) project is studying research data produced and used by quantitative social scientists, archaeologists, and zoologists. The intended audience is researchers who use secondary data, as well as the digital curators, digital repository managers, data center staff, and others who collect, manage, and store digital information. Dissemination Information Packages for Information Reuse (DIPIR) project http://www.oclc.org/research/themes/user-studies/dipir.html
The Protein Data Bank was established in 1971 at Brookhaven National Laboratory, and has grown into a global project.{{cite web|title=RCSB PDB: About the PDB Archive and the RCSB PDB|url=https://www.rcsb.org/pages/aboutus|website=About the PDB Archive and the RCSB PDB|access-date=15 March 2018}} A database for three-dimensional structural data of proteins and other large biological molecules, the PDB contains over 120,000 structures, all standardized, validated against experimental data, and annotated.
FlyBase, the primary repository of genetic and molecular data for the insect family Drosophilidae, dates back to 1992. FlyBase annotates the entire Drosophila melanogaster genome.{{cite journal|last1=Gramates|first1=LS|last2=Marygold|first2=SJ|last3=dos Santos|first3=G|last4=Urbano|first4=J-M|last5=Antonazzo|first5=G|last6=Matthews|first6=BB|last7=Rey|first7=AJ|last8=Tabone|first8=CJ|last9=Crosby|first9=MA|last10=Emmert|first10=DB|last11=Falls|first11=K|last12=Goodman|first12=JL|last13=Hu|first13=Y|last14=Ponting|first14=L|last15=Schroeder|first15=AJ|last16=Strelets|first16=VB|last17=Thurmond|first17=J|last18=Zhou|first18=P|last19=FlyBase Consortium|title=FlyBase at 25: looking to the future|journal=Nucleic Acids Res.|date=2017|volume=45|issue=D1|pages=D663–D671|doi=10.1093/nar/gkw1016|pmid=27799470|pmc=5210523}}
The Linguistic Data Consortium is a data repository for linguistic data, dating back to 1992.{{cite web|title=About LDC |url=https://www.ldc.upenn.edu/about|website=Linguistic Data Consortium|access-date=15 March 2018}}
The Sloan Digital Sky Survey began surveying the night sky in 2000.{{cite web|title=Sloan Digital Sky Survey|url=http://www.sdss.org/|website=SDSS|access-date=15 March 2018}} Computer scientist Jim Gray, while working on the data architecture of the SDSS, championed the idea of data curation in the sciences.{{cite journal|last1=Palmer|first1=Carole L.|last2=Weber|first2=Nicholas M.|last3=Muñoz|first3=Trevor|last4=Renear|first4=Allen H.|title=Foundations of Data Curation: The Pedagogy and Practice of "Purposeful Work" with Research Data|journal=Archive Journal|date=June 2013|volume=3|hdl=2142/78099}}
DataNet was a research program of the U.S. National Science Foundation Office of Cyberinfrastructure, funding data management projects in the sciences.{{cite web
|url=https://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503141
|publisher=National Science Foundation
|title=Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Summary
|date=September 28, 2007
|access-date=March 15, 2018
}} DataONE (Data Observation Network for Earth) is one of the projects funded through DataNet, helping the environmental science community preserve and share data.{{cite web|title=What is DataONE?|url=https://www.dataone.org/what-dataone|website=What is DataONE?|access-date=15 March 2018|archive-date=26 April 2019|archive-url=https://web.archive.org/web/20190426165259/https://www.dataone.org/what-dataone|url-status=dead}}
See also
{{Portal|Literature}}
- Biocurator
- Data archaeology
- Data degradation
- Data format management
- Data preservation
- Data stewardship
- Data wrangling
- Digital curation{{snd}}the curation of published documents, rather than raw data
- Digital preservation
- Informationist{{snd}}an individual with extensive expertise in data curation
References
{{Reflist|2}}
External links
- Curation of ecological and environmental data: [http://www.dataone.org/ DataONE]
- Data management tools and services spanning multiple scientific disciplines: [http://www.dataconservancy.org/ DataConservancy]
{{data}}