OpenRefine
{{Short description|Application for data cleanup and data transformation}}
{{Infobox software
| name = OpenRefine
| title = OpenRefine
| logo = OpenRefine favicon (2018-present).svg
| logo size = 200px
| developer = Freebase, then Google, now open source community
| released = {{Start date and age|2010|11|10}}
| latest release version = {{wikidata|property|edit|reference|P348}}
| latest release date = {{start date and age|{{wikidata|qualifier|P348|P577}}}}
| programming language = Java{{cite web|title=OpenRefine/OpenRefine - GitHub|website=GitHub|url=https://github.com/OpenRefine/OpenRefine|access-date=25 June 2017}}
| platform = Microsoft Windows, Linux, macOS
| language = English, Italian, Chinese, Japanese, French, German
| genre = {{Plainlist|
}}
| license = BSD License
}}
OpenRefine is an open-source desktop application for data cleanup and transformation to other formats, an activity commonly known as data wrangling.{{Cite web|url=http://openrefine.org/|title=openrefine.github.com|website=openrefine.org}} It is similar to spreadsheet applications, and can handle spreadsheet file formats such as CSV, but it behaves more like a database.
It operates on rows of data which have cells under columns, similar to the manner in which relational database tables operate. OpenRefine projects consist of one table, whose rows can be filtered using facets that define criteria (for example, showing rows where a given column is not empty).
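The idea of a facet can be illustrated with a short Python sketch. It is not OpenRefine code: the `rows` table, the `is_blank` rule and the helper names are assumptions made for illustration. It shows a facet both as a summary of the values in a column and as a criterion for filtering rows.

```python
from collections import Counter

# Hypothetical single-table project: a list of rows, each mapping column -> cell.
rows = [
    {"name": "Alice", "city": "Paris"},
    {"name": "Bob", "city": ""},
    {"name": "Carol", "city": "Paris"},
    {"name": "Dave", "city": None},
]


def is_blank(cell):
    """Treat None and whitespace-only strings as empty cells (an assumed rule)."""
    return cell is None or str(cell).strip() == ""


def facet_counts(rows, column):
    """Count how many rows fall under each distinct value of a column."""
    return Counter(
        "(blank)" if is_blank(r.get(column)) else r[column] for r in rows
    )


def filter_rows(rows, column, predicate):
    """Return only the rows whose cell in `column` satisfies `predicate`."""
    return [r for r in rows if predicate(r.get(column))]


# Facet criterion from the text: show rows where a given column is not empty.
non_empty = filter_rows(rows, "city", lambda cell: not is_blank(cell))
```

Here `non_empty` holds only the rows for Alice and Carol, and later operations would apply to that visible subset.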
Unlike in spreadsheets, most operations in OpenRefine are applied to all visible rows at once, for example, the transformation of all cells in all rows under one column,{{cite web|title=Editing by transforming: Cell Editing wiki page from Refine documentation|url=https://code.google.com/p/google-refine/wiki/CellEditing#Editing_by_Transforming|access-date=18 April 2012}} or the creation of a new column based on existing data. Actions performed on a dataset are stored in the project and can be 'replayed' on other datasets. Formulas are not stored in cells; they are used to transform the data, and each transformation is applied only once.{{cite web|title=Comparison with spreadsheet software: Cell Editing wiki page in Refine documentation|url=https://code.google.com/p/google-refine/wiki/CellEditing#Comparison_with_Spreadsheets_Software|access-date=18 April 2012}} Formula expressions can be written in General Refine Expression Language (GREL),{{cite web|title=General Refine expression language|url=https://github.com/OpenRefine/OpenRefine/wiki/General-Refine-Expression-Language|website=GitHub|date=3 April 2013|access-date=16 August 2013}} in Jython (a Java implementation of Python), and in Clojure.{{cite web|title=Expressions: Refine documentation|url=https://code.google.com/p/google-refine/wiki/DocumentationForUsers#Expressions|access-date=18 April 2012}}
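The following Python sketch illustrates the two ideas in this paragraph under simplifying assumptions. The operation format and the `apply_operations` helper are hypothetical and do not reflect OpenRefine's actual history format. A transformation rewrites every cell in a column once, leaving plain values rather than formulas, and the recorded list of operations can then be replayed on another dataset.

```python
def apply_operations(rows, operations):
    """Apply each recorded operation, in order, to a copy of the rows."""
    result = [dict(row) for row in rows]  # leave the input dataset untouched
    for op in operations:
        for row in result:
            # The cell ends up holding the computed value, not the formula.
            row[op["column"]] = op["transform"](row[op["column"]])
    return result


# A recorded history: trim whitespace, then normalize capitalization.
history = [
    {"column": "name", "transform": str.strip},
    {"column": "name", "transform": str.title},
]

dataset_a = [{"name": "  ada lovelace "}, {"name": "ALAN TURING"}]
dataset_b = [{"name": "grace hopper  "}]

cleaned_a = apply_operations(dataset_a, history)
# The same history is replayed unchanged on a second dataset.
cleaned_b = apply_operations(dataset_b, history)
```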
The program operates as a local web app: it starts a web server and opens the default browser to 127.0.0.1:3333.
Uses
- Cleaning messy data: for example, a text file containing semi-structured data can be edited using transformations, facets and clustering until the data is cleanly structured.{{cite web|title=Screencast: Google Refine 2.0 - Introduction (1 of 3) - editing government data|website = YouTube| date=19 July 2011 |url=https://www.youtube.com/watch?v=B70J_H_zAWM|access-date=18 April 2012}}
- Transformation of data: converting values to other formats, normalizing and denormalizing.
- Parsing data from web sites: OpenRefine has a URL fetch feature and includes the jsoup HTML parser and DOM engine.{{cite web|title=Stripping HTML: Refine documentation wiki page|url=https://code.google.com/p/google-refine/wiki/StrippingHTML|access-date=18 April 2012}}
- Adding data to a dataset by fetching it from web services (e.g. services returning JSON).{{cite web|title=FetchingURLsFromWebServices wiki page: Refine documentation|url=https://code.google.com/p/google-refine/wiki/FetchingURLsFromWebServices|access-date=18 April 2012}} For example, this can be used for geocoding addresses to geographic coordinates.{{cite web|title=Screencast: Google Refine 2.0 - Data Augmentation (3 of 3) - using Openstreetmap Nominatim for geocoding and Freebase for augmentation|website = YouTube| date=19 July 2011 |url=https://www.youtube.com/watch?v=5tsyz3ibYzk|access-date=18 April 2012}}
- Aligning to Wikidata (formerly Freebase{{cite web|title=Schema Alignment: Refine documentation wiki page|url=https://code.google.com/p/google-refine/wiki/SchemaAlignment|access-date=18 April 2012}}): this involves reconciliation — mapping string values in cells to entities in Wikidata.{{cite web|url=https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation|access-date=12 March 2017|title=OpenRefine documentation: Reconciliation|website=GitHub}}
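The clustering mentioned in the first item above can be sketched as key-collision clustering: each value is reduced to a normalized key, and values that share a key are proposed as variants of the same thing. The Python below is a simplified illustration of that general technique, not OpenRefine's implementation. The `fingerprint` and `cluster` names and the exact normalization steps are assumptions.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a key: fold to ASCII, lowercase, drop punctuation, sort unique tokens."""
    folded = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = folded.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values by fingerprint; keep groups with more than one distinct value."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return {k: sorted(members) for k, members in groups.items() if len(members) > 1}


names = ["Café Noir", "cafe noir", "Noir, Cafe", "Blue Door", "blue door.", "Red Hen"]
clusters = cluster(names)
```

In this example the three spellings of "Café Noir" collide on one key and the two spellings of "Blue Door" on another, while "Red Hen" has no variants and is left out.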
Supported formats
Import is supported from the following formats:{{cite web|title=Importers: Refine documentation wiki page|url=https://code.google.com/p/google-refine/wiki/Importers|access-date=18 April 2012}}
- TSV, CSV
- Text file with custom separators or columns split by fixed width
- XML
- RDF triples (RDF/XML and Notation3 serialization formats)
- JSON
- Google Spreadsheets{{cite web|title=Changelog for 2.5|url=https://code.google.com/p/google-refine/wiki/ChangesFor2p5|access-date=18 April 2012}}
If input data is in a non-standard text format, it can be imported as whole lines, without splitting into columns, and the columns can be extracted later with OpenRefine's tools. Archived and compressed files are supported (.zip, .tar.gz, .tgz, .tar.bz2, .gz, or .bz2), and OpenRefine can download input files from a URL. To use web pages as input, it is possible to import a list of URLs and then invoke a URL fetch function.
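The line-based workflow can be illustrated with a short Python sketch. The sample text, the column names and the regular expression are all assumptions made for the example. In OpenRefine the extraction step would be done with its own column tools rather than with this code.

```python
import re

# Hypothetical input in a non-standard text format (neither CSV nor TSV).
raw_text = """2012-04-18 ERROR disk full
2012-04-19 INFO backup finished
malformed line"""

# Step 1: import whole lines, one cell per row, without splitting into columns.
rows = [{"line": line} for line in raw_text.splitlines()]

# Step 2: extract columns later (assumed layout: date, level, free-text message).
PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(\w+)\s+(.*)$")

for row in rows:
    match = PATTERN.match(row["line"])
    # Lines that do not fit the pattern keep empty cells instead of being dropped.
    row["date"], row["level"], row["message"] = (
        match.groups() if match else ("", "", "")
    )
```

Keeping the original `line` column alongside the extracted ones means that rows which failed to parse can still be found and fixed afterwards.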
Export is supported in the following formats:{{cite web|title=Exporting: Refine documentation wiki page|url=https://code.google.com/p/google-refine/wiki/Exporters|access-date=18 April 2012}}
- TSV
- CSV
- Microsoft Excel
- HTML table
- Google Spreadsheets
- Templating exporter: it is possible to define a custom template for outputting data, for example as a MediaWiki table.
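The principle behind a templating export can be sketched in Python: a fixed prefix, a template filled in once per row, and a fixed suffix. The function name and the prefix, row-template and suffix structure below are assumptions chosen for illustration, and the template syntax is Python's, not OpenRefine's.

```python
def to_mediawiki_table(rows, columns):
    """Render rows as a MediaWiki table from a prefix, a row template and a suffix."""
    prefix = '{| class="wikitable"\n! ' + " !! ".join(columns)
    # One placeholder per column, e.g. "|-\n| {name} || {city}".
    row_template = "|-\n| " + " || ".join("{" + c + "}" for c in columns)
    suffix = "|}"
    body = [row_template.format(**row) for row in rows]
    return "\n".join([prefix, *body, suffix])


rows = [
    {"name": "Alice", "city": "Paris"},
    {"name": "Bob", "city": "Lyon"},
]
table = to_mediawiki_table(rows, ["name", "city"])
```

The resulting `table` string starts with a header row listing `name` and `city`, followed by one table row per input row.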
Whole OpenRefine projects in native format can be exported as a .tar.gz archive.
Development
OpenRefine started life as Freebase Gridworks, developed by Metaweb, and has been available as open source since January 2010.{{Cite web|url=https://code.google.com/archive/p/google-refine/source|title=Google Code Archive - Long-term storage for Google Code Project Hosting.|website=code.google.com}} On 16 July 2010, Google acquired Metaweb,{{cite news|title=Google Official Blog: Deeper understanding with Metaweb|url=http://googleblog.blogspot.com/2010/07/deeper-understanding-with-metaweb.html|access-date=18 April 2012}} the creators of Freebase, and on 10 November 2010 renamed Freebase Gridworks to Google Refine, releasing version 2.0.{{cite news|title=Google Opensource blog: Announcing Google Refine 2.0, a power tool for data wranglers|url=http://google-opensource.blogspot.com/2010/11/announcing-google-refine-20-power-tool.html|access-date=18 April 2012}} On 2 October 2012, original author David Huynh announced that Google would soon stop its active support of Google Refine.{{Cite web|url=https://groups.google.com/forum/#!topic/openrefine/a3R6afKb4-4|title=Google Groups|website=groups.google.com}}{{Cite web|url=http://kb.refinepro.com/2012/10/from-freebase-gridworks-to-google.html|title=From Freebase Gridworks to Google Refine and now OpenRefine}}[http://openrefine.org/OpenRefine/community OpenRefine] {{Webarchive|url=https://web.archive.org/web/20160925212518/http://openrefine.org/OpenRefine/community |date=2016-09-25 }}. OpenRefine. Retrieved on 2013-08-16. Since then, the codebase has been in transition to an open source project named OpenRefine.[https://code.google.com/p/google-refine/ google-refine - Google Refine, a power tool for working with messy data (formerly Freebase Gridworks) - Google Project Hosting]. Code.google.com. Retrieved on 2013-08-16.
References
{{Reflist|35em}}
External links
- {{Official website|http://openrefine.org/}}
- [https://commons.wikimedia.org/wiki/File:OpenRefine_Beginners_Tutorial_by_Emma_Carroll.webm#%7B%7Bint%3Alicense-header%7D%7D OpenRefine Beginners Tutorial by Emma Carroll]
{{Google FOSS}}
Category:Free software programmed in Java (programming language)
Category:Data management software
Category:Extract, transform, load tools
Category:Cross-platform free software
Category:Free software for Linux
Category:Free software for Windows