OCRopus
{{Use dmy dates|date=August 2021}}
{{Infobox software
| name = OCRopus
| logo =
| screenshot =
| caption =
| author =
| developer = Thomas Breuel, DFKI
| released = {{Start date and age|df=yes|2007|04|09}}{{Cite news|url=https://developers.googleblog.com/2007/04/announcing-ocropus-open-source-ocr.html|title=Announcing the OCRopus Open Source OCR System|last=Breuel|first=Thomas|date=2007-04-09|work=Google Developers Blog|access-date=2017-12-29|language=en-US}}
| latest release version = {{wikidata|property|preferred|references|edit|P348|P548=Q2804309}}
| latest release date = {{Start date and age|{{wikidata|qualifier|preferred|single|P348|P548=Q2804309|P577}}|df=yes}}
| latest preview version = ocropus4
| latest preview date =
| programming language = C++ and Python
| operating system = FreeBSD, Linux, Mac OS X
| platform =
| language =
| status =
| genre = Optical character recognition
| license = Apache License v2.0
| website = {{url|https://ocropus.github.io/}}
}}
OCRopus is a free document analysis and optical character recognition (OCR) system released under the Apache License v2.0. It has a modular design built around command-line tools.
Its development is led by Thomas Breuel of the German Research Centre for Artificial Intelligence (DFKI) in Kaiserslautern, Germany, and was sponsored by Google.
Description
OCRopus was designed primarily for high-volume book digitization projects, such as Google Books, the Internet Archive, or libraries, and aims to support a large number of languages and fonts.{{Cite book|last=Breuel|first=Thomas|title=Proceedings of the International Workshop on Multilingual OCR - MOCR '09|chapter=Recent progress on the OCRopus OCR system|date=2009 |location=New York, NY, USA|publisher=ACM|pages=2:1–2:10|doi=10.1145/1577802.1577805|isbn=9781605586984|s2cid=16920122}} It can also be used for desktop and office applications or for applications for visually impaired people.
OCRopus has three main components, which perform:
- Document layout analysis
- Optical character recognition
- Application of statistical language models
Each component is provided by one or more command-line scripts. This modular approach allows custom workflows to be assembled and individual steps to be exchanged.
By default, OCRopus comes with a model for English texts and a model for text in Fraktur. These models refer to the script and are largely independent of the actual language.{{Cite web|url=https://github.com/ocropus/ocropy/wiki/Models|title=Models|website=ocropy wiki|access-date=2018-01-05}} New characters or language variants can be trained either from scratch or added later.
Recent text recognition is based on recurrent neural networks (LSTM) and does not require a language model. This makes it possible to train language-independent models for which good recognition results in English, German and French have been shown at the same time.{{Cite book|last1=Ul-Hasan|first1=Adnan|last2=Breuel|first2=Thomas M.|title=Proceedings of the 4th International Workshop on Multilingual OCR - MOCR '13|chapter=Can we build language-independent OCR using LSTM networks?|date=2013 |location=New York, NY, USA|publisher=ACM|pages=9:1–9:5|doi=10.1145/2505377.2505394|isbn=9781450321143|s2cid=15054318}} In addition to the Latin script, there are results for other scripts such as Sanskrit, Urdu, Devanagari, and Greek.
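The following is a minimal sketch of how per-frame network outputs can be turned into text without a language model. It assumes a connectionist temporal classification (CTC) style output with class 0 as the "blank" symbol, which is a common setup for LSTM line recognizers. The function name, the toy alphabet and the probabilities are hypothetical, and this is not the decoding code shipped with OCRopus.

```python
from typing import List, Sequence

BLANK = 0  # assumption: class index 0 is the CTC blank symbol


def greedy_ctc_decode(frames: Sequence[Sequence[float]], alphabet: str) -> str:
    """Greedy decoding: best class per frame, merge repeats, drop blanks.

    frames[t][k] is the score of class k at time step t.
    Class k (k >= 1) corresponds to alphabet[k - 1].
    """
    chars: List[str] = []
    previous = BLANK
    for scores in frames:
        # Index of the highest-scoring class in this frame.
        best = max(range(len(scores)), key=lambda k: scores[k])
        # Emit a character only when a new non-blank class starts.
        if best != BLANK and best != previous:
            chars.append(alphabet[best - 1])
        previous = best
    return "".join(chars)


# Hypothetical outputs for a two-letter alphabet: columns are blank, 'a', 'b'.
toy_frames = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.80, 0.10, 0.10],
    [0.10, 0.20, 0.70],
    [0.10, 0.10, 0.80],
    [0.10, 0.80, 0.10],
]
decoded = greedy_ctc_decode(toy_frames, "ab")
```

Because each character is read directly from the frame-wise scores, no dictionary or language-specific statistics are involved, which is what makes a single model usable across several languages that share a script.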
Very good recognition rates can be achieved through appropriate training. This extra effort is particularly worthwhile for difficult documents or scripts that are no longer common today and are not the focus of other OCR software.{{Cite journal|last=Springmann|first=Uwe|date=2016-12-01|title=OCR für alte Drucke|journal=Informatik-Spektrum|language=de|volume=39|issue=6|pages=459–462|doi=10.1007/s00287-016-1004-3|s2cid=26680054|issn=0170-6012}}{{Cite book|last1=Simistira|first1=F.|last2=Ul-Hassan|first2=A.|last3=Papavassiliou|first3=V.|last4=Gatos|first4=B.|last5=Katsouros|first5=V.|last6=Liwicki|first6=M.|title=2015 13th International Conference on Document Analysis and Recognition (ICDAR) |chapter=Recognition of historical Greek polytonic scripts using LSTM networks |date=August 2015|pages=766–770|doi=10.1109/icdar.2015.7333865|isbn=978-1-4799-1805-8|s2cid=39049104|url=http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-72205 }}
History
On 9 April 2007, OCRopus was announced as a Google-sponsored project to develop advanced OCR technologies. Funding was granted for a period of three years and covered in particular PhD and postdoctoral positions at DFKI and the University of Kaiserslautern. In return, OCRopus was also used for automatic text recognition in Google Book Search.{{Cite web|url=https://www.dfki.de/web/forschung/projekte?pid=396&cl=en|title=Research project OCRopus|website=dfki.de|language=en|access-date=2018-01-05}} The software was released under an open-source license from the start to facilitate collaboration between industrial and academic research.{{Cite book|last=Breuel|first=Thomas M.|editor1-first=Berrin A|editor1-last=Yanikoglu|editor2-first=Kathrin|editor2-last=Berkner|title=Document Recognition and Retrieval XV|chapter=The OCRopus open source OCR system|date=2008-01-28 |volume=6815|pages=68150F–68150F–15|doi=10.1117/12.783598|series=Document Recognition and Retrieval XV|citeseerx=10.1.1.99.8505|bibcode=2008SPIE.6815E..0FB|s2cid=14728635}} OCRopus has received further funding from the Andrew W. Mellon Foundation and the BMBF.{{Cite web|url=http://code.google.com:80/p/ocropus#Acknowledgements|title=ocropus project website|date=January 2019|website=Google Project Hosting|archive-url=https://web.archive.org/web/20121224105419/http://code.google.com/p/ocropus#Acknowledgements|archive-date=24 December 2012|url-status=dead}}
The first alpha version, 0.1, was released on 22 October 2007. Several pre-releases followed between December 2007 and May 2009, leading to the stable version 0.4.4 in March 2010.{{Cite web|url=https://github.com/ocropus/ocropy/wiki/Older-versions|title=Older versions - ocropy|website=GitHub|access-date=2018-01-05}} Originally, the software was developed in C++, Python and Lua with Jam as a build system. The source code was then completely refactored into Python modules and released as version 0.5 (June 2012).{{Cite web|url=https://groups.google.com/forum/#!topic/ocropus/S73OMtJdVmw/discussion|title=OCRopus 0.5|date=2012-06-02|website=Google Groups}}
Initially, Tesseract was used as the only text recognition module. From 2009 (version 0.4), Tesseract was supported only as a plugin, and a self-developed text recognizer (also segment-based) was used instead.[http://groups.google.com/group/ocropus/msg/96c4081a3213dbcc OCRopus doesn't even link with Tesseract by default]. This recognizer was then used together with OpenFST[http://www.openfst.org/ Official OpenFST website]. for language modeling after the recognition step. From 2013 onwards, an additional recognizer based on recurrent neural networks (LSTM) was offered, which became the only recognizer with the release of version 1.0 in November 2014.{{Cite web|url=https://github.com/ocropus/ocropy/releases/tag/v1.0|title=ocropy - release v1.0|date=2014-11-02|website=GitHub|access-date=2018-01-05}}{{Cite book|last1=Breuel|first1=T. M.|last2=Ul-Hasan|first2=A.|last3=Al-Azawi|first3=M. A.|last4=Shafait|first4=F.|title=2013 12th International Conference on Document Analysis and Recognition |chapter=High-Performance OCR for Printed English and Fraktur Using LSTM Networks |date=August 2013|pages=683–687|doi=10.1109/icdar.2013.140|isbn=978-0-7695-4999-6|s2cid=7244356}}
The source code is hosted on GitHub and is maintained and developed by a developer community.{{Citation|title=ocropy: Python-based tools for document analysis and OCR|url=https://github.com/ocropus/ocropy|access-date=2018-01-05|website=GitHub}} The current version of OCRopus is 1.3.3 (December 2017).{{Cite web|url=https://github.com/ocropus/ocropy/releases|title=Releases ocropy|website=GitHub|access-date=2018-01-05}}
The OCR software kraken, which is used by the transcription platform eScriptorium, is a fork of OCRopus that added support for right-to-left scripts.{{Cite web|url=https://dh-abstracts.library.virginia.edu/works/9912|title=Kraken - a Universal Text Recognizer for the Humanities|access-date=2024-01-23}} Calamari is another fork, which is based on kraken.
Thomas Breuel also developed a successor, OCRopus 2, and is actively working on OCRopus 4.{{Cite web|url=https://github.com/ocropus/|title=The OCRopus OCR System and Related Software|website=GitHub|access-date=2021-08-27}}
Usage
OCRopus is used from the command line. Once installed, it is invoked by specifying the input images. It either prints the recognized text to standard output or writes it to files as hOCR (an HTML-based format), which can then be converted to a searchable PDF. If more precise control is needed, options can be specified on the command line to perform specific operations (e.g. recognizing a single line).{{Cite web|url=https://github.com/ocropus/ocropy/wiki|title=ocropy wiki|website=GitHub|access-date=2017-12-30}}
Example OCRopus calls to recognize the text in an image:
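As an illustration of what can be done with hOCR output, the sketch below extracts text lines and their bounding boxes using only the Python standard library. It relies on the general hOCR convention that a line is an element with class ocr_line whose title attribute carries a bbox property. The sample markup and all names in the code are hypothetical, and the exact markup written by OCRopus may differ.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLineParser(HTMLParser):
    """Collects (bbox, text) pairs for every span with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0    # span nesting depth inside the current line
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            if tag == "span":
                self._depth += 1  # nested span, e.g. a word inside the line
        elif tag == "span" and "ocr_line" in (attrs.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.lines.append((self._bbox, "".join(self._text).strip()))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def extract_lines(hocr):
    parser = HocrLineParser()
    parser.feed(hocr)
    parser.close()
    return parser.lines


# Hypothetical hOCR fragment, for illustration only.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 800 600'>
<span class='ocr_line' title='bbox 10 20 300 60'>First line</span>
<span class='ocr_line' title='bbox 10 70 280 110'>Second line</span>
</div>"""

lines = extract_lines(SAMPLE)
```

The bounding boxes are what allow a later conversion step to place invisible text over the page image when producing a searchable PDF.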
# perform binarization
ocropus-nlbin tests/ersch.png -o book
# perform page layout analysis
ocropus-gpageseg book/0001.bin.png
# perform text line recognition (with a fraktur model)
ocropus-rpred -m models/fraktur.pyrnn.gz book/0001/*.bin.png
# generate HTML output
ocropus-hocr book/0001.bin.png -o book/0001.html
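The same four steps can also be driven from Python. The sketch below only assembles the command lines shown above and, optionally, runs them in order. The function names and default arguments are hypothetical, and the page number 0001 is assumed to match the file names produced in the example above.

```python
import glob
import subprocess


def build_steps(image, workdir="book", model="models/fraktur.pyrnn.gz", page="0001"):
    """Return the argument lists for the four steps of the example above."""
    page_image = f"{workdir}/{page}.bin.png"
    return [
        # binarization
        ["ocropus-nlbin", image, "-o", workdir],
        # page layout analysis
        ["ocropus-gpageseg", page_image],
        # text line recognition; the pattern is expanded just before running
        ["ocropus-rpred", "-m", model, f"{workdir}/{page}/*.bin.png"],
        # HTML (hOCR) output
        ["ocropus-hocr", page_image, "-o", f"{workdir}/{page}.html"],
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first failure."""
    for step in steps:
        argv = []
        for arg in step:
            # Expand wildcards here because no shell is involved.
            matches = sorted(glob.glob(arg)) if "*" in arg else []
            argv.extend(matches or [arg])
        subprocess.run(argv, check=True)


steps = build_steps("tests/ersch.png")
for step in steps:
    print(" ".join(step))
```

Calling run_steps(steps) would execute the commands and requires the OCRopus tools to be installed and on the search path. Because every step reads and writes ordinary files, a single step can be replaced by another tool without touching the rest of the list.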
Other tools concentrate on the training part of OCRopus. There are OCRopus models to extract text from Latin, Greek, Cyrillic and Indic scripts.{{Cite web|url=https://github.com/ocropus/ocropy/wiki/Models|title=ocropy models|website=GitHub|language=en|access-date=2018-03-13}}
References
{{reflist}}
External links
{{Portal|Free and open-source software}}
- {{github|ocropus/ocropy}}
- [https://github.com/ocropus/ocropy/wiki Ocropy wiki on GitHub]
- [http://pubs.iupr.org/ IUPR Publication Server] (papers behind many of the algorithms used in OCRopus)
{{OCR}}
Category:Free software programmed in C++
Category:Free software programmed in Python