Apache Tika

{{Short description|Open-source content analysis framework}}

{{Infobox software

| name = Tika

| logo = Apache Tika Logo.svg

| screenshot =

| caption =

| developer = Apache Software Foundation

| latest release version = {{wikidata|property|edit|reference|P548=Q2804309|P348}}

| latest release date = {{start date and age|{{wikidata|qualifier|P548=Q2804309|P348|P577}}}}

| latest preview version =

| latest preview date =

| repo = {{URL|https://gitbox.apache.org/repos/asf?p{{=}}tika.git|Tika Repository}}

| programming language = Java

| operating system = Cross-platform

| platform =

| genre = Search and index API

| license = Apache License 2.0

| website = {{URL|http://tika.apache.org/}}

}}

Apache Tika is a content detection and analysis framework, written in Java and stewarded by the Apache Software Foundation.{{Cite web|url=http://tika.apache.org/|title=Apache Tika|access-date=2016-04-15}} It detects and extracts metadata and text from over a thousand different file types. In addition to a Java library, it provides server and command-line editions suitable for use from other programming languages.

History

The project originated as part of the Apache Nutch codebase, where it provided content identification and extraction during crawling. In 2007, it was separated out to make it more extensible and usable by content management systems, other Web crawlers, and information retrieval systems. The standalone Tika was founded by Jérôme Charron, Chris Mattmann and Jukka Zitting.{{Cite web|url=http://wiki.apache.org/incubator/TikaProposal|title=Tika Proposal|access-date=2016-04-15}} In 2011, Chris Mattmann and Jukka Zitting published the book "Tika in Action" with Manning, and the project released version 1.0.

Features

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types. For most of the more common and popular formats,{{cite web|url=http://tika.apache.org/1.12/formats.html| title= The Apache Software Foundation| website=Apache Tika formats page|access-date=16 April 2016}} Tika then provides content extraction, metadata extraction and language identification capabilities.
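
As a rough illustration of what identifying a file type from its content can involve, the following Java sketch matches the leading bytes of a document against a small table of well-known signatures and returns a MIME type. It is not Tika's code and does not use Tika's API: the class name <code>MagicSniffSketch</code>, the <code>detect</code> method and the three-entry signature table are all hypothetical, chosen only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of signature-based type detection; not part of Tika.
public class MagicSniffSketch {

    // MIME type -> leading bytes that identify it (a tiny, illustrative table).
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();

    static {
        SIGNATURES.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        SIGNATURES.put("application/zip", new byte[] {'P', 'K', 3, 4});
    }

    // Returns the first MIME type whose signature prefixes the given bytes,
    // or the generic binary type when nothing matches.
    static String detect(byte[] head) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    // True when data begins with every byte of prefix, in order.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample));
    }
}
```

A table of three signatures covers only a tiny fraction of the formats mentioned above. A production detector would normally combine byte signatures with other hints such as file name extensions.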

It can also get text from images by using the OCR software Tesseract.{{Cite web|url=https://cwiki.apache.org/confluence/display/tika/TikaOCR|date=2019-03-26|publisher=Apache Tika|title=TikaOCR|access-date=2019-12-02}}

While Tika is written in Java, it is widely used from other languages.{{Cite web|url=https://wiki.apache.org/tika/API%20Bindings%20for%20Tika|title=API Bindings for Tika|publisher=Apache Tika|access-date=2016-04-17}} The RESTful server and the command-line tool give non-Java programs access to Tika's functionality.
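
Because the server speaks plain HTTP, a client needs only an HTTP library. The following Java sketch uses the standard <code>java.net.http</code> client to send a document's bytes to a Tika server and print the extracted text. It assumes a server is already running locally at <code>http://localhost:9998</code> and that text extraction is exposed by a <code>PUT</code> request to the <code>/tika</code> path. Neither detail is stated in this article, so both should be checked against the documentation of the server version in use. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical client sketch; the base address and "/tika" path are assumptions.
public class TikaServerSketch {

    // Builds a PUT request carrying the raw document bytes and asking for plain text.
    static HttpRequest buildExtractRequest(URI serverBase, byte[] document) {
        return HttpRequest.newBuilder(serverBase.resolve("/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: java TikaServerSketch <file>");
            return;
        }
        byte[] document = Files.readAllBytes(Path.of(args[0]));
        HttpRequest request =
                buildExtractRequest(URI.create("http://localhost:9998"), document);

        // Send the document and print whatever text the server returns.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

The same request can be issued from any language with an HTTP client, which is what makes the server edition usable outside Java.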

Notable uses

Tika is used by financial institutions including the Fair Isaac Corporation (FICO){{Cite web|url=http://www.fico.com/en/newsroom/fico-to-engage-kaggles-community-of-180000-data-scientists-to-drive-innovation-in-the-fico-analytic-cloud|title=FICO to Engage Kaggle's Community of 180,000 Data Scientists to Drive Innovation in the FICO Analytic Cloud {{!}} FICO|website=FICO {{!}} Decisions|access-date=2016-04-15|archive-url=https://web.archive.org/web/20160603111240/http://www.fico.com/en/newsroom/fico-to-engage-kaggles-community-of-180000-data-scientists-to-drive-innovation-in-the-fico-analytic-cloud|archive-date=2016-06-03|url-status=dead}} and Goldman Sachs,{{Cite news|url=http://www.informationweek.com/software/enterprise-applications/goldman-sachs-puts-elasticsearch-to-work/d/d-id/1321778|title=Goldman Sachs Puts Elasticsearch To Work - InformationWeek|work=InformationWeek|access-date=2017-06-21|language=en}} by NASA and academic researchers,{{Cite web|url=https://opensource.com/life/15/4/interview-annie-burgess-USC-JPL|title=Studying polar data with the help of Apache Tika|website=Opensource.com|access-date=2016-04-15}} and by major content management systems including Drupal{{Cite web|url=https://www.drupal.org/project/text_extract|title=Text Extract for Drupal using Tika {{!}} Drupal.org|website=www.drupal.org|date=30 July 2012 |access-date=2016-04-15}} and Alfresco.{{Cite web|url=https://wiki.alfresco.com/wiki/Content_Transformation_and_Metadata_Extraction_with_Apache_Tika|title=Content Transformation and Metadata Extraction with Apache Tika - alfrescowiki|website=wiki.alfresco.com|date=5 June 2015 |access-date=2016-04-15}} These users rely on it to analyze large amounts of content and to make that content available in common formats using information retrieval techniques.

On April 4, 2016,{{Cite web|url=https://www.forbes.com/sites/thomasbrewster/2016/04/05/panama-papers-amazon-encryption-epic-leak|title=From Encrypted Drives To Amazon's Cloud -- The Amazing Flight Of The Panama Papers|last=Fox-Brewster|first=Thomas|website=Forbes|access-date=2016-04-15}} Forbes published an article identifying Tika as one of the key technologies used by more than 400 journalists to analyze 11.5 million leaked documents, which exposed an international scandal involving world leaders storing money in offshore shell corporations. The leaked documents and the project to analyze them are referred to as the Panama Papers.

References