Software:Apache Tika

Tika
Developer(s)	Apache Software Foundation
Repository	Tika Repository
Written in	Java
Operating system	Cross-platform
Type	Search and index API
License	Apache License 2.0
Website	tika.apache.org

Short description: Open-source content analysis framework

Apache Tika is a content detection and analysis framework written in Java and stewarded by the Apache Software Foundation.^[1] It detects and extracts metadata and text from over a thousand different file types. In addition to a Java library, it provides server and command-line editions suitable for use from other programming languages.

History

The project originated as part of the Apache Nutch codebase, where it provided content identification and extraction during crawling. In 2007 it was split out to make it more extensible and usable by content management systems, other Web crawlers, and information retrieval systems. The standalone Tika was founded by Jérôme Charron, Chris Mattmann and Jukka Zitting.^[2] In 2011, Chris Mattmann and Jukka Zitting published the Manning book "Tika in Action", and the project released version 1.0.

Features

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types. For most of the more common and popular formats,^[3] Tika then provides content extraction, metadata extraction and language identification capabilities.
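As a rough illustration of what identifying a file type from its content involves, the sketch below matches the leading "magic" bytes of a file against a few well-known signatures and returns the corresponding MIME type. It is a simplified, standalone example and not Tika's own implementation or API; the class name `MagicSniffer`, the four signatures chosen, and the `application/octet-stream` fallback are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy magic-byte detector; illustrative only, not Tika's implementation. */
final class MagicSniffer {
    // Leading byte signatures mapped to MIME types (a tiny hand-picked sample).
    private static final Map<byte[], String> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        SIGNATURES.put(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png");
        SIGNATURES.put("GIF8".getBytes(StandardCharsets.US_ASCII), "image/gif");
        SIGNATURES.put(new byte[] {'P', 'K', 3, 4}, "application/zip");
    }

    /** Returns the MIME type whose signature prefixes {@code head}, or a generic fallback. */
    static String detect(byte[] head) {
        for (Map.Entry<byte[], String> entry : SIGNATURES.entrySet()) {
            if (startsWith(head, entry.getKey())) {
                return entry.getValue();
            }
        }
        return "application/octet-stream"; // assumed fallback for unknown content
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample)); // prints application/pdf
    }
}
```

A production detector such as Tika's combines many more signals and a much larger signature set; the point here is only the shape of the lookup.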

It can also extract text from images by using the OCR software Tesseract.^[4]

While Tika is written in Java, it is widely used from other languages.^[5] The RESTful server and the command-line tool give non-Java programs access to Tika's functionality.
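Because the server speaks plain HTTP, any language with an HTTP client can call it. The sketch below uses only the JDK's `java.net.http` client (Java 11 or later) to send a file for text extraction. It assumes a Tika server is already running at `http://localhost:9998` and exposes a `/tika` resource that accepts `PUT` requests and returns plain text when asked via the `Accept` header; verify the address and resource path against the documentation of the server version in use. The class name `TikaServerClient` and the helper `textRequest` are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

/** Minimal client sketch for a Tika server; address and endpoint are assumptions. */
final class TikaServerClient {

    /** Builds a PUT request asking the server's /tika resource for plain text. */
    static HttpRequest textRequest(URI server, HttpRequest.BodyPublisher body) {
        return HttpRequest.newBuilder(server.resolve("/tika"))
                .header("Accept", "text/plain")
                .PUT(body)
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: TikaServerClient <file>");
            return;
        }
        URI server = URI.create("http://localhost:9998"); // assumed server address
        HttpRequest request =
                textRequest(server, HttpRequest.BodyPublishers.ofFile(Path.of(args[0])));
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // extracted text, if the server accepted the file
    }
}
```

Keeping request construction in its own method makes it possible to check the method, URI, and headers without a running server.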

Notable uses

Tika is used by financial institutions including the Fair Isaac Corporation (FICO)^[6] and Goldman Sachs,^[7] by NASA and academic researchers,^[8] and by major content management systems including Drupal^[9] and Alfresco^[10] to analyze large amounts of content and to make it available in common formats using information retrieval techniques.

On April 4, 2016,^[11] Forbes published an article identifying Tika as one of the key technologies used by more than 400 journalists to analyze 11.5 million leaked documents that exposed an international scandal involving world leaders storing money in offshore shell corporations. The leaked documents and the project to analyze them are referred to as the Panama Papers.

References

Original source: https://en.wikipedia.org/wiki/Apache Tika.

[1] "Apache Tika". http://tika.apache.org/.

[2] "Tika Proposal". http://wiki.apache.org/incubator/TikaProposal.

[3] "The Apache Software Foundation". http://tika.apache.org/1.12/formats.html.

[4] "TikaOCR". Apache Tika. 2019-03-26. https://cwiki.apache.org/confluence/display/tika/TikaOCR.

[5] "API Bindings for Tika". Apache Tika. https://wiki.apache.org/tika/API%20Bindings%20for%20Tika.

[6] "FICO to Engage Kaggle's Community of 180,000 Data Scientists to Drive Innovation in the FICO Analytic Cloud | FICO". http://www.fico.com/en/newsroom/fico-to-engage-kaggles-community-of-180000-data-scientists-to-drive-innovation-in-the-fico-analytic-cloud.

[7] "Goldman Sachs Puts Elasticsearch To Work - InformationWeek". InformationWeek. http://www.informationweek.com/software/enterprise-applications/goldman-sachs-puts-elasticsearch-to-work/d/d-id/1321778.

[8] "Studying polar data with the help of Apache Tika". https://opensource.com/life/15/4/interview-annie-burgess-USC-JPL.

[9] "Text Extract for Drupal using Tika | Drupal.org". 30 July 2012. https://www.drupal.org/project/text_extract.

[10] "Content Transformation and Metadata Extraction with Apache Tika - alfrescowiki". 5 June 2015. https://wiki.alfresco.com/wiki/Content_Transformation_and_Metadata_Extraction_with_Apache_Tika.

[11] Fox-Brewster, Thomas. "From Encrypted Drives To Amazon's Cloud -- The Amazing Flight Of The Panama Papers". https://www.forbes.com/sites/thomasbrewster/2016/04/05/panama-papers-amazon-encryption-epic-leak.
