Apache Tika Improves Security
Written by Kay Ewbank   
Monday, 04 April 2022

Apache Tika 2.3 has been released with improvements including security upgrades to several dependencies and a move to Apache POI 5.2.

Tika is a content analysis toolkit for detecting and extracting metadata and text. It can be used to extract metadata from over a thousand different file types, all of which can be parsed through a single interface, making Tika useful for search engine indexing, content analysis and translation.


Tika has a Java library as well as server and command line tools. It uses a number of document parsers and document type detection techniques to detect and extract data.
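One common type detection technique is to compare the first few bytes of a file against known "magic" signatures. The sketch below illustrates that idea using only the Java standard library. It is not Tika's implementation: the class name MagicSniffer, its detect method and the returned labels are all hypothetical, chosen purely for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of magic-byte type detection; this is not Tika's code. */
public class MagicSniffer {

    // Leading byte signatures for three common formats.
    private static final byte[][] SIGNATURES = {
        "%PDF-".getBytes(StandardCharsets.US_ASCII),
        // ZIP local file header; OOXML files such as .docx are ZIP packages.
        {0x50, 0x4B, 0x03, 0x04},
        // OLE 2 compound file header, used by legacy .doc/.xls/.ppt files.
        {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
         (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1},
    };

    // Illustrative labels, index-aligned with SIGNATURES.
    private static final String[] LABELS = {
        "application/pdf",
        "application/zip",
        "application/x-ole-storage",
    };

    /** Returns a label for the first matching signature, or a generic fallback. */
    public static String detect(byte[] prefix) {
        for (int i = 0; i < SIGNATURES.length; i++) {
            if (startsWith(prefix, SIGNATURES[i])) {
                return LABELS[i];
            }
        }
        return "application/octet-stream";
    }

    // True when data begins with every byte of magic, in order.
    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHeader));
        System.out.println(detect(new byte[] {0x50, 0x4B, 0x03, 0x04, 0x14}));
        System.out.println(detect(new byte[] {0x00, 0x01}));
    }
}
```

A signature match alone is coarse: the ZIP header cannot tell a .docx from an ordinary archive, so a real detector has to combine several techniques, such as looking inside the container or taking the file name into account.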

Apache POI used to be part of the Jakarta Project, and provides Java APIs for reading and writing files in the Office Open XML standards (OOXML) and OLE 2 Microsoft Office formats. The move to using POI 5.x in Tika represents a major refactoring, according to the developers, who also say that users may experience significantly more logging.

The new release also includes several security upgrades in dependencies, including a log4j2 upgrade that addresses the known log4j security vulnerabilities.

Most of the other work has been on the Tika parsers, particularly the PDF parser, which now extracts annotation types, subtypes and 3D annotations into metadata. There's a new parser for Translation Memory eXchange (TMX) files, another for IDML, and improved identification of iWorks 13 files that adds parsing for thumbnails, some metadata and attachments.

Tika Config has changes to improve the configuration of maps (key/value attributes) as parameters for parsers. Another change, applied to all the parsers for embedded files, improves consistency in the reporting of package-entry divs. The team says this will lead to some more text, specifically embedded file names, being extracted from files with many embedded attachments.

Tika 2.3 is available now. 


More Information

Tika Website

Related Articles

Apache Tika 2 Adds New Pipes Modules

Apache Kafka 2.7 Updates Broker

Tika in Action

