Apache Tika Improves Security
Written by Kay Ewbank
Monday, 04 April 2022
Apache Tika 2.3 has been released with improvements including security upgrades to several dependencies, and a move to using Apache POI 5.2.
Tika is a content analysis toolkit for detecting and extracting metadata and text. It can be used to extract metadata from over a thousand different file types, all of which can be parsed through a single interface, making Tika useful for search engine indexing, content analysis and translation.
Tika has a Java library as well as server and command line tools. It uses a number of document parsers and document type detection techniques to detect and extract data.
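One of the simplest type detection techniques is checking a file's leading "magic" bytes against a table of known signatures. The sketch below illustrates that idea using only the Java standard library. It is not Tika's code or API: the class name `MagicSniffer`, its three-entry signature table and the label used for OLE 2 files are all assumptions made for illustration, and a real detector would combine many more signatures with other clues such as file names and container contents.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: not Tika's API or implementation.
class MagicSniffer {

    // Media type -> leading byte signature. A tiny, illustrative table.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();

    static {
        // PDF files begin with the ASCII text "%PDF-".
        SIGNATURES.put("application/pdf",
                "%PDF-".getBytes(StandardCharsets.US_ASCII));
        // ZIP containers (which OOXML files are) begin with "PK" 0x03 0x04.
        SIGNATURES.put("application/zip",
                new byte[] {'P', 'K', 3, 4});
        // OLE 2 compound documents begin with D0 CF 11 E0 A1 B1 1A E1.
        // The media type label here is chosen for illustration.
        SIGNATURES.put("application/x-ole-storage",
                new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1});
    }

    // Returns the first media type whose signature matches the start of
    // the data, or a generic binary type when nothing matches.
    static String detect(byte[] head) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    // True when data is at least as long as prefix and begins with it.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample));
    }
}
```

A signature match on its own only identifies the container: a ZIP signature cannot distinguish a Word document from a plain archive, which is one reason a toolkit like Tika needs more than one detection technique.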
Apache POI used to be part of the Jakarta Project, and provides Java APIs for reading and writing Microsoft Office files in both the Office Open XML (OOXML) and OLE 2 formats. The move to using POI 5.x in Tika represents a major refactoring, according to the developers, who also say that users may experience significantly more logging.
The new release also includes several security upgrades in dependencies, including an upgrade to log4j2 to address the known security vulnerabilities in log4j.
Most of the other work has been on the Tika parsers, particularly the PDF parser, which now extracts annotation types, subtypes and 3D annotations into metadata. There's a new parser for Translation Memory eXchange (TMX) files, another for IDML, and improved identification of iWorks 13 files, along with added parsing of their thumbnails, some metadata and attachments.
Tika Config has changes to improve the configuration of maps (key/value attributes) as parameters for parsers. Another change, made to all the parsers for embedded files, improves consistency in the reporting of package-entry divs. The team says this will lead to some more text, specifically embedded file names, being extracted from files with many embedded attachments.
Tika 2.3 is available now.