Apache Tika Improves Security
Written by Kay Ewbank   
Monday, 04 April 2022

Apache Tika 2.3 has been released with improvements including security upgrades to several dependencies and a move to Apache POI 5.2.

Tika is a content analysis toolkit for detecting and extracting metadata and text. It can be used to extract metadata from over a thousand different file types, all of which can be parsed through a single interface, making Tika useful for search engine indexing, content analysis and translation.


Tika has a Java library as well as server and command line tools. It uses a number of document parsers and document type detection techniques to detect and extract data.
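To give a flavour of what type detection involves, here is a toy sketch of one common technique, sniffing the leading "magic" bytes of a file. It is not Tika's code or API: the class name MagicSniffer, the method sniff and the labels it returns are all hypothetical, and real detection in Tika is considerably more elaborate. The three signatures themselves are standard: PDF files begin with "%PDF-", ZIP containers (which Office Open XML files are) begin with "PK" followed by bytes 03 04, and OLE 2 files begin with an eight-byte header.

```java
import java.nio.charset.StandardCharsets;

// Toy illustration only: hypothetical names, not part of Apache Tika.
final class MagicSniffer {
    // Leading "magic" bytes of three well-known formats.
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns an illustrative label for the first bytes of a file.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF)) return "pdf";
        if (startsWith(head, ZIP)) return "zip-container";
        if (startsWith(head, OLE2)) return "ole2-container";
        return "unknown";
    }

    // True if data begins with every byte of prefix, in order.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(sample));
    }
}
```

Magic bytes alone only identify the container. Telling a .docx from a .xlsx, for example, means looking inside the ZIP, which is why a real toolkit combines several detection techniques.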

Apache POI used to be part of the Jakarta Project, and provides Java APIs for reading and writing files in the Office Open XML standards (OOXML) and OLE 2 Microsoft Office formats. The move to using POI 5.x in Tika represents a major refactoring, according to the developers, who also say that users may experience significantly more logging.

The new release also includes several security upgrades in dependencies, including an upgrade to log4j2 to address the known security vulnerabilities in log4j.

Most of the other work has been on the Tika parsers, particularly the PDF parser, which now extracts annotation types, subtypes and 3D annotations into metadata. There's a new parser for Translation Memory eXchange (TMX) files, another for IDML, and improved identification of iWorks 13 files, along with parsing of their thumbnails, some metadata and attachments.

Tika Config has changes to improve the configuration of maps (key/value attributes) as parameters for parsers. Another change has been made to all the parsers for embedded files to improve consistency in the reporting of package-entry divs. The team says this will lead to some more text, specifically embedded file names, in files with many embedded attachments.

Tika 2.3 is available now. 


More Information

Tika Website

Related Articles

Apache Tika 2 Adds New Pipes Modules

Apache Kafka 2.7 Updates Broker

Tika in Action

 
