Apache Tika Improves Security
Written by Kay Ewbank   
Monday, 04 April 2022

Apache Tika 2.3 has been released with improvements including security upgrades to several dependencies, and a move to using Apache POI 5.2.

Tika is a content analysis toolkit for detecting and extracting metadata and text. It can be used to extract metadata from over a thousand different file types, all of which can be parsed through a single interface, making Tika useful for search engine indexing, content analysis and translation.


Tika has a Java library as well as server and command line tools. It uses a number of document parsers and document type detection techniques to detect and extract data.
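To give a feel for what one kind of type detection involves, here is a minimal sketch in plain Java of matching a file's leading "magic" bytes against known signatures. It is not Tika's code and does not use Tika's API: the class name MagicSniffer, the method detect and the three-entry signature table are assumptions made purely for illustration, and a real detector combines many more signals than this.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of magic-byte detection; not part of Apache Tika.
public class MagicSniffer {

    // Signature table: type label -> bytes expected at the start of the file.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();

    static {
        // PDF files begin with the ASCII text "%PDF-".
        SIGNATURES.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        // ZIP containers (OOXML files such as .docx are ZIP-based) begin with "PK\3\4".
        SIGNATURES.put("application/zip", new byte[] {0x50, 0x4B, 0x03, 0x04});
        // OLE 2 compound documents (legacy .doc, .xls); the label here is illustrative.
        SIGNATURES.put("application/x-ole-storage", new byte[] {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1});
    }

    // Returns the label of the first signature that prefixes the given bytes,
    // or a generic binary type when nothing matches.
    public static String detect(byte[] head) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample));
    }
}
```

A signature match like this only narrows things down. A ZIP header, for example, could belong to a .docx, an .epub or a plain archive, which is one reason a toolkit of this kind needs more than one detection technique.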

Apache POI used to be part of the Jakarta Project, and provides Java APIs for reading and writing files in the Office Open XML standards (OOXML) and OLE 2 Microsoft Office formats. The move to using POI 5.x in Tika represents a major refactoring, according to the developers, who also say that users may experience significantly more logging.

The new release also includes several security upgrades in dependencies, including an upgrade of log4j2 to address the known log4j security vulnerabilities.

Most of the other work has been on the Tika parsers, particularly the PDF parser, which now extracts annotation types, subtypes and 3D annotations into metadata. There's a new parser for Translation Memory eXchange (TMX) files, another for IDML, and improved identification of iWorks 13 files, along with parsing of their thumbnails, some metadata and attachments.

Tika Config has changes to improve the configuration of maps (key/value attributes) as parameters for parsers. Another change has been made to all the parsers for embedded files to improve consistency in the reporting of package-entry divs. The team says this will lead to some more text, specifically embedded file names, in the output for files with many embedded attachments.

Tika 2.3 is available now. 


More Information

Tika Website

Related Articles

Apache Tika 2 Adds New Pipes Modules

Apache Kafka 2.7 Updates Broker

Tika in Action

 
