Author: Chris Mattmann & Jukka Zitting
Aimed at: Java programmers
Pros: In depth
Cons: Not really hands on
Reviewed by: Alex Armstrong
Tika is an Apache Java toolkit that lets you work with files in different formats. The authors of this book refer to it as a "Babel fish" of file formats, and if you know "The Hitchhiker's Guide to the Galaxy" you will recognize that this is no small claim.
The book is divided into four parts: Getting Started - The case for the digital Babel fish; Tika in Detail; Integration and Advanced Use; and, finally, Case Studies.
For a book that is about a specific technology, it manages to stay fairly theoretical. Part 1 is just a general discussion of the file format problem, not at the level of "PDFs are difficult" or "look out for the version bytes in early doc files", but as a sort of abstracted look at the whole format problem. The solution is, fairly obviously, MIME. The only downside of MIME is the fact that it has "Mail" in its name, which makes everyone think that it is only of use in the email context. MIME is more than something you use in an email; it is a general classification scheme for data formats and it is used by Tika to identify and manipulate files.
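To make the idea of MIME as a general classification scheme concrete, here is a minimal sketch that maps the first few bytes of a file to a media type. It uses only the JDK's own `URLConnection.guessContentTypeFromStream`, not Tika, and the class name `MagicSniffer` and its `sniff` method are hypothetical names invented for this illustration. The JDK recognizes only a small set of signatures, so treat this as an illustration of the principle rather than a picture of how Tika's detector works.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;

// Hypothetical helper for illustration only; not part of Tika or the book.
class MagicSniffer {

    // Returns a MIME type guessed from the leading bytes of a file,
    // or null if the JDK does not recognize the signature.
    static String sniff(byte[] head) {
        // ByteArrayInputStream supports mark/reset, which the JDK method requires.
        try (InputStream in = new ByteArrayInputStream(head)) {
            return URLConnection.guessContentTypeFromStream(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The standard 8-byte PNG signature.
        byte[] png = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        // The start of a GIF89a header.
        byte[] gif = {'G', 'I', 'F', '8', '9', 'a'};

        System.out.println(sniff(png)); // image/png
        System.out.println(sniff(gif)); // image/gif
    }
}
```

The point is that the result is a plain media type string such as `image/png`, which has nothing to do with email.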
In the same section we have a getting started with Tika chapter: download, install and use. The section closes with a look at information in the widest possible setting, covering AI, search engines and the size of the web. This part of the book spends more time on wider, non-Tika-focused issues than it does on using Tika or on code, and this is fairly representative of the rest of the book. This is not a cookbook or a code-focused book.
Part 2 is an in-depth look at Tika that also ranges well beyond its basic material. The first chapter looks at MIME and media types with reference to the Tika type detector. Then it is on to content extraction and parsers, with a final example of exporting to XHTML. Chapter 6 is about metadata, Chapter 7 is on language detection and Chapter 8 is called "What's in a file", a question you might have expected to be raised and answered earlier than this. Missing from the discussion are the main "container" formats such as TIFF and AVI, but then Tika doesn't support these at the moment.
Part 3 is about integration and advanced use. The first chapter is about using Tika in search engines, which is an easy application as this is what Tika was designed for, and Chapter 10 looks at its use with its natural partner, the Lucene search engine.
The final part is composed of case studies - Powering NASA science data systems, Content management with Apache Jackrabbit and Curating cancer research data with Tika.
Overall, the book tends to be a theoretical rather than a hands-on view of using Tika. If you are looking for code you might be disappointed. There is some, but not as much as you might expect from an "In Action" book. From my point of view, what is missing is any serious discussion of making use of Tika in a more general environment. It would have been nice to discover how easy it is to use Tika from a C/C++ or C# program, say. The book is also light on exactly which file formats Tika can tackle. This is information you can find on the website, but it would have been nice to have a bit more detail on the practicalities of using Tika. In fact, Tika isn't quite the Babel fish that you might expect. The number of formats supported as standard isn't that huge and there are few legacy formats, so don't expect it to deal with WordStar or WordPerfect documents, for example. This isn't the impression given by the glowing write-up in the book.
If you are looking for a book that talks mostly about the bigger issues with a side order of actually using Tika, then this might suit. If you prefer the cookbook style that takes you inside Tika and shows how useful it might be in a real project, then you are probably better off with the documentation.