Authors: Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris
Audience: Java programmers interested in processing text
Reviewer: Alex Armstrong
What do you think a book called "Taming Text" is all about?
It could be about Unicode or advanced regular expressions or ...
It is important to note that these core text technologies are not what this book is about. What it is about is the task of working with text in a semi-intelligent way.
It is about searching and organizing text in a way that makes sense to a human. This is a big task, and it is not confined to explaining how text is represented in a given programming language. It heads in the direction of Artificial Intelligence (AI) but without needing the complete understanding that such text processing might seem to need. It more or less fits into the category of Natural Language Processing (NLP). In general the methods used in current NLP are statistical and not based on any understanding of what the text means.
Chapter 1 starts off by setting the scene - why you might need this sort of text processing. If you are already an NLP enthusiast you probably don't need to read it, but it gets you started nice and easily.
Chapter 2 is where things really get into gear. It explains the workings of language (well, the English language) by working through the useful levels at which text can be examined and providing labels for the different parts of speech. Rather than just being a theory lesson, it also points you in the direction of resources that you can use, for example, to identify parts of speech. It also discusses the problem of actually reading in the text from files in different formats using the first of the many open source programs discussed in the book, Apache Tika.
Chapter 3 deals with the problems of intelligent search using Apache Solr. It is a basic introduction to Solr, how to get it set up and how to customize and optimize it. Chapter 4 moves on to the problems of fuzzy string matching, starting with some of the measures of similarity that you can compute. The ideas are implemented with reference to Solr in particular.
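To give a flavor of what a "measure of similarity" for fuzzy string matching looks like, here is a minimal sketch of Levenshtein edit distance, one commonly used measure. It is not code from the book, and it does not use Solr; the class name EditDistanceSketch and the normalized similarity helper are assumptions made purely for illustration.

```java
// Illustrative sketch only: not taken from the book and independent of Solr.
// The class and method names are hypothetical.
final class EditDistanceSketch {

    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    // Uses the standard dynamic-programming recurrence with two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j chars of b costs j inserts.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i chars of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Swap rows so that prev always holds the most recently completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1.0 for identical strings, 0.0 when every
    // character position would have to change.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(similarity("taming", "timing"));
    }
}
```

A search engine would not normally compare a query against every indexed term this way; the sketch only shows the kind of score that fuzzy matching is built on.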
Chapter 5 is called "Identifying people, places and things" and it discusses the named entity recognition problem. This is our first introduction to OpenNLP. Chapter 6 covers clustering text using a range of methods and tools, including Carrot2 and Mahout, to implement k-means. Chapter 7 extends this to classification using Lucene.
In Chapter 8 we discover what the object of the entire exercise has been: it details the implementation of an example question answering system. To find out much about it you are going to have to run the code provided at the book's website.
The final chapter considers the future of the technology including a quick look at working with other languages, sentiment analysis and the long term goal of semantic analysis.
This is not a textbook, nor is it a research monograph. It is aimed at programmers who need to understand enough about NLP to build an intelligent question answering system or similar. You will learn the theory as you go along, but it is all explained in fairly plain language and via programming examples. You will need to program in Java, and the tools are, in the main, Java oriented. If you are not a Java programmer you can understand the ideas presented but you will probably struggle to get the examples working. The book is also based on open source tools that are part of the Java ecosystem - for example Solr, Lucene, Tika, Mahout and so on. If you plan to use other tools or other languages then the book will be of less use.
Don't expect the book to show you how to implement complete text understanding, or how to build a system like IBM's Watson question-answering machine. What it does give you is a very good and very practical overview of what you can achieve fairly easily and with moderate resources.
It is a good Java-oriented introduction to NLP and, as such, recommended.