Taming Text

Author: Grant S. Ingersoll, Thomas S. Morton and Andrew L. Farris
Publisher: Manning
Pages: 320
ISBN: 978-1933988382
Audience: Java programmers interested in processing text
Rating: 4.5
Reviewer: Alex Armstrong

What do you think a book called "Taming Text" is all about? 

It could be about Unicode or advanced regular expressions or ...

It is important to note that these core text technologies are not what this book is about. Instead, it is about the task of working with text in a semi-intelligent way.

It is about searching and organizing text in a way that makes sense to a human. This is a big task, and it goes well beyond explaining how text is represented in a given programming language. It heads in the direction of Artificial Intelligence (AI) but without needing the complete understanding that such text processing might seem to require. It more or less fits into the category of Natural Language Processing (NLP). In general, the methods used in current NLP are statistical and not based on any understanding of what the text means.




Chapter 1 starts off by setting the scene: why you might need this sort of text processing. If you are already an NLP enthusiast you probably don't need to read it, but it gets you started nice and easily.

Chapter 2 is where things really get into gear. It explains the workings of language, or at least the English language, by working through the useful levels at which you can look at text and providing labels for the different parts of speech. Rather than just being a theory lesson, it also points you to resources that you can use, for example, to identify parts of speech. It also discusses the problem of actually reading in text from files in different formats, using the first of the many open source programs discussed in the book, Apache Tika.




Chapter 3 deals with the problems of intelligent search using Apache Solr. It is a basic introduction to Solr: how to get it set up and how to customize and optimize it. Chapter 4 moves on to the problems of fuzzy string matching, starting with some of the measures of similarity that you can compute. The ideas are implemented with reference to Solr in particular.
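To give a flavor of what a "measure of similarity" between strings can look like, here is a minimal sketch of Levenshtein edit distance, one widely used measure. It is not taken from the book or from Solr, and the review does not say which measures Chapter 4 actually covers; the class name EditDistance and its two methods are hypothetical names chosen for illustration.

```java
// Illustrative sketch only: EditDistance, levenshtein and similarity are
// hypothetical names, not code from the book or an API from Solr/Lucene.
final class EditDistance {

    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the previous row
        int[] curr = new int[b.length() + 1]; // distances for the current row
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Scales the distance into a score between 0.0 (nothing in common)
    // and 1.0 (identical), so strings of different lengths can be compared.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) levenshtein(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(similarity("taming", "timing"));
    }
}
```

A fuzzy matcher built on this idea would typically accept a candidate when its similarity score is above some threshold; the right threshold depends on the data and is something you would have to tune.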

Chapter 5 is called "Identifying people, places and things" and it discusses the named entity recognition problem. This is our first introduction to OpenNLP. Next, in Chapter 6, we find out about clustering text using a range of methods and tools, including Carrot2 and Mahout to implement k-means. Chapter 7 extends this to classification using Lucene.

In Chapter 8 we discover what the object of the entire exercise has been in that it details the implementation of an example question answering system. To find out much about it you are going to have to run the code provided at the book's website.

The final chapter considers the future of the technology, including a quick look at working with other languages, sentiment analysis and the long-term goal of semantic analysis.

This is not a textbook, nor is it a research monograph. It is aimed at programmers who need to understand enough about NLP to build an intelligent question answering system or similar. You will learn the theory as you go along, but it is all explained in fairly plain language and via programming examples. You will need to program in Java, and the tools are in the main Java-oriented. If you are not a Java programmer you can understand the ideas presented, but you will probably struggle to get the examples working. The book is also based on open source tools that are part of the Java ecosystem - for example Solr, Lucene, Tika, Mahout and so on. If you plan to use other tools or other languages then the book will be of less use.

Don't expect the book to show you how to implement complete text understanding, or to show you how to build a system like IBM's Watson question-answering machine. It gives you a very good and very practical overview of what you can achieve fairly easily and with moderate resources.

It is a good Java-oriented introduction to NLP and as such recommended. 




Last Updated ( Wednesday, 04 December 2013 )