Elasticsearch: The Definitive Guide
Authors: Clinton Gormley and Zachary Tong
Elasticsearch is gaining popularity as a distributed search engine, coming in at second place for enterprise search engines after Solr, partially because developers like its RESTful interface and schema-free JSON documents, along with the way indexes can be sharded and replicated if necessary.
This book was written by two people who know it back to front: Clinton Gormley was the first user of Elasticsearch and wrote the Perl API for it, and Zachary Tong is a developer at Elasticsearch. To live up to its claim to be a Definitive Guide, it takes you from getting started through to advanced topics in 46 chapters arranged into seven parts.
Part I introduces Elasticsearch, showing how to get data into and out of it, and how it interprets the data in your documents. There are chapters on the basic search tools, mapping and analysis, full-body search, and sorting and relevance. The final three chapters in this section cover distributed searches, index management, and what goes on inside a shard. For many people, this part of the book will be as much as they’ll ever need, but the authors are just getting into their stride.
Part II covers search in depth, and is really interesting (if you happen to like data, that is). The authors take a more detailed look at structured searches and full-text search, before chapters on multifield search, proximity matching, partial matching, and controlling relevance. If you want to use Elasticsearch on ‘standard’ data, this section gives you an excellent lowdown.
Part III looks at language analysis. Elasticsearch has a collection of language analyzers for the most common languages you’re likely to encounter, from Arabic to Thai, with some that are surprising – Basque and Kurdish, for example. The analyzers tokenize the text into individual words, remove common stopwords, and stem tokens to their root form. There are chapters on each of these topics, along with synonyms and the wryly titled ‘typoes and mispelings’. In each case, the authors describe how the standard analyzers work, and what you can do to avoid problems with words or phrases that won’t be automatically dealt with. If you need to sort a German phonebook or set up your own list of synonyms, there’s the code to show how to do it.
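To give a feel for the three analysis steps mentioned above (tokenize, remove stopwords, stem), here is a minimal Python sketch. It is not taken from the book and is not how Elasticsearch implements its analyzers; the stopword list, the suffix-stripping rule, and the names `STOPWORDS`, `toy_stem` and `analyze` are all assumptions made purely for illustration.

```python
import re

# Illustrative stopword list only; real analyzers ship much longer,
# language-specific lists.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}


def toy_stem(token):
    """Strip one common English suffix (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase and tokenize, drop stopwords, then stem each remaining token."""
    tokens = re.findall(r"\w+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [toy_stem(t) for t in tokens]


if __name__ == "__main__":
    print(analyze("The foxes jumped in the gardens"))
```

A real stemmer handles far more cases than this suffix list, which is one reason the book spends whole chapters on the subject.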
Aggregations are handled in the next part of the book. The techniques described show how you can ‘zoom out’ to get an overview of your data. The authors say aggregations let you ask questions such as ‘how many needles are added to the haystack each month?’, or ‘what is the average length of the needles?’ There are chapters on building bar charts, looking at time, scoping aggregations, and filtering queries and aggregations. There’s an interesting chapter on sorting multivalue buckets in different sort orders, and the chapter on approximate aggregations gives a really clear explanation of the problem of choosing the right aggregation algorithm for dealing with different types of distributed data. The section closes with chapters on significant terms and on controlling memory use and latency.
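The two needle questions boil down to grouping documents into buckets and computing a metric over them. The following Python sketch shows that idea on a small in-memory list; it does not use Elasticsearch at all, and the sample documents, the field names `added` and `length_mm`, and both function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample documents; field names are assumptions for illustration.
needles = [
    {"added": "2015-01-14", "length_mm": 40.0},
    {"added": "2015-01-30", "length_mm": 50.0},
    {"added": "2015-02-02", "length_mm": 60.0},
]


def needles_per_month(docs):
    """Bucket documents by the YYYY-MM prefix of their date and count each bucket."""
    buckets = defaultdict(int)
    for doc in docs:
        buckets[doc["added"][:7]] += 1
    return dict(buckets)


def average_length(docs):
    """Compute a single metric over all documents."""
    return mean(doc["length_mm"] for doc in docs)


if __name__ == "__main__":
    print(needles_per_month(needles))
    print(average_length(needles))
```

In Elasticsearch itself the same questions are expressed as aggregations in the search request and computed across shards, which is where the approximate algorithms the book discusses come in.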
Geolocation is the subject of the next part of the book. The initial chapter describes the use of latitude and longitude co-ordinates, and how to query with bounding boxes and range filters. Geo-hashes and geo-aggregations are dealt with next. Geo-hashes let you encode lat/long points as strings, and while they started as a way to specify geo-locations in URLs, they are now used for indexing geo-points in databases. The section ends with a look at geo-shapes, and how you need to treat them differently to individual geo-points.
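For readers who have not met geo-hashes before, the sketch below shows one way the standard encoding can be written in Python: the longitude and latitude ranges are halved alternately, each decision yields one bit, and every five bits become one base-32 character. This is an illustration written for this review, not code from the book or from Elasticsearch, and the function name `geohash_encode` and its `precision` parameter are assumptions.

```python
# Geohash base-32 alphabet (digits plus letters, omitting a, i, l and o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude

    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1  # upper half: emit a 1 and raise the lower bound
            rng[0] = mid
        else:
            bits = bits << 1        # lower half: emit a 0 and lower the upper bound
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:          # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0

    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(57.64911, 10.40744, precision=6))
```

Because each extra character only narrows the same cell further, a shorter geohash is always a prefix of a longer one for the same point, which is what makes the strings convenient for indexing.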
The authors next move on to data modeling, pointing out that while Elasticsearch treats all data as flat rather than relational, relationships matter and you need to find ways to model joins, nested objects, and parent/child relationships. Each of these topics gets a chapter, along with ways to design for handling scale, with particular reference to user-based data and to time-based data such as logs, where relevancy is driven by recency. The book closes with a short section on administration, monitoring and deployment.
This is a well written and informative book. The chapters are short and to the point, and there’s plenty of code to show you how to achieve specific objectives. If you need to know about Elasticsearch and how to use it, this is the book you need.
Last Updated (Monday, 09 March 2015)