Google BigQuery: The Definitive Guide
Authors: Valliappa Lakshmanan and Jordan Tigani
Google BigQuery is a distributed, serverless SQL engine that provides a way to query petabytes of data, and it has built-in machine learning. This book by Google insiders aims to show what it can do and how you can do it.
The interesting thing about BigQuery is that the basics are very familiar for database developers - you can get a long way with basic SQL. To get the best from BigQuery, though, you need to think about less conventional topics such as concurrency, and perhaps to move into its machine learning capabilities.
The authors of the book both work for Google, one as head of data analytics and AI solutions, the other as the director of product management for BigQuery. More importantly, the latter was one of the founding engineers working on BigQuery. This means both authors are talking from a position of real familiarity with BigQuery and what it can do.
The book starts with a chapter explaining what BigQuery is and what it can do, including some background on how Google came to develop it and what makes it possible. The authors then move on to describing query essentials, starting with simple queries based on SELECT, filtering rows with WHERE, and adjusting the selected columns with EXCEPT and REPLACE. This chapter works through a range of keywords and principles that would look familiar in any book on SQL - aggregates, joins, and views, for example. The next chapter has a lot of familiarity too - data types, and numeric, string, time and date, and Boolean functions.
From here on, though, the story gets less familiar, as the authors show how to load data into BigQuery, look at federated queries that draw data from multiple data sources, and use Cloud Dataflow to read data from and write data to BigQuery.
The next chapter moves on to developing with BigQuery using the REST API and the Cloud Client Library. This chapter also introduces accessing BigQuery from tools including pandas, Jupyter, and R, as well as the JDBC drivers. The chapter ends with a look at Bash scripting with BigQuery.
Next comes an interesting chapter on the architecture of BigQuery, the life of a query request, the Dremel query engine, and how BigQuery uses storage. The authors then move on to optimizing performance and cost when using BigQuery. This is a long chapter - 60 pages - full of detailed information and code for measuring and troubleshooting query performance, how to increase query speed, and how to optimize where data is stored and accessed.
A meaty chapter on machine learning is next on the agenda, also coming in at 60 pages and including coverage of building a regression model, building a classification model, and k-means clustering, as well as recommender systems and using custom machine learning models. The book ends with a chapter on administering and securing BigQuery.
This is a good book, bristling with practical examples and code, and detailed step-by-step instructions where appropriate. For example, in the chapter on loading data into BigQuery, you're shown how to load from a local source, with discussions of why it's a good idea to compress the file, how to page through the gzipped file from Cloud Shell, and whether to choose loading or streaming, as well as a SQL query to actually query the dataset. In other words, you get a mix of the why as well as the how, with code to follow and modify. If you work your way through the examples in the book, you'll have a good grasp of just what Google BigQuery can do, and why you might want to use it.
Last Updated (Saturday, 28 November 2020)