Getting Started with Impala

Author: John Russell
Publisher: O'Reilly
Pages: 110

ISBN: 9781491905779
Print: 1491905778
Kindle: B00NX1L9OE
Audience: Analysts, developers and users
Rating: 4.5

Reviewer: Ian Stirk

 

This book aims to get you up-and-running with Impala – a tool for quickly querying Hadoop’s big data. How does it fare?

It is targeted at analysts, developers and users who are new to querying Hadoop’s big data. Experience with SQL and databases would be useful, but is not mandatory. This is a short book, containing 110 pages split into five chapters.

Chapter 1 Why Impala?

The chapter opens with a look at Impala’s place within Hadoop’s ecosystem of components. Big data systems store massive amounts of data, and querying this data is typically a batch process. Impala can often query this data within seconds, or minutes at most, giving a near ‘interactive’ response. Additionally, compared with traditional big data approaches, such as writing Java MapReduce code, Impala provides a much faster development cycle.

The chapter continues with a look at how Impala readily integrates with Hadoop’s components, security, metadata, storage (HDFS), and file formats. Impala can perform complex analysis using SQL queries that can calculate, filter, sort and format.

Next, it’s suggested you may need to change the way you work with big data. Previously, queries ran in batch mode, which often forced a context-switch as you moved to other tasks while waiting for results. This view changes with Impala, which often provides a near interactive experience.

The chapter ends with a look at Impala’s flexibility: it can work with raw data files in many different formats. This means there are fewer steps than in traditional processing (i.e. no need for filtering or conversion of data), so arriving at solutions is faster.

This chapter provided a useful introduction to Impala, describing what it is, what it’s used for, and giving its advantages: quick and easy development, fast queries, and integration with existing Hadoop components.  


Chapter 2 Getting Up and Running with Impala

This chapter opens with the various ways of installing Impala, giving the advantages of each. The methods are: 

  • Cloudera Live Demo – easiest, no installation. Enter Impala queries via Hue editor, create tables, load data, or query some existing tables

  • Cloudera QuickStart VM – single user, single-machine, can install full Hadoop stack via Cloudera’s Manager. Interact via impala-shell or ODBC and JDBC interfaces

  • Cloudera Manager and CDH 5 – install over a real distributed system

  • Manual installation – needs to be applied to each node in the cluster, so is expensive in time and effort

  • Building from source – for understanding Impala at a deep level 

The chapter continues with a look at connecting to Impala. The book concentrates on connecting via the impala-shell. Examples are provided of connecting to the localhost (the default) and to a remote box; the use of a non-default port is also discussed.

The chapter ends with some sample SQL queries to run. The initial queries do not have a FROM clause, so they don’t access any tables; these queries are especially useful for checking that Impala’s installation and configuration are correct. SQL is also provided to create tables and insert data into a table.
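
As a sketch of the kind of first queries the chapter describes (the greetings table is an illustrative example, not one from the book):

    -- Queries with no FROM clause touch no tables, so they are
    -- a quick check that Impala is installed and configured correctly.
    SELECT version();
    SELECT now();
    SELECT 2 + 2;

    -- Create a small table, insert a row, and read it back.
    CREATE TABLE greetings (id INT, message STRING);
    INSERT INTO greetings VALUES (1, 'Hello, Impala');
    SELECT * FROM greetings;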

This chapter provides practical detail on installing and connecting to Impala, via various setup methods. The chapter contains helpful first SQL queries to get you started. There are helpful links to Impala’s discussion forums, mailing list, and community resources.

I was surprised the use of Hue to run Impala queries wasn’t examined, since this tool provides a centralized, user-friendly web interface to many of Hadoop’s tools – a boon to all users.

 

Chapter 3 Impala for the Database Developer

The chapter opens with a look at Impala’s SQL language, which contains familiar topics like joins, views, and aggregations. There’s a useful link to Impala’s SQL Language Reference. Various data types are briefly discussed. The EXPLAIN statement can be used to show how the SQL is executed. Various limitations in the language are highlighted (e.g. no indexes or constraints), though if you come from a data warehouse background you’ll appreciate these are often not limitations. There’s a link to Impala’s new features documentation – it’s recommended to check this regularly for updates.
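
To illustrate, a query prefixed with EXPLAIN shows its execution plan without running it (the sales table here is hypothetical):

    -- Show how Impala would execute the query, without running it.
    EXPLAIN
    SELECT customer_id, COUNT(*) AS orders
    FROM sales
    WHERE year = 2014
    GROUP BY customer_id;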

The chapter proceeds with a look at big data considerations. Here, big data is taken to mean at least gigabytes of data, containing billions of rows. It’s suggested that queries are first tested on a subset of data using the LIMIT clause; if the query output looks correct, the query can then be run against the whole dataset. It’s noted that if you come from a traditional transactional database background, you may need to unlearn a few things: indexes are less important, there are no constraints or foreign keys, and denormalization is good.
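
A sketch of the test-on-a-subset approach, again using a hypothetical sales table:

    -- Check the query logic against a small sample first...
    SELECT * FROM sales WHERE region = 'EMEA' LIMIT 100;
    -- ...then remove the LIMIT to run against the full dataset.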

The physical and logical aspects of the data are discussed next. HDFS has a large block size (128MB by default); large files are broken into blocks, which are stored multiple times (3 copies by default) across the servers in the cluster. This, together with parallel processing, ensures the data is queried quickly.

Next, the execution steps of a distributed query are discussed, with Hadoop doing all the complex work. A brief review of normalized and denormalized data is given. Denormalized data is more typical of a data warehouse, often giving better performance via fewer, wider tables.

Various file formats are examined, and the advantages of each discussed. Parquet is often the preferred file format, since its column-oriented layout suits many analytic queries and is amenable to compression. It is possible to switch between the various file formats.
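
A sketch of switching a table to Parquet by copying it (table names are illustrative):

    -- Copy an existing text-format table into a new Parquet table.
    CREATE TABLE sales_parquet STORED AS PARQUET
    AS SELECT * FROM sales_text;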

The chapter ends with a brief look at aggregations, via the GROUP BY clause. It’s also possible to aggregate smaller files into larger ones (this may improve performance).
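
For example, a simple aggregation over a hypothetical sales table might look like this:

    -- Total orders and revenue per region, largest first.
    SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM sales
    GROUP BY region
    ORDER BY revenue DESC;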

If you come from a database background, this chapter is easier to understand, and you can probably use Impala immediately. This chapter puts your existing database knowledge (e.g. views) into a familiar context, while noting you might need to unlearn certain things (indexes, constraints). The chapter discussed some interesting HDFS topics, including block size, replication, and reliability – which, together with the size of the data tables, impact performance. Impala’s roadmap link is useful for discovering forthcoming features (e.g. INTERSECT, ROLLUP, MERGE are expected in 2015).

 


 

Chapter 4 Common Developer Tasks for Impala

Having given a background on what Impala is, and how it can be used, this chapter moves on to a set of common tasks that you’re sure to hit as you proceed with Impala.

The chapter opens with a look at getting data into an Impala table. This starts with a look at using INSERT ... SELECT, and the use of INSERT OVERWRITE to replace data. The section continues with a look at LOAD DATA to move data from HDFS into Impala. Impala can query Hive tables, but you need to issue INVALIDATE METADATA tableName to make Impala aware of the table. Sqoop can be used to move data from a database into Impala; a brief outline of Sqoop’s functionality is given.
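
A sketch of these statements, with illustrative table and path names:

    -- Append rows from a staging table.
    INSERT INTO sales SELECT * FROM sales_staging;

    -- Replace the table's existing contents.
    INSERT OVERWRITE TABLE sales SELECT * FROM sales_staging;

    -- Move data files already in HDFS into the table's directory.
    LOAD DATA INPATH '/user/demo/sales_extra' INTO TABLE sales;

    -- Make Impala aware of a table created via Hive.
    INVALIDATE METADATA sales_from_hive;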

The next task looks at porting existing SQL code to Impala. It’s noted that most code should port unchanged, but the Data Definition Language (DDL) in particular will need to change. Other changes are likely needed for deletions, updates, and transactions (all are currently missing from Impala).

Using Impala from a JDBC or ODBC application is examined next. The main point seems to be that you must remember to close your connections, else you can expect a call from your administrator!

It’s possible to use Impala from various scripting languages, including Python, Perl, and Bash. A small example script is provided. A later section discusses writing User-Defined Functions (UDFs) in C++ (for speed) or Java; these functions can then be used in your Impala SQL queries.
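
A sketch of how a compiled UDF might be registered and called (the library path, symbol, and table are hypothetical):

    -- Register a C++ UDF from a shared library stored in HDFS.
    CREATE FUNCTION my_upper(STRING) RETURNS STRING
    LOCATION '/user/demo/udfs/libmyudfs.so' SYMBOL='MyUpper';

    -- Use it like any built-in function.
    SELECT my_upper(name) FROM customers LIMIT 10;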

Next, various Impala optimizations are examined, including: 

  • Optimizing query performance – run COMPUTE STATS when data changes significantly. Consider partitioning data. Choose an efficient file format; Parquet is usually recommended. (A sketch of these tips follows this list.)

  • Optimizing memory usage – be aware ORDER BY, GROUP BY, UNION, DISTINCT all require more memory. Consider the impact of data type on memory. LIMIT-ing data can be good.

  • Use partitioned tables – act like indexes, providing fast bulk reads. Year or country are often used as partition keys. Often use one less level of partitioning than you would in other databases. 
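
A minimal sketch of the statistics and partitioning tips above, using illustrative table names not taken from the book:

    -- Gather statistics after a significant data change.
    COMPUTE STATS sales;

    -- A partitioned table: queries that filter on year and month
    -- read only the matching partitions.
    CREATE TABLE sales_by_month (id BIGINT, amount DOUBLE)
    PARTITIONED BY (year INT, month INT)
    STORED AS PARQUET;

    INSERT INTO sales_by_month PARTITION (year=2014, month=12)
    SELECT id, amount FROM sales_staging;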

The chapter ends with a discussion about collaborating with your administrator. During development you’ll typically have freedom to do things the way you want. However, your organization will typically have a preferred way to do things, into which you’ll need to integrate. You will save yourself some stress if you consider the following during development: 

  • Design for security – prod and dev have different permissions

  • Understand resource management – memory usage of certain queries or user types

  • Performance planning – run COMPUTE STATS when data volumes change, use HDFS caching for hot data, use partitions

  • Cluster technology – develop on a single node, test on limited set of nodes, prod will be fully distributed 

This was an instructive chapter, answering many of the questions you’re sure to ask as you progress with your Impala work. The section on performance tips was particularly useful (e.g. no indexes or constraints, use partitions, use an optimal file format). Additionally, integrating with other systems via JDBC and ODBC should prove helpful for your development.

 

Chapter 5 Tutorials and Deep Dives

This chapter looks similar to the previous chapter, discussing typical concerns a new user of Impala might face; however, it delves much deeper into Impala’s functionality. Topics covered include: 

  • Tutorial: From Unix Data File to Impala Table (create text file in Linux, load file into Hadoop/HDFS, create Impala database, create and load Impala table with data in HDFS)

  • Tutorial: Queries Without a Table (some useful queries, for testing out your SQL locally before applying it to the whole dataset; not parallelized or distributed)

  • Tutorial: The Journey of a Billion Rows (looks at impact of processing 1 billion rows. Generate 1 billion rows of CSV data. Create Impala table. Load data. Impact of Parquet format shown. Impact of partitioning shown)

  • Deep Dive: Joins and the Role of Statistics (use COMPUTE STATS after a big load. Joins a million-row table to a billion-row table; the EXPLAIN plan shows the missing stats as the first thing. Looks at the impact of the Parquet file format, normalized data, and partitioning)

  • Tutorial: Across the Fourth Dimension (time formats. The TIMESTAMP data type holds a date, a time, or both. The TRUNC function extracts the year, quarter, week etc. INTERVAL expressions add or subtract time periods. Example code. The Y2K problem. A sketch follows this list) 
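
As a sketch of the TRUNC and INTERVAL techniques that tutorial covers:

    -- Truncate a timestamp and shift it by time periods.
    SELECT now() AS right_now,
           trunc(now(), 'Q') AS start_of_quarter,
           now() + INTERVAL 3 DAYS AS three_days_ahead,
           now() - INTERVAL 2 YEARS AS two_years_ago;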

This chapter provides plenty of deeper content, which at first might seem a little strange in an introductory book. Again there are many topics covered that you’ll surely want to investigate further (it seems performance is a common concern on all systems, even big data systems). 

Conclusion

This short book aims to get you up-and-running with Impala, and succeeds commendably. Throughout, there are helpful explanations, screenshots, practical code examples, inter-chapter references, and links to websites for further information. It’s packed with useful instructions, but some sections could benefit from more code examples.

This book is suitable for analysts, developers and users who are starting out with Impala. Although aimed at the beginner, several later sections contain more advanced topics (e.g. performance). If you have a background in SQL, you will have a head start, and if you know about data warehousing, the book is even easier to understand. 

The world of Hadoop and its components changes frequently, so be sure to check out Impala’s latest functionality on the Cloudera site.

Impala is a popular tool for querying Hadoop’s data quickly, much more quickly than many other tools. Additionally, the development cycle for Impala queries is much shorter than for comparable approaches such as Java MapReduce processing. I would suggest Impala should be your first choice for querying data, even if the underlying data is stored in some other component (e.g. Hive).

Obviously there is much more to learn about Impala than what’s given in this small book, but this book is a great place to start learning. Highly recommended.
