Apache Flume 2nd Ed
Author: Steve Hoffman
This book discusses Flume, a popular tool for moving log data into Hadoop. How does it fare?
This book is aimed at people responsible for getting data from various sources into Hadoop. No previous experience of Flume is assumed. The book does assume a basic knowledge of Hadoop and Hadoop Distributed File System (HDFS), and some Java if you want to make use of any custom implementations.
Hadoop is the most popular platform for processing big data. However, before processing can occur, data needs to get into Hadoop. Sqoop is the tool used to import data from relational databases into Hadoop, and Flume is used to move other data, typically log files, into Hadoop. Most Flume development revolves around configuration settings.
Below is a chapter-by-chapter exploration of the topics covered.
Chapter 1 Overview and Architecture
This chapter opens with a look at the origins of Flume, created by Cloudera before being passed to Apache in 2011. Flume was then refactored, resulting in simpler on-disk configuration files that are easily managed by various tools such as Chef and Puppet.
The chapter continues with a brief look at sources (inputs), sinks (outputs), and channels (linking sources to sinks). Everything runs inside a daemon called a Flume agent. A helpful diagram showing this is provided.
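As a sketch of how these pieces wire together, a minimal agent definition might look like the fragment below (the agent and component names are illustrative, not taken from the book):

```properties
# Declare the components of an agent named "agent1"
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Wire the source to the channel, and the sink to the same channel.
# Note the plural "channels" for sources (a source can feed several
# channels) versus the singular "channel" for sinks.
agent1.sources.src1.channels = ch1
agent1.sinks.sink1.channel = ch1
```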
Next, Flume events are discussed; an event consists of zero or more headers and a body that holds the actual data. An interceptor can inspect and alter Flume events, and interceptors can be chained together. Channel selectors determine how data is moved from source to channel (e.g. written to many channels). A sink processor provides a means of creating failover paths for sinks, or of load balancing events across sinks.
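Chaining interceptors is itself just configuration. A sketch using two standard interceptor types, timestamp and static (the names ts, env, and the header key/value are illustrative):

```properties
# Run the timestamp interceptor first, then the static interceptor
agent1.sources.src1.interceptors = ts env

# Adds a "timestamp" header to each event
agent1.sources.src1.interceptors.ts.type = timestamp

# Adds a fixed header to each event
agent1.sources.src1.interceptors.env.type = static
agent1.sources.src1.interceptors.env.key = environment
agent1.sources.src1.interceptors.env.value = production
```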
The chapter ends with a look at the Kite SDK. In particular, the Morphline component is highlighted; this allows the chaining together of commands to form a data transformation pipe (similar to UNIX command pipelines).
This chapter provides a brief but useful overview of the major components of Flume, providing a good base for subsequent chapters. There’s a useful point about not managing the configuration files manually. NameNode and MapReduce are mentioned without any explanation – underlining the need for existing Hadoop knowledge.
Useful discussions, diagrams, configuration settings, website links, inter-chapter links, and miscellaneous tips are given throughout. These traits apply to the whole of the book.
Chapter 2 A Quick Start Guide to Flume
This chapter opens with a look at where to download Flume, discussing the tar files, checksums, and signature files – the latter being used to verify the authenticity and integrity of the download.
The chapter continues with a look at the Flume configuration file – this contains key/value pair property settings. A single file can be used for multiple agents or each agent can have its own configuration file. The three simple entries for sources, channels, and sinks are discussed.
The chapter ends with a basic “Hello, World!” example. The Flume tar file is unpacked, and Flume help is examined via the flume-ng help command. A Hello World configuration file is created and its content discussed. The agent is then started, and tested, and the output verified against the agent log file content.
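The book’s exact example may differ, but a typical “Hello, World!” configuration of this kind – a netcat source feeding a logger sink through a memory channel, as in the Flume user guide – looks roughly like this (the agent name a1 is illustrative):

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Netcat source: listens on a TCP port and turns each line into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Logger sink: writes events to the agent's log, handy for testing
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

a1.channels.c1.type = memory
```

Sending a line with netcat to port 44444 should then show up in the agent’s log output.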
This chapter provides useful detail on where to download Flume, and how to install it. A basic “Hello, World!” Flume example is created and discussed, to get you up and running quickly.
Chapter 3 Channels
This chapter looks at channels (memory or file), which link the source and sink. Your use cases will determine the channel type.
The chapter opens with a look at the memory channel. This is non-durable and fast, but has a relatively small capacity, and can result in data loss – this may not be a problem if the data is not critical (e.g. performance metrics). There’s a useful table of memory channel configuration parameters, and these are discussed.
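A memory channel definition using two of the parameters such a table covers might look like this (agent and channel names are illustrative):

```properties
agent1.channels.ch1.type = memory
# Maximum number of events held in the channel
agent1.channels.ch1.capacity = 10000
# Maximum number of events per transaction (per put/take batch)
agent1.channels.ch1.transactionCapacity = 100
```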
Next, the chapter looks at the file channel. While this is slower, it’s durable, and has a larger capacity when compared with the memory channel. This is the best choice when you need all the event data. There’s a useful table of file channel configuration parameters, and these are discussed.
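A corresponding file channel sketch, with illustrative paths, might be:

```properties
agent1.channels.ch1.type = file
# Where the channel keeps its checkpoint and its data (log) files;
# placing these on separate disks can help throughput
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
agent1.channels.ch1.dataDirs = /var/flume/data
# Maximum number of events held in the channel
agent1.channels.ch1.capacity = 1000000
```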
The chapter ends with a look at an experimental channel type, the Spillable Memory channel. This behaves like the memory channel until memory fills, and then behaves like the file channel. The author doesn’t like this channel type, explaining convincingly that its nondeterministic nature undermines capacity planning. Again, there’s a useful table of channel configuration parameters, and these are discussed.
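For completeness, a sketch of a Spillable Memory channel definition, combining in-memory and on-disk settings (values and paths are illustrative):

```properties
agent1.channels.ch1.type = SPILLABLEMEMORY
# Events held in memory before spilling to disk
agent1.channels.ch1.memoryCapacity = 10000
# Events held on disk once memory is full
agent1.channels.ch1.overflowCapacity = 1000000
# File-channel-style storage used for the overflow
agent1.channels.ch1.checkpointDir = /var/flume/checkpoint
agent1.channels.ch1.dataDirs = /var/flume/data
```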
This chapter provides a useful overview of the different channel types, what they are, and their relative strengths and weaknesses. Although the author doesn’t like the Spillable Memory channel, he acknowledges that others may have a different opinion. Useful tables explain the various configuration parameters. The author mentions RAID but doesn’t explain what it is.
Chapter 4 Sinks and Sink Processors
The chapter opens with a look at the HDFS sink, the most popular Hadoop sink. The configuration entry tells the sink which channel to read events from. There’s a useful table of HDFS sink configuration parameters, and these are discussed. Next, files and paths, file rotation, and compression are discussed. Compression typically results in improved performance.
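An HDFS sink sketch showing a dated path, rotation, and compression settings (path and values are illustrative):

```properties
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
# %Y/%m/%d escapes need a timestamp header on each event
# (e.g. from the timestamp interceptor)
agent1.sinks.sink1.hdfs.path = /flume/events/%Y/%m/%d
# Roll files every 5 minutes or at ~128 MB, whichever comes first;
# 0 disables count-based rolling
agent1.sinks.sink1.hdfs.rollInterval = 300
agent1.sinks.sink1.hdfs.rollSize = 134217728
agent1.sinks.sink1.hdfs.rollCount = 0
# Compress output files with gzip (note the odd property spelling)
agent1.sinks.sink1.hdfs.codeC = gzip
```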
The chapter proceeds with a look at event serializers – the mechanism for converting Flume events into another format. Text and Avro serializers are briefly discussed. Other topics examined include timeouts and sink groups.
Next, the chapter looks at the MorphlineSolrSink, which allows you to write to Solr – providing real-time search functionality. Instead of applying interceptors prior to writing to the sink, each Flume event is converted to a Morphline record. The section continues with a look at Morphline configuration files, with a helpful table explaining the various Morphline sink configuration parameters.
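Morphline configuration files use the HOCON format rather than Flume properties. A minimal sketch of the command-chaining idea, using the Kite Morphlines readLine and logInfo commands (the id and format string are illustrative):

```
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      # Parse each record's byte-array body as a line of text
      { readLine { charset : UTF-8 } }
      # Log the resulting record - a stand-in for real transformations
      { logInfo { format : "record: {}", args : ["@{}"] } }
    ]
  }
]
```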
The chapter ends with a look at the ElasticSearchSink, another common sink target. Here each event becomes a document, which is similar to a table row. Again there’s a useful table explaining the various Elasticsearch sink configuration parameters.
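An Elasticsearch sink sketch, with illustrative host, index, and cluster names:

```properties
agent1.sinks.sink1.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent1.sinks.sink1.channel = ch1
# Comma-separated list of Elasticsearch nodes to write to
agent1.sinks.sink1.hostNames = es1.example.com:9300,es2.example.com:9300
# Each event becomes a document in this index/type
agent1.sinks.sink1.indexName = flume
agent1.sinks.sink1.indexType = logs
agent1.sinks.sink1.clusterName = elasticsearch
```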
This chapter provides a useful overview of the popular HDFS, Morphline, and Elasticsearch sink types. There’s a useful point about checking the Flume user guide, since things change with versions. There’s mention of HDFS block replication, highlighting the need for existing Hadoop knowledge.
Chapter 5 Sources and Channel Selectors
This chapter is concerned with getting data from various sources into Flume. The chapter opens with a look at using the tail source, which was in the initial version of Flume, but has since been removed. Various reasons for its removal are discussed.
Next, the Exec source is discussed. This allows a command to be run outside Flume, with its output becoming Flume events. An example is provided, together with a table explaining the parameters.
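An Exec source sketch along these lines, tailing a log file (the file path is illustrative):

```properties
agent1.sources.src1.type = exec
# The command whose stdout becomes Flume events, one per line
agent1.sources.src1.command = tail -F /var/log/app.log
# Restart the command if it exits
agent1.sources.src1.restart = true
agent1.sources.src1.channels = ch1
```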
The chapter continues with a look at syslog sources; syslog is an OS-level means of capturing and moving logs around the system, and much of its functionality overlaps with Flume. The section takes a look at the older syslog UDP source, the newer syslog TCP source, and the multiport syslog TCP source. In all cases, the relevant source configuration parameters are discussed. The JMS source is also briefly discussed; it is used to create Flume events from a JMS queue or topic.
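A syslog TCP source sketch, listening for syslog messages on a port (the port is illustrative; 5140 is a common unprivileged choice):

```properties
agent1.sources.src1.type = syslogtcp
# Interface and port to listen on for incoming syslog messages
agent1.sources.src1.host = 0.0.0.0
agent1.sources.src1.port = 5140
agent1.sources.src1.channels = ch1
```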
The chapter ends with a look at channel selectors. A source can write to multiple channels – either to all of them, or to specific channels chosen by the value of a Flume event header. The relevant configuration properties are discussed.
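Header-based routing uses the multiplexing selector. A sketch (the header name and values are illustrative):

```properties
agent1.sources.src1.channels = ch1 ch2
agent1.sources.src1.selector.type = multiplexing
# Route each event on the value of its "environment" header
agent1.sources.src1.selector.header = environment
agent1.sources.src1.selector.mapping.production = ch1
agent1.sources.src1.selector.mapping.staging = ch2
# Events with no matching header value go here
agent1.sources.src1.selector.default = ch1
```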
This chapter provides a useful discussion of Flume sources that can be used to insert data into Flume. As always, there are helpful tables and discussions of the configuration parameters. There’s some useful incidental code that deletes completed files over 7 days old.
Last Updated (Monday, 26 October 2015)