Apache Flume 2nd Ed
Author: Steve Hoffman
Chapter 6 Interceptors, ETL, and Routing
This chapter is concerned with inspecting and transforming in-flight events using interceptors. It opens with a look at the various interceptors available.
The chapter continues with a look at tiering flows, where you can limit the number of Flume agents running, or limit the number of connections made to Hadoop (perhaps due to resource limits). This is typically achieved by chaining Flume agents together using Avro source/sink pairs. The Avro source/sink and the Thrift source/sink are discussed in more detail: the former offers compression and SSL, while the latter is compatible with many programming languages. Two Log4j appenders are also discussed.
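Tiered flows of this kind are expressed in the agents' properties files. The sketch below shows a sending agent forwarding events over Avro to a collector agent; the agent names, hostname, port, and log path are hypothetical, chosen only to illustrate the source/sink pairing:

```properties
# Sending tier (runs on the web server) - hypothetical names and ports
weblog.sources = tailSrc
weblog.channels = memCh
weblog.sinks = avroOut
weblog.channels.memCh.type = memory
weblog.sources.tailSrc.type = exec
weblog.sources.tailSrc.command = tail -F /var/log/nginx/access.log
weblog.sources.tailSrc.channels = memCh
weblog.sinks.avroOut.type = avro
weblog.sinks.avroOut.channel = memCh
weblog.sinks.avroOut.hostname = collector.example.com
weblog.sinks.avroOut.port = 4141
weblog.sinks.avroOut.compression-type = deflate

# Receiving tier (runs on the collector) - a matching Avro source
collector.sources = avroIn
collector.channels = memCh
collector.channels.memCh.type = memory
collector.sources.avroIn.type = avro
collector.sources.avroIn.channels = memCh
collector.sources.avroIn.bind = 0.0.0.0
collector.sources.avroIn.port = 4141
collector.sources.avroIn.compression-type = deflate
```

Note that the compression setting must match on both ends of the hop, which is one of the small details the chapter's fuller treatment covers.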
Next, the Embedded Flume Agent is discussed; this allows Flume functionality to be integrated within a Java application, which has both advantages and disadvantages, and these are highlighted. The chapter ends with a look at using interceptors to provide routing, with examples provided.
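Routing of this sort is configured with a multiplexing channel selector, which directs each event to a channel based on a header value that an interceptor has set. A minimal sketch follows; the header name, channel names, and values are hypothetical:

```properties
# A static interceptor stamps each event with a routing header
agent.sources.s1.interceptors = i1
agent.sources.s1.interceptors.i1.type = static
agent.sources.s1.interceptors.i1.key = environment
agent.sources.s1.interceptors.i1.value = production

# The multiplexing selector routes on that header
agent.sources.s1.channels = prodCh stageCh
agent.sources.s1.selector.type = multiplexing
agent.sources.s1.selector.header = environment
agent.sources.s1.selector.mapping.production = prodCh
agent.sources.s1.selector.mapping.staging = stageCh
agent.sources.s1.selector.default = stageCh
```

The `default` mapping catches events whose header matches nothing, so events are not silently dropped.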
This chapter provides a useful discussion of the various types of interceptors, and how they can be used to provide ETL and routing functionality.
Chapter 7 Putting It All Together
This chapter takes everything learned in the previous chapters to implement two common use cases, providing end-to-end implementation detail.
The first use case involves collecting web logs and streaming them to a searchable application. The example uses three servers in Amazon's Elastic Compute Cloud (EC2). The first server is the web server containing the logs, the second server is the collector, and the last server contains Elasticsearch.
The chapter provides a step-by-step walkthrough on how to set up each of the servers, including downloading and configuring any required software (including Flume, Elasticsearch, the Nginx web server, and the Kibana user interface). A simple usage test is provided.
The second common use case examines archiving data to HDFS. This use case extends the previous one, so that the log data is written out to Hadoop in addition to Elasticsearch. Again, there is a step-by-step walkthrough of everything required: downloading a version of Hadoop, formatting a new HDFS volume, starting the HDFS daemons, and creating a second channel and HDFS sink on the collector box. Again, a simple usage test is provided.
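The extra channel and HDFS sink on the collector might look something like the following fragment. The agent, channel, and sink names and the HDFS path are hypothetical; the `%Y/%m/%d` escapes in the path require a timestamp header on each event:

```properties
# Collector now fans out to two channels (names are illustrative)
collector.channels = esCh hdfsCh
collector.sinks = esSink hdfsSink

# Second channel feeding the HDFS sink
collector.channels.hdfsCh.type = memory
collector.sinks.hdfsSink.type = hdfs
collector.sinks.hdfsSink.channel = hdfsCh
collector.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/weblogs/%Y/%m/%d
collector.sinks.hdfsSink.hdfs.fileType = DataStream
collector.sinks.hdfsSink.hdfs.rollInterval = 300
```

`DataStream` writes the events as plain text rather than a SequenceFile, which keeps the archived logs directly readable.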
This chapter provides useful hands-on walkthroughs to implement two common use cases. In many ways this chapter is the culmination of the book.
Chapter 8 Monitoring Flume
Monitoring is important for ensuring your system is working as it should. The author notes that Flume monitoring is still a work-in-progress.
The chapter opens with a look at various means of monitoring the agent process. The Monit tool is briefly discussed; it provides free basic functionality (e.g. reporting whether the agent is running, restarting it if stopped, sending email on failure), and it also monitors CPU, disk, and memory. The popular monitoring tool Nagios is discussed next; this can watch Flume agents and provide web alerts, but it doesn't provide restart functionality. The author rightly acknowledges that companies often have monitoring software already in place, and it's often advisable to use this first before recommending other software.
The chapter next looks at monitoring performance metrics. This can be helpful in ensuring data is entering sources at the expected rates, and not overflowing the channels. Flume provides a pluggable monitoring framework, but this is still being worked on. There's a brief review of sending metric data to Ganglia, and of Flume's internal HTTP metrics server.
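The internal HTTP metrics server is enabled with Java system properties when starting the agent. In the sketch below, the agent name, config file, and port are arbitrary illustrative choices:

```shell
flume-ng agent -n collector -c conf -f collector.conf \
  -Dflume.monitoring.type=http \
  -Dflume.monitoring.port=41414

# The channel, source, and sink counters are then served as JSON:
curl http://localhost:41414/metrics
```

Swapping `http` for `ganglia` (with a hosts property) sends the same counters to Ganglia instead.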
This chapter provides a useful review of some helpful monitoring tools. Perhaps the biggest concern is that monitoring within Flume itself is still a work in progress. You can discover more about monitoring in Hadoop in my recent review of Monitoring Hadoop.
Chapter 9 There Is No Spoon – the Realities of Real-time Distributed Data Collection
This chapter provides a miscellany of thoughts loosely centred on data collection into Hadoop.
While interesting, many of the topics are oblique; I suspect this chapter is not needed, or perhaps the relevant points could have been made in other chapters. If you're confused by the chapter title, you need to watch The Matrix.
This book has well-written discussions, useful hands-on walkthroughs, diagrams, configuration settings, website links, inter-chapter links, chapter summaries, and miscellaneous tips throughout.
I enjoyed the author’s approach - he is enthusiastic and explains choices in a considered manner, acknowledging that other opinions exist. He encourages you to test your own use cases on your system. You do need to have an awareness of Hadoop to make full use of this book.
This book will enable you to create Flume agents to transfer log data into Hadoop, with due consideration. I highly recommend this book.
Last Updated: Monday, 26 October 2015