Apache Flume 2nd Ed
Chapters 6 - 9, Conclusion

Author: Steve Hoffman
Publisher: Packt Publishing
ISBN: 978-1784392178
Print: 1784392170
Kindle: B00U1D9WSM
Audience: Flume devs – all levels
Rating: 4.7
Reviewer: Ian Stirk

 

Chapter 6 Interceptors, ETL, and Routing

This chapter covers inspecting and transforming in-flight events using interceptors. It opens with a look at various interceptors, including:

  • Timestamp (adds timestamp header)

  • Host (adds host or IP address header)

  • Static (inserts a single key/value header into each event)

  • Regular expression extractor (builds headers from parts of the event body, thus allowing filtering via a channel selector)
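To make the list above concrete, here is a minimal sketch of what each interceptor does to an event's headers. It is a plain-Python model, not Flume's Java API: the event is assumed to be a dictionary with `headers` and `body`, and the function names, the `datacenter` header, and the sample log line are all hypothetical.

```python
import re
import socket
import time


def timestamp_interceptor(event):
    # Model of the timestamp interceptor: add the time in epoch milliseconds.
    event["headers"]["timestamp"] = str(int(time.time() * 1000))
    return event


def host_interceptor(event, hostname=None):
    # Model of the host interceptor: record which machine handled the event.
    event["headers"]["host"] = hostname or socket.gethostname()
    return event


def static_interceptor(event, key, value):
    # Model of the static interceptor: the same key/value on every event.
    event["headers"][key] = value
    return event


def regex_extractor(event, pattern, header_names):
    # Model of the regex extractor: copy capture groups from the body
    # into headers, so a channel selector can route on them later.
    match = re.search(pattern, event["body"])
    if match:
        for name, value in zip(header_names, match.groups()):
            event["headers"][name] = value
    return event


def apply_chain(body):
    # Interceptors run in order, each one seeing the previous one's output.
    event = {"headers": {}, "body": body}
    event = timestamp_interceptor(event)
    event = host_interceptor(event, hostname="web01")
    event = static_interceptor(event, "datacenter", "eu-west")
    event = regex_extractor(event, r"\s(\d{3})\s", ["status"])
    return event


event = apply_chain('10.0.0.1 - - "GET /index.html HTTP/1.1" 404 512')
print(event["headers"])
```

In a real agent these steps are declared in the agent's configuration file rather than written as code; the sketch only shows the effect on the event.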

The chapter continues with a look at tiering flows, where you limit the number of Flume agents running or the number of requests connecting to Hadoop (perhaps due to space limits). This is typically achieved by chaining Flume agents using Avro source/sink pairs. The Avro source/sink and the Thrift source/sink are discussed in more detail. The former offers compression and SSL, and the latter is compatible with many programming languages. Two Log4J appenders are also discussed.

Next, the Embedded Flume Agent is discussed. This allows integration of Flume functionality within a Java application, which has both advantages and disadvantages, and these are highlighted. The chapter ends with a look at using interceptors to provide routing, with examples.
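The routing idea can be sketched as follows: once an interceptor has set a header, a selector maps the header's value to one or more channels and falls back to a default otherwise. This is a plain-Python illustration under my own assumptions, not code from the book; the `status` header and the channel names `errors` and `archive` are hypothetical.

```python
from collections import defaultdict


def route(event, header, mapping, default):
    # Return the channel names that should receive this event.
    value = event["headers"].get(header)
    return mapping.get(value, default)


def dispatch(events, header, mapping, default):
    # Deliver each event body to every channel chosen for it.
    channels = defaultdict(list)
    for event in events:
        for name in route(event, header, mapping, default):
            channels[name].append(event["body"])
    return dict(channels)


events = [
    {"headers": {"status": "200"}, "body": "GET /index.html"},
    {"headers": {"status": "500"}, "body": "GET /broken"},
    {"headers": {}, "body": "no status header"},
]

# Server errors go to both channels; everything else goes to the default.
mapping = {"500": ["errors", "archive"]}
routed = dispatch(events, "status", mapping, default=["archive"])
print(routed)
```

Events without the header are not dropped here; they simply take the default route, which seems the safer choice for log data.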

This chapter provides a useful discussion of the various types of interceptors, and how they can be used to provide ETL and routing functionality.

 

Chapter 7 Putting It All Together

This chapter takes everything learned in the previous chapters to implement two common use cases, providing end-to-end implementation detail.

The first use case involves getting and streaming web logs to a searchable application. The example has three servers in Amazon’s Elastic Compute Cloud (EC2). The first server is the web server containing the logs, the second server is the collector, and the last server contains Elasticsearch.

The chapter provides a step-by-step walkthrough on how to set up each of the servers, including downloading and configuring the required software (Flume, Elasticsearch, the Nginx web server, and the Kibana user interface). A simple usage test is provided.

The second common use case examines archiving data to HDFS. It extends the previous use case so that the log data is written to Hadoop in addition to Elasticsearch. Again, there is a step-by-step walkthrough of everything you require: downloading a version of Hadoop, formatting a new HDFS volume, starting the HDFS daemons, and creating a second channel and HDFS sink on the collector box. A simple usage test is again provided.

This chapter provides useful hands-on walkthroughs to implement two common use cases. In many ways this chapter is the culmination of the book, and arguably its natural end.

 

Chapter 8 Monitoring Flume

Monitoring is important for ensuring your system is working as it should. The author notes that Flume monitoring is still a work-in-progress.

The chapter opens with a look at various means of monitoring the agent process. The Monit tool is briefly discussed. It provides free basic functionality (e.g. it reports whether the agent is running, restarts it if stopped, and sends email on failure), and it also monitors CPU, disk and memory. The popular monitoring tool Nagios is discussed next. This can watch Flume agents and provide web alerts, but it doesn’t provide restart functionality. The author rightly acknowledges that companies often have monitoring software already in place, and that it’s often advisable to use this first before recommending other software.

The chapter next looks at monitoring performance metrics. This can be helpful in ensuring data is entering sources at the expected rates, and not overflowing the channels. Flume provides a pluggable monitoring framework, but this is still being worked on. There’s a brief review of sending metric data to Ganglia, and of exposing it through an internal HTTP server.
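As an illustration of checking that channels are not overflowing, the sketch below parses a JSON metrics payload and reports how full each channel is. The payload is a made-up sample, and the field names (`Type`, `ChannelSize`, `ChannelCapacity`) are my assumption about what the internal HTTP server reports, so check them against your own agent's output. The function names and the 80% threshold are hypothetical.

```python
import json

# Hypothetical sample of a metrics payload; values arrive as strings.
SAMPLE = """
{
  "CHANNEL.c1": {"Type": "CHANNEL", "ChannelSize": "850", "ChannelCapacity": "1000"},
  "SOURCE.s1": {"Type": "SOURCE", "EventAcceptedCount": "12000"}
}
"""


def channel_fill(metrics_json):
    # Return {component name: percentage full} for channel components only.
    metrics = json.loads(metrics_json)
    fill = {}
    for name, values in metrics.items():
        if values.get("Type") != "CHANNEL":
            continue
        size = float(values["ChannelSize"])
        capacity = float(values["ChannelCapacity"])
        fill[name] = 100.0 * size / capacity if capacity else 0.0
    return fill


def nearly_full(metrics_json, threshold=80.0):
    # Names of channels at or above the threshold, sorted for stable output.
    fill = channel_fill(metrics_json)
    return sorted(name for name, pct in fill.items() if pct >= threshold)


print(channel_fill(SAMPLE))
print(nearly_full(SAMPLE))
```

A check like this could feed whatever alerting tool is already in place, in keeping with the author's advice to use existing monitoring software first.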

This chapter provides a useful review of some helpful monitoring tools. Perhaps the biggest concern is that monitoring within Flume itself is still a work in progress. You can discover more about monitoring in Hadoop in my recent review of Monitoring Hadoop.

 

Chapter 9 There Is No Spoon – the Realities of Real-time Distributed Data Collection

This chapter provides a miscellany of thoughts vaguely centred on data collection into Hadoop. Topics discussed include: 

  • Transport time versus log time (can cause processing problems)

  • Time zones are evil (suggests using UTC everywhere)

  • Capacity planning (things change over time - keep 20% free space)

  • Compliance and data expiry (data can be sensitive – has links to regulatory websites) 
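Two of the points above lend themselves to a short sketch: stamping events in UTC rather than local time, and checking that storage keeps 20% free. This is my own plain-Python illustration, not code from the book; the function names and the sample values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_utc_millis(local_dt):
    # Convert a time-zone-aware datetime to epoch milliseconds.
    # Naive datetimes are rejected, since their zone cannot be known.
    if local_dt.tzinfo is None:
        raise ValueError("naive datetime: time zone unknown")
    return int(local_dt.astimezone(timezone.utc).timestamp() * 1000)


def has_headroom(used_bytes, total_bytes, min_free=0.20):
    # True when at least min_free of the capacity is still available.
    return (total_bytes - used_bytes) / total_bytes >= min_free


# The same instant written in two zones yields the same timestamp.
new_york = datetime(2015, 10, 26, 12, 0, tzinfo=timezone(timedelta(hours=-5)))
utc = datetime(2015, 10, 26, 17, 0, tzinfo=timezone.utc)
print(to_utc_millis(new_york) == to_utc_millis(utc))

print(has_headroom(used_bytes=700, total_bytes=1000))
print(has_headroom(used_bytes=900, total_bytes=1000))
```

Rejecting naive datetimes is a deliberate choice here: guessing a zone is exactly how transport time and log time drift apart.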

While interesting, many of the topics are oblique. I suspect this chapter is not needed, or perhaps the relevant points could have been made in other chapters. If you’re confused about the chapter title, you need to watch The Matrix.

Conclusion

This book has well-written discussions, useful hands-on walkthroughs, diagrams, configuration settings, website links, inter-chapter links, chapter summaries, and miscellaneous tips throughout.

I enjoyed the author’s approach - he is enthusiastic and explains choices in a considered manner, acknowledging that other opinions exist. He encourages you to test your own use cases on your system. You do need to have an awareness of Hadoop to make full use of this book.

This book will enable you to create Flume agents to transfer log data into Hadoop, with due consideration. I highly recommend this book.




Last Updated ( Monday, 26 October 2015 )