Hadoopi - Raspberry Pi Hadoop Cluster
Written by Kay Ewbank   
Tuesday, 17 July 2018

There's an updated version of Hadoopi, a Hadoop distribution for the Raspberry Pi. Hadoopi supports various components of the Hadoop ecosystem including HBase, Hive, and Spark. The new release has wired networking (for improved performance and reliability) plus the addition of metrics collection with Prometheus and visualisation of those metrics in Grafana dashboards.

Hadoopi is a project on GitHub that has the configuration files and Chef code to configure a cluster of five Raspberry Pi 3s as a working Hadoop distribution running Hue.



If the idea of running Hadoop on Raspberry Pis sounds unlikely, it follows a number of earlier experiments combining the two. It is also an illustration of just what the Raspberry Pi is capable of, along with other eye-catching uses we've covered such as the Raspberry Pi cluster created by GCHQ, the UK equivalent of the NSA, and the simulation of the Turing-Welchman Bombe.

The notion of using Raspberry Pis to learn Hadoop makes a lot of sense. Hadoop has a distributed architecture that needs multiple computers to work, so if you're interested in big data and want to get to grips with Hadoop, you either need to spend a lot of money on hardware or use a cloud-based Hadoop distribution. Cloud-based systems mask the interaction between the software and the hardware, making it harder to get an in-depth understanding of how the cluster works.

One of the first Raspberry Pi-based Hadoop systems was demonstrated by Jamie Whitehorn at the Strata + Hadoop World conference in 2013, and Andy Burgin, the creator of Hadoopi, acknowledges Whitehorn as one of the inspirations for Hadoopi.

On GitHub, Burgin points out that while Hadoopi runs most of the Hadoop ecosystem, there's no support for Impala because the Pi hasn't really got enough power while running other Hadoop components. Security is HDFS-based rather than using Sentry, and the hardware limitations of 1GB of memory and only four cores mean you can really only run one task at a time, and it isn't fast. Burgin says:

"Its slowwwwwwwww - the combination of teeny amount of RAM and only 4 cores means this is not built for speed, be realistic with your expectations!"

The fact that the Raspberry Pi is ARM-based means you have to compile Hadoop (with the correct version of the protobuf libraries), Oozie and Hue yourself.

The updated version of Hadoopi adds support for Grafana dashboards that are built using Prometheus metrics collected by a number of exporters. The Node exporter provides metrics about the state of each node; the MySQL exporter provides metrics for the MySQL server; and the JMX exporter exposes selected JMX metrics.
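To give a feel for how Prometheus pulls metrics from exporters like these, here is a minimal scrape configuration sketch. The hostnames (hadoopi01 to hadoopi05), the job names, and the JMX exporter port are illustrative assumptions, not taken from the Hadoopi repository; the node and MySQL exporter ports shown are those projects' defaults.

```yaml
# Hypothetical Prometheus scrape config for a five-node Pi cluster.
# node_exporter defaults to port 9100, mysqld_exporter to 9104;
# the JMX exporter port (7070 here) is whatever the JVM is started with.
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['hadoopi01:9100', 'hadoopi02:9100', 'hadoopi03:9100',
                  'hadoopi04:9100', 'hadoopi05:9100']
  - job_name: 'mysql'
    static_configs:
      - targets: ['hadoopi01:9104']
  - job_name: 'hadoop-jmx'
    static_configs:
      - targets: ['hadoopi01:7070']
```

Grafana then queries Prometheus as a data source to render the dashboards from these collected series.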

The Hadoopi page on GitHub has full details of how to set up and run the cluster.



More Information

Hadoopi On GitHub

Related Articles

A New Raspberry Pi For Pi Day

GCHQ Builds A Raspberry Pi Cluster

Raspberry Pi 2 - Quad Core And Runs Windows

Astro Pi - What Can A Raspberry Pi Do In Space?

TJBot - Using Raspberry Pi With Watson

Simulating the Turing-Welchman Bombe With A Pi

Hadoop 3 Adds HDFS Erasure Coding

Hadoop 2.9 Adds Resource Estimator

Hadoop Adds In-Memory Caching

Hadoop SQL Query Engine Launched

Hadoop 2 Introduces YARN




