Hive on Hadoop for MongoDB
Written by Kay Ewbank   
Thursday, 22 August 2013

There’s a new version of 10gen’s MongoDB Connector for Hadoop with added support for Apache Hive and incremental MapReduce jobs.


The MongoDB Connector for Hadoop presents MongoDB as a Hadoop-compatible file system so that real-time data from MongoDB can be read and processed by Hadoop MapReduce jobs. It examines the MongoDB collection and calculates a set of splits from the data. Each split is assigned to a node in the Hadoop cluster, and in parallel, Hadoop nodes pull data for their splits from MongoDB (or BSON) and process them locally. Hadoop then merges the results and streams the output back to MongoDB or BSON.
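The split-process-merge flow described above can be pictured with a small, self-contained sketch. It is not the connector's code: the in-memory `collection`, the fixed `split_size`, and the helper names are all assumptions made for illustration, and the "nodes" simply run one after another in a list comprehension.

```python
from collections import Counter

def calculate_splits(sorted_keys, split_size):
    """Cut a sorted list of keys into (lower, upper) ranges; upper is exclusive."""
    bounds = sorted_keys[::split_size]
    # An upper bound of None means "to the end of the collection".
    return list(zip(bounds, bounds[1:] + [None]))

def read_split(collection, lower, upper):
    """Pull only the documents whose _id falls inside one split."""
    return [doc for doc in collection
            if doc["_id"] >= lower and (upper is None or doc["_id"] < upper)]

def map_split(docs):
    """Per-split work: count documents per city."""
    return Counter(doc["city"] for doc in docs)

# Hypothetical stand-in for a MongoDB collection.
collection = [{"_id": i, "city": "London" if i % 3 else "Paris"} for i in range(10)]

splits = calculate_splits(sorted(d["_id"] for d in collection), split_size=4)
partials = [map_split(read_split(collection, lo, hi)) for lo, hi in splits]

# Merge step: combine the partial counts from every split.
totals = sum(partials, Counter())
print(splits)
print(dict(totals))
```

In a real cluster each split would be read by a different Hadoop node; the point here is only that the splits are independent, so their partial results can be merged afterwards.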

The major changes in the new version start with support for Apache Hive, which allows SQL-like queries across live MongoDB data sets. Hive is a query engine for Hadoop that provides an alternative to writing MapReduce jobs for analyzing Hadoop Distributed File System (HDFS) datasets. Using Hive with MongoDB won’t be completely straightforward: some MongoDB data types, such as ObjectId, don’t have direct matches in Hive, and because the underlying data models differ it may be tricky to work out how to express mappings between Hive fields and MongoDB fields so that all cases are handled correctly.
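To make the mapping problem concrete, here is a minimal sketch of turning a nested MongoDB-style document into a flat, Hive-style row. It does not use the connector or Hive: `COLUMN_MAPPING`, `FakeObjectId`, and the helper functions are hypothetical names, and converting unmatched types to strings is just one possible policy.

```python
# Hypothetical mapping from Hive column name to MongoDB field path.
COLUMN_MAPPING = {"id": "_id", "name": "name", "city": "address.city"}

class FakeObjectId:
    """Stand-in for MongoDB's ObjectId so the sketch needs no driver."""
    def __init__(self, hex_value):
        self.hex_value = hex_value

    def __str__(self):
        return self.hex_value

def lookup(doc, path):
    """Follow a dotted path; return None when any part is missing."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

def to_hive_value(value):
    """Pass through types with an obvious column equivalent; stringify the rest."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    return str(value)

def to_row(doc, mapping=COLUMN_MAPPING):
    """Build one flat row, with a column for every entry in the mapping."""
    return {col: to_hive_value(lookup(doc, path)) for col, path in mapping.items()}

doc = {"_id": FakeObjectId("51f0c0ffee0123456789abcd"),
       "name": "Ada",
       "address": {"city": "London"}}
row = to_row(doc)

# A document without the optional fields still yields a complete row.
sparse_row = to_row({"_id": FakeObjectId("51f0c0ffee0123456789abce")})
print(row)
print(sparse_row)
```

The sparse document shows why "all cases" is the hard part: a schemaless collection can omit or nest fields that a fixed table schema expects, so the mapping has to decide what a missing value becomes.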



The new version adds support for MongoDB’s native BSON (Binary JSON) backup files, which can be stored locally in HDFS, reducing data movement between MongoDB and Hadoop. The ability to work on MongoDB backup files also opens the possibility of reducing the load on an operational cluster; analysis could be carried out on the backup without significant loss of accuracy.
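One reason BSON backup files suit this kind of processing is their framing: every BSON document begins with a little-endian 32-bit length that counts the whole document, and a dump file is a sequence of such documents. The sketch below walks that framing using only the standard library. It is an illustration rather than the connector's reader, and `iter_bson_documents` and `int32_doc` are names invented here.

```python
import struct

def iter_bson_documents(data):
    """Yield the raw bytes of each document in a concatenated BSON byte string."""
    offset = 0
    while offset < len(data):
        if len(data) - offset < 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from("<i", data, offset)
        # The smallest valid document is 5 bytes: the length plus a terminator.
        if length < 5 or offset + length > len(data):
            raise ValueError("truncated or corrupt BSON data")
        yield data[offset:offset + length]
        offset += length

def int32_doc(key, value):
    """Hand-build a BSON document holding one int32 field (type byte 0x10)."""
    body = b"\x10" + key.encode("utf-8") + b"\x00" + struct.pack("<i", value) + b"\x00"
    return struct.pack("<i", len(body) + 4) + body

# Two tiny documents, {"a": 1} and {"count": 42}, stored back to back.
dump = int32_doc("a", 1) + int32_doc("count", 42)
docs = list(iter_bson_documents(dump))
print([len(d) for d in docs])
```

Because each document announces its own size, a reader can skip from one document to the next without decoding the contents, which is what makes cutting a large file into independent chunks practical.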

The new version also adds support for incremental MapReduce jobs, making it easier to carry out efficient ad-hoc analytics. This is achieved using a new feature, MongoUpdateWritable, that allows Hadoop to modify an existing collection in MongoDB rather than only writing to new collections. Using this, you can run incremental MapReduce jobs on a daily basis to aggregate trends or match patterns, with the results accumulated in a single collection that MongoDB can query efficiently.
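The difference between writing a new collection and updating an existing one can be sketched without any MongoDB connection. The code below mimics the effect of an upsert with MongoDB's `$inc` and `$set` update operators on a plain Python list; `apply_update`, `daily_totals`, and the sample figures are assumptions for illustration, and the update class mentioned above is not used or modelled here.

```python
def apply_update(collection, query, modifiers, upsert=True):
    """Apply $inc / $set modifiers to the first matching document in a list."""
    target = next(
        (d for d in collection if all(d.get(k) == v for k, v in query.items())),
        None,
    )
    if target is None:
        if not upsert:
            return
        # Upsert: create the document from the query fields, then modify it.
        target = dict(query)
        collection.append(target)
    for field, amount in modifiers.get("$inc", {}).items():
        target[field] = target.get(field, 0) + amount
    target.update(modifiers.get("$set", {}))

# Existing aggregate collection, built by earlier runs (hypothetical data).
daily_totals = [{"_id": "London", "visits": 120}]

# Output of today's reduce step, covering only today's new data.
todays_reduce_output = {"London": 15, "Paris": 7}

for city, count in todays_reduce_output.items():
    apply_update(daily_totals, {"_id": city}, {"$inc": {"visits": count}})

print(daily_totals)
```

Each run only has to process the new day's data; the running totals live in one collection, so queries never need to stitch together a separate output collection per job.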

There’s a good webinar explaining the new features, which are also summarized in an accompanying slideset.




More Information


10gen Webinar













Last Updated ( Thursday, 22 August 2013 )

I Programmer News
Copyright © 2018 All Rights Reserved.