Hadoop 3 Adds HDFS Erasure Coding
Written by Kay Ewbank   
Wednesday, 20 December 2017

There's a new release of Hadoop with improvements including support for HDFS erasure coding, a preview of v2 of the YARN Timeline Service, and improvements to YARN/HDFS federation.

Hadoop is a framework that can be used to process large data sets across clusters of computers using simple programming models. YARN is its framework for job scheduling and cluster resource management, and HDFS is its distributed file system.

YARN federation is used to scale single YARN clusters to tens of thousands of nodes, by federating multiple YARN sub-clusters.

The new release was described by Andrew Wang, Apache Hadoop 3 release manager, as a major milestone for the project, and Hadoop's biggest release ever.
 

The addition of HDFS erasure coding should make data more durable and reduce the amount of storage needed for HDFS. The default three-times replication scheme in HDFS has a 200 per cent overhead in storage space and other resources such as network bandwidth. For many datasets with relatively low I/O activity, the additional block replicas are rarely accessed during normal operations, but still consume the same amount of resources as the first replica. If erasure coding is used in place of replication, the storage overhead is no more than 50 per cent.

HDFS erasure coding borrows from RAID, where erasure coding is implemented by striping. Striping divides logically sequential data into smaller units and stores consecutive units on different disks. For each stripe of data units, parity is calculated and stored; this is the encoding. If a unit is lost or corrupted, it can be recovered by calculating back from the surviving data and the parity.
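The overhead figures and the parity idea can be checked with a short Python sketch. It is illustrative only and is not Hadoop code: the 6 data + 3 parity layout is an assumption chosen because it yields the 50 per cent figure, and a single XOR parity cell stands in for the stronger codes a real deployment would use, so it can recover only one lost cell per stripe. The function names and sample cells are hypothetical.

```python
from functools import reduce


def storage_overhead(data_units: int, redundant_units: int) -> float:
    """Extra storage as a fraction of the original data size."""
    return redundant_units / data_units


# Three-times replication: one original copy plus two extra replicas.
replication = storage_overhead(1, 2)  # 2.0, i.e. 200 per cent

# Assumed erasure-coded layout: 6 data cells plus 3 parity cells per stripe.
erasure_coded = storage_overhead(6, 3)  # 0.5, i.e. 50 per cent


def xor_parity(cells):
    """Byte-wise XOR of equal-length cells (simplified single-parity code)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*cells))


# A stripe of three data cells, each imagined to live on a different disk.
cells = [b"HDFS", b"YARN", b"EC!!"]
parity = xor_parity(cells)

# Simulate losing the second cell, then rebuild it from the survivors
# and the parity cell.
recovered = xor_parity([cells[0], cells[2], parity])

print(f"replication overhead: {replication:.0%}")
print(f"erasure coding overhead: {erasure_coded:.0%}")
print(f"recovered cell: {recovered!r}")
```

XOR works here because XOR-ing a value with itself cancels it out, so combining the parity with every surviving cell leaves exactly the missing one.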

The new release also includes a preview of the YARN Timeline Service v.2, which improves the scalability, reliability, and usability of the Timeline Service. The service is responsible for persisting application-specific information, and for persisting generic information about completed applications.

Support for YARN resource types has also been added, making it possible to schedule additional resources such as disks and GPUs for better integration with machine learning and container workloads.

Other improvements include the ability to federate YARN and HDFS subclusters transparently, and opportunistic container execution to improve resource utilization and increase task throughput for short-lived containers. Support for cloud storage systems such as Amazon S3 and Azure Data Lake has also been improved.


More Information

Apache Hadoop Site 


 
