Apache MADlib Adds HITS Implementation
Written by Kay Ewbank   
Wednesday, 10 January 2018

There's a new version of Apache MADlib with new features including an implementation of HITS. MADlib makes it possible to carry out big data machine learning from SQL.

MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. It currently supports PostgreSQL, Greenplum Database, and Apache HAWQ. It started as a collaboration between a team at UC Berkeley and developers at Pivotal, which was previously known as EMC Greenplum. The project entered the Apache Incubator in 2015.

MADlib uses the full compute power of an MPP (Massively Parallel Processing) architecture to process very large data sets, whereas other products are limited by the amount of data that can be loaded into memory on a single node. It runs as a fully parallelized implementation on GPDB (Greenplum Database) and HAWQ, meaning that on large data sets it offers much better performance than R or Python libraries. It scales because you can add more nodes to achieve higher performance as your data grows. Greenplum Database is an advanced, fully featured, open source data platform designed for analyzing petabyte-scale data volumes. HAWQ is a Hadoop-native SQL database for advanced analytics, built on an MPP architecture, and is currently an Apache Incubator project.

When MADlib was made a top level project in August 2017, Joe Hellerstein, Professor of Computer Science at UC Berkeley, Co-Founder and Chief Strategy Officer at Trifacta, and one of the original authors of MADlib, said:

"MADlib was conceived from the outset as an open-source meeting ground for software developers, computing researchers and data scientists to collaborate on scalable, in-database machine learning and statistics."

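The new release, MADlib 1.13, adds a HITS (Hyperlink-Induced Topic Search) link analysis algorithm. HITS rates web pages by analyzing the links between them, giving each page an authority score (how well it is linked to) and a hub score (how well it links to good authorities).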
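To give a feel for what HITS computes, here is a minimal single-machine sketch in plain Python. It is not MADlib's implementation or its SQL interface; the function name `hits`, its parameters and the tiny edge list are all made up for illustration. It simply repeats the standard authority and hub updates, normalizing after each pass, until the scores stop changing.

```python
from math import sqrt


def hits(edges, max_iter=100, tol=1e-8):
    """Return (authority, hub) score dicts for a list of (src, dst) links.

    Illustrative sketch only: names and defaults are assumptions,
    not MADlib's API.
    """
    nodes = sorted({node for edge in edges for node in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}

    for _ in range(max_iter):
        # Authority update: sum the hub scores of pages linking to each page.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]

        # Hub update: sum the new authority scores of the pages each page links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]

        # Normalize both score vectors to unit L2 length so they stay bounded.
        for scores in (new_auth, new_hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm

        # Stop once the total change in scores falls below the tolerance.
        delta = sum(
            abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n]) for n in nodes
        )
        auth, hub = new_auth, new_hub
        if delta < tol:
            break

    return auth, hub


# Hypothetical link graph: pages a and b both link to c, and c links to d.
edges = [("a", "c"), ("b", "c"), ("c", "d")]
authority, hub = hits(edges)
```

In this toy graph, page c has the most inbound links and so ends up as the strongest authority, while a and b, which both point at it, end up as the strongest hubs.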

Another improvement in the new release is better support for k-nearest neighbors classification. k-NN in MADlib now offers more distance metrics, and can show the list of neighbors in the output table.
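The two k-NN additions are easy to picture with a small plain-Python sketch. This is not MADlib code: the function `knn_classify`, the two distance functions and the sample points are all hypothetical, chosen only to show a pluggable distance metric and a result that includes the ids of the neighbors used.

```python
from collections import Counter


def euclidean(p, q):
    """Straight-line distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def knn_classify(train, point, k=3, distance=euclidean):
    """Classify `point` by majority vote among its k nearest training rows.

    `train` is a list of (id, features, label) tuples. Returns the predicted
    label together with the ids of the neighbors that voted, nearest first.
    """
    ranked = sorted(train, key=lambda row: distance(row[1], point))
    nearest = ranked[:k]
    votes = Counter(label for _, _, label in nearest)
    predicted = votes.most_common(1)[0][0]
    return predicted, [row_id for row_id, _, _ in nearest]


# Made-up training data: (id, features, label).
train = [
    (1, (1.0, 1.0), "red"),
    (2, (1.5, 2.0), "red"),
    (3, (5.0, 5.0), "blue"),
    (4, (6.0, 5.5), "blue"),
    (5, (2.0, 1.0), "red"),
]

label, neighbors = knn_classify(train, (1.2, 1.1), k=3, distance=manhattan)
# label is "red"; neighbors is [1, 5, 2]
```

Passing the metric in as a function means that supporting another one is a matter of writing one more small function, and returning the neighbor ids makes it possible to check why a point was classified the way it was.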

Grouping support has been added to MLP (Multilayer Perceptron). The quality of results for correlation analysis has also been improved: where a row contains a NULL, only the NULL value is now ignored rather than the whole row.
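The difference the correlation change makes can be seen in a short plain-Python sketch. It assumes the change amounts to what statisticians call pairwise rather than listwise deletion, which is one reading of the release note and not a description of MADlib's internals. The function names and the data are invented for the example, with `None` standing in for SQL NULL.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def listwise_corr(rows, i, j):
    """Drop every row that has a NULL in any column, then correlate i and j."""
    complete = [r for r in rows if all(v is not None for v in r)]
    xs = [r[i] for r in complete]
    ys = [r[j] for r in complete]
    return pearson(xs, ys), len(complete)


def pairwise_corr(rows, i, j):
    """Drop a row only if column i or column j itself is NULL."""
    pairs = [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    return pearson(xs, ys), len(pairs)


# Made-up data: the third column has NULLs, the first two do not.
rows = [
    (1.0, 2.0, 5.0),
    (2.0, 4.1, None),
    (3.0, 5.9, 1.0),
    (4.0, 8.2, None),
    (5.0, 9.8, 3.0),
]

old_corr, old_n = listwise_corr(rows, 0, 1)   # uses only the 3 complete rows
new_corr, new_n = pairwise_corr(rows, 0, 1)   # uses all 5 rows
```

Under the listwise approach, NULLs in the third column throw away two perfectly good observations for the first two columns. The pairwise approach keeps them, so each coefficient is estimated from as much data as is actually available.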



More Information

MADlib site

Related Articles

Apache PredictionIO Reaches Top Level Status

Azure Machine Learning Enhancements

Amazon's Giant Push Into Machine Learning

Spark Gets NLP Library





Last Updated ( Wednesday, 10 January 2018 )