Apache MADlib Adds HITS Implementation
Written by Kay Ewbank   
Wednesday, 10 January 2018

There's a new version of Apache MADlib with new features including an implementation of HITS. MADlib makes it possible to carry out big data machine learning from SQL.

MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. It currently supports PostgreSQL, Greenplum Database, and Apache HAWQ. It started as a collaboration between a team at UC Berkeley and developers at Pivotal. Pivotal was previously known as EMC Greenplum. The project was added to Apache as an incubator project in 2015.

MADlib uses the full compute power of the MPP (Massively Parallel Processing) architecture to process very large data sets, whereas other products are limited by the amount of data that can be loaded into memory on a single node. It runs as a fully parallelized implementation on GPDB (Greenplum Database) and HAWQ for large data sets, meaning it offers much better performance than R or Python libraries. It scales because you can add more nodes to achieve higher performance as your data grows. Greenplum Database is an advanced, fully featured, open source data platform designed for analyzing petabyte-scale data volumes. HAWQ is a Hadoop-native SQL analytics MPP database for enterprises, and is currently an Apache Incubator project.

When MADlib was made a top-level project in August 2017, Joe Hellerstein, Professor of Computer Science at UC Berkeley, Co-Founder and Chief Strategy Officer at Trifacta, and one of the original authors of MADlib, said:

"MADlib was conceived from the outset as an open-source meeting ground for software developers, computing researchers and data scientists to collaborate on scalable, in-database machine learning and statistics."

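The new release, MADlib 1.13, adds a HITS (Hyperlink-Induced Topic Search) link analysis algorithm. HITS rates web pages by analyzing the links between them, giving each page a hub score (how well it points to good pages) and an authority score (how well it is pointed to by good hubs).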
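
To give a feel for what HITS computes, here is a minimal sketch of the textbook algorithm in plain Python. It is not MADlib's implementation or its SQL interface; the function name `hits`, the edge-list input and the fixed iteration count are all assumptions made for illustration.

```python
from math import sqrt


def hits(edges, iterations=50):
    """Textbook HITS on a list of (source, target) links.

    Hypothetical helper for illustration only, not MADlib's API.
    Returns two dicts: hub scores and authority scores.
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}

    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {node: v / norm for node, v in new_auth.items()}

        # Hub update: sum the authority scores of pages linked to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {node: v / norm for node, v in new_hub.items()}

    return hub, auth


# Tiny made-up link graph: pages a and b link to c, and a also links to d.
edges = [("a", "c"), ("b", "c"), ("a", "d")]
hub, auth = hits(edges)
```

In this toy graph, page c ends up with the highest authority score because two pages link to it, and page a ends up as the strongest hub because it links to both c and d.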

Another improvement in the new release is better handling of k-nearest neighbors classification. k-NN in MADlib now has more distance metrics, and the ability to show a list of neighbors in the output table.

Grouping support has been added to MLP (MultiLayer Perceptron), and the quality of results for correlation analysis has been improved by ignoring only the NULL value itself rather than the whole row containing it.
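
The difference in NULL handling can be illustrated with a small Python sketch. This is one reading of the change, not MADlib's code: it assumes that "ignoring only a NULL value" means a row is skipped only for the column pairs that involve its NULL, and the helper names and sample data are invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length lists of floats."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


def corr_drop_rows(rows, i, j):
    """Drop any row containing a NULL (None) in any column."""
    kept = [r for r in rows if all(v is not None for v in r)]
    return pearson([r[i] for r in kept], [r[j] for r in kept]), len(kept)


def corr_drop_values(rows, i, j):
    """Skip a row only if the NULL is in one of the two columns compared."""
    kept = [r for r in rows if r[i] is not None and r[j] is not None]
    return pearson([r[i] for r in kept], [r[j] for r in kept]), len(kept)


# Made-up data: the third column has NULLs in two of the five rows.
rows = [
    (1.0, 2.0, 5.0),
    (2.0, 4.1, None),
    (3.0, 5.9, 1.0),
    (4.0, 8.2, None),
    (5.0, 9.8, 3.0),
]

r_rows, n_rows = corr_drop_rows(rows, 0, 1)
r_values, n_values = corr_drop_values(rows, 0, 1)
```

Here the correlation between the first two columns is computed from all five rows when only NULL values are ignored, but from just three rows when whole rows are dropped, so less data is thrown away.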

More Information

MADlib site

Related Articles

Apache PredictionIO Reaches Top Level Status

Azure Machine Learning Enhancements

Amazon's Giant Push Into Machine Learning

Spark Gets NLP Library

 

 

 


Last Updated ( Wednesday, 10 January 2018 )
 
 

   
Copyright © 2018 i-programmer.info. All Rights Reserved.