Apache MADlib Adds HITS Implementation
Written by Kay Ewbank
Wednesday, 10 January 2018
There's a new version of Apache MADlib with new features, including an implementation of HITS. MADlib makes it possible to carry out big data machine learning from SQL.
MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. It currently supports PostgreSQL, Greenplum Database, and Apache HAWQ. It started as a collaboration between a team at UC Berkeley and developers at Pivotal (previously known as EMC Greenplum). The project entered the Apache Incubator in 2015.
MADlib uses the full compute power of an MPP (Massively Parallel Processing) architecture to process very large data sets, whereas other products are limited by the amount of data that can be loaded into memory on a single node. It runs as a fully parallelized implementation on GPDB (Greenplum Database) and HAWQ, which for large data sets can mean much better performance than R or Python libraries running on a single node. It scales by adding nodes, so performance can keep pace as your data grows. Greenplum Database is an advanced, fully featured, open source data platform designed for analyzing petabyte-scale data volumes. HAWQ is a Hadoop-native SQL MPP database for advanced analytics, and is currently an Apache Incubator project.
When MADlib was made a top-level project in August 2017, Joe Hellerstein, Professor of Computer Science at UC Berkeley, Co-Founder and Chief Strategy Officer at Trifacta, and one of the original authors of MADlib, said:
"MADlib was conceived from the outset as an open-source meeting ground for software developers, computing researchers and data scientists to collaborate on scalable, in-database machine learning and statistics."
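The new release, MADlib 1.13, adds a HITS (Hyperlink-Induced Topic Search) link analysis algorithm. HITS rates web pages by analyzing the links between them, giving each page an authority score (how well it is linked to) and a hub score (how well it links to good authorities).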
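To make the idea behind HITS concrete, here is a minimal single-machine sketch in plain Python. It is not MADlib's implementation or API: the function name `hits`, its parameters, and the choice of L2 normalization are assumptions made for illustration. It shows the textbook iteration in which authority scores are summed from the hubs that point at a page and hub scores from the authorities a page points to.

```python
from collections import defaultdict
from math import sqrt


def hits(edges, max_iter=100, tol=1e-8):
    """Illustrative HITS on a list of (source, target) links.

    Hypothetical helper, not MADlib's function. Returns two dicts:
    (authority scores, hub scores), each L2-normalized.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_links = defaultdict(list)  # page -> pages it links to
    in_links = defaultdict(list)   # page -> pages linking to it
    for src, dst in edges:
        out_links[src].append(dst)
        in_links[dst].append(src)

    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}

    for _ in range(max_iter):
        # Authority update: sum the hub scores of pages that link here.
        new_auth = {n: sum(hub[s] for s in in_links[n]) for n in nodes}
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        new_auth = {n: v / norm for n, v in new_auth.items()}

        # Hub update: sum the (new) authority scores of pages linked to.
        new_hub = {n: sum(new_auth[d] for d in out_links[n]) for n in nodes}
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        new_hub = {n: v / norm for n, v in new_hub.items()}

        # Stop once the scores have stopped changing.
        delta = sum(
            abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n])
            for n in nodes
        )
        auth, hub = new_auth, new_hub
        if delta < tol:
            break

    return auth, hub


if __name__ == "__main__":
    links = [("a", "c"), ("b", "c"), ("a", "d")]
    authority, hubs = hits(links)
    print("authority:", authority)
    print("hub:", hubs)
```

In this toy graph, page `c` is linked from both `a` and `b`, so it ends up with the highest authority score, while `a` links to the most authorities and becomes the strongest hub. An in-database version would express the same two sums as joins and aggregates over an edge table.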
Another improvement in the new release is better handling of k-nearest neighbors classification. k-NN in MADlib now supports more distance metrics, and can show the list of neighbors in the output table.
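The two k-NN changes are easy to picture with a small plain-Python sketch. This is an illustration under stated assumptions, not MADlib's code: the function `knn_classify`, the `METRICS` table and its metric names are hypothetical, and only two metrics are shown. It classifies a query point by majority vote and returns the neighbors it used alongside the label.

```python
from collections import Counter
from math import sqrt

# Hypothetical metric names for this sketch; MADlib's own names may differ.
METRICS = {
    "euclidean": lambda a, b: sqrt(sum((x - y) ** 2 for x, y in zip(a, b))),
    "manhattan": lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
}


def knn_classify(train, query, k=3, metric="euclidean"):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs. Returns (label, neighbors),
    where neighbors is the list of training-row indices, nearest first.
    """
    dist = METRICS[metric]
    # Rank every training row by its distance to the query point.
    ranked = sorted(range(len(train)), key=lambda i: dist(train[i][0], query))
    neighbors = ranked[:k]
    # Majority vote over the labels of the k nearest rows.
    votes = Counter(train[i][1] for i in neighbors)
    label = votes.most_common(1)[0][0]
    return label, neighbors


if __name__ == "__main__":
    training = [
        ((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"),
        ((6, 5), "b"), ((1, 0), "a"),
    ]
    print(knn_classify(training, (0.2, 0.1), k=3, metric="manhattan"))
```

Returning the neighbor indices with the prediction makes it possible to inspect why a point was classified the way it was, which is the practical value of having the neighbor list in the output.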
Grouping support has been added to MLP (multilayer perceptron), and the quality of results for correlation analysis has been improved: a NULL now causes only that value to be ignored, rather than the whole row that contains it.
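One way to read the correlation change is as a move from dropping whole rows to dropping only the affected values. The sketch below illustrates that reading in plain Python, with `None` standing in for NULL. It is an assumption about the behavior rather than MADlib's code, and the names `pearson` and `correlation_matrix` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two columns, skipping positions where
    either value is None. Returns None if it cannot be computed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlate every pair of named columns independently, so a None in
    one column does not remove that row from the other pairs."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


if __name__ == "__main__":
    data = {
        "a": [1, 2, 3, 4],
        "b": [2, 4, 6, 8],
        "c": [None, 1, None, 0],  # None plays the role of SQL NULL
    }
    for pair, value in correlation_matrix(data).items():
        print(pair, value)
```

Here the correlation between `a` and `b` uses all four rows, even though column `c` is missing values in two of them. If whole rows were discarded instead, only two rows would remain for every pair.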
Last Updated (Wednesday, 10 January 2018)