Apache DataSketches Reaches Top Level Status
Written by Kay Ewbank   
Thursday, 11 February 2021

Apache DataSketches has reached top-level project status. The data analysis software was originally developed at Yahoo and has been an Apache Incubator project for the last two years.

DataSketches is an open source, high-performance library of stochastic streaming algorithms commonly called "sketches" in the data sciences. Sketches are small, stateful programs that process massive data as a stream and can provide approximate answers, with mathematical guarantees, to computationally difficult queries orders-of-magnitude faster than traditional, exact methods.


The developers of DataSketches say such sketches are important for any system that needs to extract useful information from big data, and that sketches should be tightly integrated into the analysis capabilities of such systems. Sketches implement algorithms that can extract information from a stream of data in a single pass, also known as “one-touch” processing.

The DataSketches technology has helped Yahoo (Verizon Media) successfully reduce data processing times from days or hours to minutes or seconds on a number of its internal platforms. The DataSketches project is dedicated to providing a broad selection of sketch algorithms of production quality.

The usefulness of sketches comes down to the fact that businesses don't always need answers that are pinpoint accurate. If an approximate answer is acceptable, then sketch algorithms allow you to answer computationally difficult queries orders of magnitude faster, with much lower resource utilization.

Instead of requiring the data analysis system to keep enormous amounts of data on hand, sketches use small data structures that are usually kilobytes in size. Sketches are also streaming algorithms, in that they only need to see each incoming item once.
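To make the idea concrete, here is a minimal sketch of one classic technique, a K-Minimum-Values (KMV) distinct counter, written in plain Python. It is an illustration only and not the DataSketches API or its implementation: the class name `KMVSketch`, the parameter `k` and the choice of SHA-1 as the hash are all assumptions made for this example. It retains only the k smallest hash values it has seen, touches each item once, and estimates the number of distinct items from how tightly those k values are packed. The relative error of this kind of estimator shrinks roughly as 1/sqrt(k).

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-Minimum-Values distinct counter (hypothetical, not DataSketches)."""

    HASH_SPACE = 2 ** 64  # hashes are treated as integers in [0, 2**64)

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes, so _heap[0] is the largest retained hash
        self._retained = set() # retained hashes, used to ignore duplicates

    @staticmethod
    def _hash(item):
        # Assumption: the first 8 bytes of SHA-1 behave like a uniform 64-bit hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def update(self, item):
        """Process one item from the stream; each item is seen exactly once."""
        h = self._hash(item)
        if h in self._retained:
            return  # duplicate of a retained value, no state change
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._retained.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._retained.discard(evicted)
            self._retained.add(h)

    def estimate(self):
        """Return the estimated number of distinct items seen so far."""
        if len(self._heap) < self.k:
            return float(len(self._heap))  # exact while fewer than k distinct hashes
        kth_smallest = -self._heap[0]
        return (self.k - 1) * self.HASH_SPACE / kth_smallest


if __name__ == "__main__":
    sketch = KMVSketch(k=1024)
    for i in range(200_000):
        sketch.update(i % 50_000)  # 50,000 distinct values, each repeated
    print("approximate distinct count:", round(sketch.estimate()))
```

However long the stream is, the state never grows beyond k retained hash values, which is where the kilobyte-scale footprint comes from. Production sketches such as those in DataSketches add refinements (merging, set operations, compact serialization) that this example leaves out.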

The DataSketches library has been specifically designed for production systems that must process massive data. It includes adaptors for Apache Hive, Apache Pig, and PostgreSQL (C++), and these adaptors are designed to provide examples for adaptors for other systems. The sketches in this library are also designed to have compatible binary representations across languages (Java, C++, Python) and platforms.


More Information

DataSketches Website


Last Updated ( Thursday, 11 February 2021 )