Apache DataSketches Reaches Top Level Status
Written by Kay Ewbank   
Thursday, 11 February 2021

Apache DataSketches has graduated to top-level project status at the Apache Software Foundation. The data analysis software was originally developed at Yahoo and has spent the last two years as an Apache Incubator project.

DataSketches is an open source, high-performance library of stochastic streaming algorithms commonly called "sketches" in the data sciences. Sketches are small, stateful programs that process massive data sets as a stream and can provide approximate answers, with mathematical guarantees, to computationally difficult queries orders of magnitude faster than traditional, exact methods.


The developers of DataSketches say sketches are important for any system that needs to extract useful information from big data, and that they should be tightly integrated into the analysis capabilities of such systems. Sketches implement algorithms that can extract information from a stream of data in a single pass, also known as "one-touch" processing.
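To make the single-pass idea concrete, here is a minimal sketch in plain Python of one well-known technique, K-Minimum-Values (KMV) distinct counting. It is an illustration only and is not the DataSketches implementation or API: the class name `KMVSketch`, its methods, and the choice of SHA-256 as the hash are all assumptions made for this example. Each item is hashed to a number between 0 and 1, only the k smallest hashes are kept, and the number of distinct items is estimated from the k-th smallest value.

```python
import hashlib
import heapq


class KMVSketch:
    """Toy K-Minimum-Values sketch for approximate distinct counting.

    Hypothetical illustration; not part of the DataSketches library.
    """

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so heap[0] is the largest kept hash
        self._members = set()  # hashes currently kept, used to skip duplicates

    @staticmethod
    def _hash01(item):
        # Map the item to a pseudo-uniform float in [0, 1).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        """Process one item from the stream; each item is touched exactly once."""
        h = self._hash01(item)
        if h in self._members:
            return  # already retained, so a repeat changes nothing
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        """Return the approximate number of distinct items seen so far."""
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct items: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


sketch = KMVSketch(k=1024)
for i in range(200_000):
    sketch.update(f"user-{i % 50_000}")  # 50,000 distinct values, each repeated
print(round(sketch.estimate()))
```

The sketch never stores more than k hashes however long the stream is, and a larger k trades a little more memory for a tighter estimate.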

The DataSketches technology has helped Yahoo (Verizon Media) successfully reduce data processing times from days or hours to minutes or seconds on a number of its internal platforms. The DataSketches project is dedicated to providing a broad selection of sketch algorithms of production quality.

The usefulness of sketches comes down to the fact that businesses don't always need answers that are pinpoint accurate. If an approximate answer is acceptable, then sketch algorithms can answer such queries orders of magnitude faster, with much lower resource utilization.

Instead of requiring the data analysis system to keep enormous amounts of data on hand, sketches use small data structures that are usually only kilobytes in size. Sketches are also streaming algorithms, in that they only need to see each incoming item once.
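The fixed-footprint, see-each-item-once property can be illustrated with classic reservoir sampling, which keeps a uniform random sample of at most k items from a stream of any length. The code below is a minimal sketch in plain Python, not code from the DataSketches library; the class name `ReservoirSample` and its `seed` parameter are assumptions made for this example.

```python
import random


class ReservoirSample:
    """Toy reservoir sampler: a uniform sample of at most k stream items.

    Hypothetical illustration; not part of the DataSketches library.
    """

    def __init__(self, k, seed=None):
        self.k = k
        self.n = 0                       # number of items seen so far
        self.items = []                  # the sample, never longer than k
        self._rng = random.Random(seed)  # private generator, seedable for repeatability

    def update(self, item):
        """Process one item from the stream; it is never revisited."""
        self.n += 1
        if len(self.items) < self.k:
            self.items.append(item)      # still filling the reservoir
        else:
            # Keep the new item with probability k / n by picking a slot
            # uniformly from the n items seen so far.
            j = self._rng.randrange(self.n)
            if j < self.k:
                self.items[j] = item


sample = ReservoirSample(k=64, seed=1)
for value in range(1_000_000):
    sample.update(value)
print(len(sample.items), sample.n)
```

After a million updates the structure still holds only 64 items plus a counter, which is the sense in which a sketch stays small while the stream grows.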

The DataSketches library has been specifically designed for production systems that must process massive data. It includes adaptors for Apache Hive, Apache Pig, and PostgreSQL (C++), which also serve as examples for building adaptors for other systems. The sketches in the library are designed to have compatible binary representations across languages (Java, C++, Python) and platforms.


More Information

DataSketches Website

Related Articles

Apache Hive Adds Support For Set Operations

SQL At Hadoop Scale

Hadoop SQL Query Engine Launched

PostgreSQL Multi-Model Graph Extension Announced

Hive on Hadoop for MongoDB

DataFu for Pig and Hadoop

Google Data Studio Improves Analytics

 


Last Updated ( Thursday, 11 February 2021 )