Apache DataSketches Reaches Top Level Status
Written by Kay Ewbank   
Thursday, 11 February 2021

Apache DataSketches has reached top-level project status. The data analysis software was originally developed at Yahoo and has spent the last two years as an Apache Incubator project.

DataSketches is an open source, high-performance library of stochastic streaming algorithms commonly called "sketches" in the data sciences. Sketches are small, stateful programs that process massive data as a stream and can provide approximate answers, with mathematical guarantees, to computationally difficult queries orders-of-magnitude faster than traditional, exact methods.


The developers of DataSketches say such sketches are important for any system that needs to extract useful information from big data, and that sketches should be tightly integrated into the analysis capabilities of such systems. Sketches implement algorithms that can extract information from a stream of data in a single pass, also known as "one-touch" processing.
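To make "one-touch" processing concrete, here is a minimal sketch of reservoir sampling, a classic single-pass technique that keeps a fixed-size uniform random sample of a stream of unknown length. It is written with the Python standard library only and is not the DataSketches implementation; the function name reservoir_sample and its parameters are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Illustrative only: each item is looked at exactly once, and memory use
    stays at k items no matter how long the stream is.
    """
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the n-th item with probability k / n,
            # replacing a uniformly chosen earlier item.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # A generator can only be consumed once, which mimics a real stream.
    events = (f"event-{i}" for i in range(1_000_000))
    print(reservoir_sample(events, 5))
```

Because the input is a generator, the function cannot go back and re-read earlier items, which is the constraint the single-pass description refers to.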

The DataSketches technology has helped Yahoo (Verizon Media) successfully reduce data processing times from days or hours to minutes or seconds on a number of its internal platforms. The DataSketches project is dedicated to providing a broad selection of sketch algorithms of production quality.

The usefulness of sketches comes down to the fact that businesses don't always need answers that are pinpoint accurate. If an approximate answer is acceptable, sketch algorithms can answer queries orders of magnitude faster, with much lower resource utilization.

Instead of requiring the data analysis system to keep enormous volumes of data on hand, sketches maintain small data structures that are usually only kilobytes in size. Sketches are also streaming algorithms, in that they only need to see each incoming item once.
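As an illustration of how a small structure can stand in for a large data set, the following is a minimal K-Minimum-Values (KMV) sketch for approximate distinct counting, written with the Python standard library only. It is a teaching sketch under stated assumptions, not the DataSketches implementation: the class name KMVSketch, its update and estimate methods, and the default k of 1024 are all hypothetical choices made for this example.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-Minimum-Values sketch (not the DataSketches code).

    It retains only the k smallest 64-bit hash values seen so far, so its
    size is bounded by k regardless of how many items are streamed in.
    """

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes: a max-heap of the k smallest
        self._members = set()  # the same hashes, for fast duplicate checks

    @staticmethod
    def _hash(item):
        # Map the item to a 64-bit integer that behaves like a uniform draw.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def update(self, item):
        """Process one stream item; each item only needs to be seen once."""
        h = self._hash(item)
        if h in self._members:
            return  # hash already retained, so this is a repeat
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # The new hash is smaller than the largest retained one: swap.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        """Approximate number of distinct items seen so far."""
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct: exact
        kth_smallest = -self._heap[0] / 2**64  # scaled into [0, 1)
        return (self.k - 1) / kth_smallest


if __name__ == "__main__":
    sketch = KMVSketch(k=1024)
    for i in range(200_000):
        sketch.update(f"user-{i % 50_000}")  # 50,000 distinct values
    print(round(sketch.estimate()))
```

With k = 1024 the sketch holds about a thousand 64-bit values, which is on the order of kilobytes, and the relative standard error of a KMV estimate is roughly 1/sqrt(k), or about 3 percent here. Raising k trades memory for accuracy.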

The DataSketches library has been specifically designed for production systems that must process massive data. It includes adaptors for Apache Hive, Apache Pig, and PostgreSQL (C++), which are also intended to serve as examples for building adaptors for other systems. The sketches in this library are designed to have compatible binary representations across languages (Java, C++, Python) and platforms.


More Information

DataSketches Website


 


Last Updated ( Thursday, 11 February 2021 )