Apache DataSketches Reaches Top Level Status
Written by Kay Ewbank   
Thursday, 11 February 2021

Apache DataSketches has reached top-level project status. The data analysis software was originally developed at Yahoo, and has been an Apache incubator project for the last two years.

DataSketches is an open source, high-performance library of stochastic streaming algorithms commonly called "sketches" in the data sciences. Sketches are small, stateful programs that process massive data as a stream and can provide approximate answers, with mathematical guarantees, to computationally difficult queries orders of magnitude faster than traditional, exact methods.


The developers of DataSketches say such sketches are important for any system that needs to extract useful information from big data, and that sketches should be tightly integrated into the analysis capabilities of such systems. Sketches implement algorithms that can extract information from a stream of data in a single pass, also known as "one-touch" processing.
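To make the idea of single-pass processing concrete, here is a minimal sketch of reservoir sampling, a generic textbook streaming technique that keeps a uniform random sample of fixed size while looking at each item exactly once. It is an illustration only, not the DataSketches implementation; the function name `reservoir_sample` and its parameters are assumptions made for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    Illustrative only: the name and signature are hypothetical and are not
    part of the DataSketches API. Each item is examined exactly once and
    only k items are ever held in memory.
    """
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if len(sample) < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the n-th item with probability k / n by picking a slot
            # uniformly from [0, n) and replacing only if it falls in [0, k).
            j = rng.randrange(n)
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    # The stream is consumed lazily, so it never has to fit in memory.
    events = (f"event-{i}" for i in range(1_000_000))
    print(reservoir_sample(events, k=5))
```

The point of the example is the shape of the loop: state of fixed size, one update per incoming item, and no need to revisit earlier data.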

The DataSketches technology has helped Yahoo (Verizon Media) successfully reduce data processing times from days or hours to minutes or seconds on a number of its internal platforms. The DataSketches project is dedicated to providing a broad selection of sketch algorithms of production quality.

The usefulness of sketches comes down to the fact that businesses don't always need answers that are pinpoint accurate. If an approximate answer is acceptable, then sketch algorithms allow you to answer these queries orders of magnitude faster, with much lower resource utilization.

Instead of requiring the data analysis system to keep enormous volumes of data on hand, sketches use small data structures that are usually kilobytes in size. Sketches are also streaming algorithms, in that they only need to see each incoming item once.
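As a rough illustration of how a small, fixed-size structure can answer a question about a much larger stream, the following is a minimal K-Minimum Values (KMV) sketch for approximate distinct counting. It is a simplified teaching example written for this article, not the DataSketches code: the class name `KMVSketch`, the parameter `k`, and the choice of BLAKE2b as the hash function are all assumptions.

```python
import hashlib
import heapq


class KMVSketch:
    """Approximate distinct counter that retains only the k smallest hashes.

    Illustrative only: this class is hypothetical and does not mirror the
    DataSketches API or its binary format.
    """

    HASH_SPACE = 2 ** 64

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes, so _heap[0] is the largest kept
        self._members = set()  # hashes currently retained, to skip duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a 64-bit integer that behaves like a uniform draw.
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def update(self, item):
        """Process one stream item; the item itself is never stored."""
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest retained one: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        """Estimate the number of distinct items seen so far."""
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct: exact
        kth_smallest = max(-self._heap[0], 1) / self.HASH_SPACE
        return (self.k - 1) / kth_smallest


if __name__ == "__main__":
    sketch = KMVSketch(k=1024)
    for i in range(200_000):
        sketch.update(f"user-{i % 50_000}")  # 50,000 distinct values, repeated
    print(round(sketch.estimate()))
```

However long the stream is, the sketch holds at most `k` hash values, which for `k = 1024` is on the order of 8 KB of 64-bit hashes in a compact implementation. For this estimator the relative standard error is approximately 1/sqrt(k - 2), or roughly 3 percent at `k = 1024`; a larger `k` trades memory for accuracy.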

The DataSketches library has been specifically designed for production systems that must process massive data. It includes adaptors for Apache Hive, Apache Pig, and PostgreSQL (C++), which are also intended to serve as examples for building adaptors for other systems. The sketches in this library are designed to have compatible binary representations across languages (Java, C++, Python) and platforms.


More Information

DataSketches Website

Related Articles

Apache Hive Adds Support For Set Operations

SQL At Hadoop Scale

Hadoop SQL Query Engine Launched

PostgreSQL Multi-Model Graph Extension Announced

Hive on Hadoop for MongoDB

DataFu for Pig and Hadoop

Google Data Studio Improves Analytics



Last Updated ( Thursday, 11 February 2021 )