New Database For Data Scientists
Written by Kay Ewbank   
Wednesday, 20 November 2019

A new database has been released that is designed to help data science teams make faster discoveries by giving them a more powerful way to store, update, analyze, and share large sets of diverse data.

TileDB consists of a new multi-dimensional array data format, a fast, embeddable, open-source C++ storage engine with data science tooling integrations, and a cloud service for easy data management and serverless computations.


The developers say traditional databases aren't ideal for data science use because they're not cloud-optimized, while cloud object stores suffer from object immutability, eventual consistency, and IO request limiting. A second problem is that some formats lack sufficient support for efficient data updates. They give the example of Parquet, where updating a file requires creating a new file, which pushes the entire update logic to the user’s higher-level application. They say similar problems arise whenever the update logic is delegated to higher-level applications rather than built into the format and storage engine.

Finally, the developers cite limited scope as a problem, on the basis that most data science applications require at least two separate file formats: one for multi-dimensional arrays, used for tasks such as linear algebra, and one for dataframes, used for OLAP operations.

The team started with the storage layer when creating TileDB, and says it has the only format and storage engine that handles both dense and sparse multi-dimensional arrays. It supports efficient array IO on multiple storage backends, including cloud object stores like AWS S3. It also offers rapid, highly parallel, lock-free batch updates that are designed to work particularly well on the cloud with immutable objects. All update logic and functionality (such as time traveling) is built into the format and storage engine.
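To make the idea of batch updates on immutable objects more concrete, here is a minimal Python sketch. It is a toy model, not TileDB's format or API: the `Fragment` and `ToyArray` names, the integer timestamps and the `at` parameter are all assumptions made for illustration. It shows one way an engine could append each batch as a new immutable object and still answer reads "as of" an earlier time.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int]


@dataclass(frozen=True)
class Fragment:
    """One immutable batch of cell writes, stamped with a logical time (hypothetical)."""
    timestamp: int
    cells: Dict[Coord, float]


@dataclass
class ToyArray:
    """Hypothetical sparse 2D array built from append-only fragments."""
    fragments: List[Fragment] = field(default_factory=list)

    def write(self, timestamp: int, cells: Dict[Coord, float]) -> None:
        # An update never modifies earlier fragments; it only appends a new one.
        self.fragments.append(Fragment(timestamp, dict(cells)))

    def read(self, at: Optional[int] = None) -> Dict[Coord, float]:
        # Merge fragments in timestamp order so that later writes win.
        # Passing `at` ignores fragments newer than that time ("time travel").
        result: Dict[Coord, float] = {}
        for frag in sorted(self.fragments, key=lambda f: f.timestamp):
            if at is None or frag.timestamp <= at:
                result.update(frag.cells)
        return result


arr = ToyArray()
arr.write(1, {(0, 0): 1.0, (2, 3): 5.0})   # first batch
arr.write(2, {(2, 3): 7.0, (4, 4): 9.0})   # second batch overwrites cell (2, 3)

latest = arr.read()        # view including both batches
as_of_1 = arr.read(at=1)   # view as it was after the first batch only
print(latest)
print(as_of_1)
```

Because nothing is rewritten in place, a design along these lines suits object stores where objects cannot be modified after they are written, which is the constraint the developers describe.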

TileDB offers a standalone, embeddable C++ library that ships with APIs in C, C++, Python, R, Java and Go, and has direct access to TileDB arrays. The library is integrated with Spark, Dask, PrestoDB, MariaDB, Arrow and geospatial libraries like PDAL, GDAL and Rasterio. TileDB pushes down as much computation as possible to storage, such as filter conditions from the SQL engines and dataframe computations from Dask and Spark.
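The benefit of pushing a filter condition down to storage can be illustrated with a small Python sketch. This is a generic illustration and makes no claim about how TileDB implements push-down: the tile size, the per-tile min/max metadata and both function names are assumptions chosen for the example.

```python
from typing import List, Tuple

# Hypothetical storage layout: 16 values grouped into tiles of 4 cells each.
TILE_SIZE = 4
values = list(range(16))
tiles: List[List[int]] = [
    values[i:i + TILE_SIZE] for i in range(0, len(values), TILE_SIZE)
]

# Assumed per-tile metadata (min, max), computed once when the data is written.
tile_bounds: List[Tuple[int, int]] = [(min(t), max(t)) for t in tiles]


def scan_then_filter(lo: int, hi: int) -> Tuple[List[int], int]:
    """No push-down: every cell is fetched, then filtered by the caller."""
    fetched: List[int] = []
    cells_read = 0
    for tile in tiles:
        fetched.extend(tile)
        cells_read += len(tile)
    return [v for v in fetched if lo <= v <= hi], cells_read


def scan_with_pushdown(lo: int, hi: int) -> Tuple[List[int], int]:
    """Push-down: storage receives the range and skips tiles that cannot match."""
    matched: List[int] = []
    cells_read = 0
    for tile, (tmin, tmax) in zip(tiles, tile_bounds):
        if tmax < lo or tmin > hi:
            continue  # the metadata shows no cell in this tile can satisfy the filter
        cells_read += len(tile)
        matched.extend(v for v in tile if lo <= v <= hi)
    return matched, cells_read


plain_result, plain_cells = scan_then_filter(5, 6)
pushed_result, pushed_cells = scan_with_pushdown(5, 6)
print(plain_result, plain_cells)
print(pushed_result, pushed_cells)
```

Both scans return the same values, but the second touches only the one tile whose range overlaps the filter, which is the kind of saving push-down is meant to deliver.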

Alongside the database is TileDB Cloud, a service with pay-as-you-use pricing that lets you share TileDB arrays on the cloud with other users and perform serverless computations on them. Both TileDB and TileDB Cloud are available to try now.


More Information

TileDB Homepage

Related Articles

Databricks Delta Adds Faster Parquet Import

Databricks Delta Lake Now Open Source

RAPIDS GPU Data Analysis Platform Launched 

Last Updated ( Wednesday, 20 November 2019 )