Dremio 3.0 Adds Data Catalog
Written by Kay Ewbank   
Tuesday, 13 November 2018

There's a new version of Dremio, an open-source project designed to give business analysts and data scientists a way to explore and analyze data no matter what its structure or size. New in this release are a data catalog, prioritized workload management, and Kubernetes support.

The developers of Dremio describe it as a data virtualization platform. The software is based on Apache Arrow, Apache Parquet, and Apache Calcite, and the company behind Dremio is a major contributor to Arrow. Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data. Apache Parquet offers similar features for file-based storage, and Apache Calcite is used for SQL parsing and query optimization.
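To make the idea of a columnar layout concrete, here is a minimal sketch in plain Python. It doesn't use Arrow itself: ordinary lists stand in for Arrow's buffers, and the sample records and the `to_columns` helper are invented purely for illustration.

```python
# Toy illustration only: plain Python lists stand in for Arrow buffers.
rows = [
    {"city": "Oslo", "sales": 120},
    {"city": "Lima", "sales": 80},
    {"city": "Pune", "sales": 200},
]


def to_columns(records):
    """Pivot row-oriented records into one list per field (a columnar layout)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field now reads a single list of values
# instead of visiting every field of every record.
total_sales = sum(columns["sales"])
print(columns)
print(total_sales)
```

The point of the layout is that analytic queries, which typically touch a few fields across many records, only need to read the columns they use.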

Dremio builds Arrow-based structures called Reflections. These are optimized copies of data based on queries against data sources. Dremio also has a query optimizer that uses Apache Arrow to work out the best representation of data to make the query faster. This might mean that a query against an Elasticsearch cluster (for example) would use the Arrow representation of the data instead.
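The substitution idea can be sketched roughly as follows. This is not Dremio's planner: the `Reflection` class, the `choose_plan` function, the dataset names, and the "narrowest covering copy" rule are all assumptions made for the sake of a small runnable example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reflection:
    """Hypothetical stand-in for an optimized copy of a source dataset."""
    name: str
    source: str
    columns: frozenset


def choose_plan(source, needed_columns, reflections):
    """Return the name of a reflection that covers the query, else the source itself."""
    needed = frozenset(needed_columns)
    candidates = [
        r for r in reflections
        if r.source == source and needed <= r.columns
    ]
    if not candidates:
        return source
    # Assumption: the copy with the fewest columns is the cheapest to scan.
    return min(candidates, key=lambda r: len(r.columns)).name


reflections = [
    Reflection("orders_wide", "elasticsearch.orders",
               frozenset({"id", "region", "amount", "ts"})),
    Reflection("orders_by_region", "elasticsearch.orders",
               frozenset({"region", "amount"})),
]

covered = choose_plan("elasticsearch.orders", ["region", "amount"], reflections)
uncovered = choose_plan("elasticsearch.orders", ["customer"], reflections)
print(covered)
print(uncovered)
```

In this toy version the user always names the original source; whether an optimized copy is used instead is decided behind the scenes, which is the behavior the article describes.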

Dremio also has a built-in SQL-based query language that provides features similar to those of cost-based optimizers such as Spark SQL, with Reflections taking the idea further by providing an optimized copy of the data.

The new version of Dremio adds a data catalog with the idea that users will be able to carry out a simple Google-like search to find datasets. Under the covers, Dremio administrators tag datasets to organize them so they can be discovered by data consumers. The catalog includes built-in wiki pages where information can be stored, such as whom to ask questions, how often the data is updated, which sources of data make up the dataset, and screenshots of reports and visualizations that use the dataset.
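As a rough illustration of how tags and wiki text can make datasets searchable, here is a small sketch. The `CatalogEntry` class, the `search` function, and the sample entries are hypothetical and say nothing about how Dremio's catalog is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record: a dataset name plus its tags and wiki text."""
    name: str
    tags: set = field(default_factory=set)
    wiki: str = ""


def search(catalog, query):
    """Return names of entries whose name, tags, or wiki text contain every query term."""
    terms = query.lower().split()
    hits = []
    for entry in catalog:
        haystack = " ".join([entry.name, *sorted(entry.tags), entry.wiki]).lower()
        if all(term in haystack for term in terms):
            hits.append(entry.name)
    return hits


catalog = [
    CatalogEntry("sales.orders", {"finance", "daily"},
                 "Owner: data team. Refreshed nightly."),
    CatalogEntry("web.clicks", {"marketing"},
                 "Raw clickstream, updated hourly."),
]

print(search(catalog, "finance"))
print(search(catalog, "updated hourly"))
```

A real search would rank results and tolerate misspellings; this version only does case-insensitive substring matching.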

This release also includes support for Gandiva, a new execution kernel for Arrow that is based on LLVM. Gandiva provides performance improvements for low-level operations on Arrow buffers. The developers say that, in the right circumstances, using Gandiva can improve query performance dramatically; some early testers have reported improvements of over 70x.

Security has been improved with native integration with Apache Ranger for centralized access control. In addition, Dremio 3.0 now supports end-to-end TLS encryption.

New multi-tenant workload controls have been added so that administrators can control resource allocation based on user, group membership, time of day, data source, and query type using standard SQL.
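The kind of rule-based routing this implies can be sketched in a few lines. In Dremio the rules are written in SQL; the Python below is only a stand-in, and the `Query` fields, queue names, business-hours window, and first-match-wins ordering are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Hypothetical description of an incoming query."""
    user: str
    groups: frozenset
    hour: int        # hour of day, 0-23
    query_type: str  # illustrative labels such as "batch" or "interactive"


# Ordered (predicate, queue) pairs; the first matching rule wins.
RULES = [
    (lambda q: "executives" in q.groups, "high_priority"),
    (lambda q: q.query_type == "batch" and not 8 <= q.hour < 18, "overnight_batch"),
    (lambda q: q.query_type == "batch", "low_priority"),
]


def assign_queue(query, rules=RULES, default="default"):
    """Return the queue of the first rule whose predicate matches the query."""
    for predicate, queue in rules:
        if predicate(query):
            return queue
    return default


night_batch = Query("bob", frozenset({"analysts"}), 23, "batch")
print(assign_queue(night_batch))
```

Ordering the rules matters: a batch query from an executive lands in the high-priority queue here because that rule is checked first.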

The Kubernetes support comes via an official Docker image and templates for elastic, highly available deployments using the Kubernetes orchestration framework.

Elsewhere, there's a new declarative engine for relational database sources that is designed to provide more efficient processing on systems such as Postgres, SQL Server, Oracle, and Teradata; and support for new data sources including Azure Data Lake Store, Elasticsearch 6, AWS S3 GovCloud, and Teradata.

More Information

Dremio Website

Last Updated ( Tuesday, 13 November 2018 )