Spark BI Gets Fine Grain Security

Written by Kay Ewbank

Friday, 06 January 2017

There's a new technique for adding fine grain security when using Apache Hive and Spark to work with large data sets.

Spark lets you use SQL expressions on data in Hive, but until now authorization has required HDFS ACLs, which lack the granularity needed with columnar data. The ideal solution would be for Spark itself to recognize and respond to fine grain security settings. An alternative is to use an external daemon that can interact with schema level security settings.

LLAP (Live Long and Process, groan) is a collection of long lived daemons that works in tandem with the HDFS Data Node service, and provides the ability to interact with schema based security. LLAP was introduced in Hive 2. It is a hybrid execution model that offers benefits such as caching of columnar data, JIT-friendly operator pipelines, and reduced overhead for multiple queries (including concurrent queries).

Explaining the technique on the HortonWorks blog, Vadim Vaks said that with LLAP enabled, Spark reads from HDFS directly through LLAP, meaning the only other element needed is a centralized authorization system, which can be provided by Apache Ranger. Ranger provides centralized authorization and audit services for many components that run on Yarn or rely on data from HDFS, including HDFS, Yarn, Hive (Spark with LLAP), HBase, Kafka, Storm, Solr, Atlas and Knox. Vaks says:

"Each of the above services integrate with Ranger via a plugin that pulls the latest security policies, caches them, and then applies them at run time."

So Spark receives the query statement and communicates with Hive to obtain the relevant schemas and query plan. The Ranger Hive plugin is then used to check the cached security policy and inform Spark which columns it can access.
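The pull, cache and apply cycle described above can be illustrated with a small, self-contained sketch. This is not the Ranger plugin API: `ColumnPolicy`, `PolicyPlugin`, `fetch_demo_policies` and the 30 second refresh interval are all hypothetical names and values chosen for illustration, and the policy store is faked with a local function.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnPolicy:
    """Hypothetical policy: the columns of a table that a user may read."""
    table: str
    user: str
    allowed_columns: frozenset


class PolicyPlugin:
    """Toy stand-in for a plugin that pulls, caches and applies policies."""

    def __init__(self, fetch_policies, refresh_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch_policies          # callable returning ColumnPolicy objects
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._cache = {}
        self._loaded_at = None

    def _refresh_if_stale(self):
        # Pull the latest policies only when the cached copy is too old.
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._refresh_seconds:
            self._cache = {(p.table, p.user): p.allowed_columns for p in self._fetch()}
            self._loaded_at = now

    def allowed_columns(self, user, table, requested):
        """Split the requested columns into (permitted, denied) lists."""
        self._refresh_if_stale()
        allowed = self._cache.get((table, user), frozenset())
        permitted = [c for c in requested if c in allowed]
        denied = [c for c in requested if c not in allowed]
        return permitted, denied


def fetch_demo_policies():
    # Assumption: stands in for a call to the central policy store.
    return [ColumnPolicy("customers", "analyst", frozenset({"id", "region"}))]


plugin = PolicyPlugin(fetch_demo_policies)
print(plugin.allowed_columns("analyst", "customers", ["id", "region", "ssn"]))
```

Checking against a local cache keeps the decision off the network path for each query, at the cost of policies being slightly out of date between refreshes.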

Apache Ranger provides a centralized security framework to manage fine grained access control over Hadoop and related components (Apache Hive, HBase etc.). The Ranger plugin sits in the path of the user request and is able to make a decision on whether the user request should be authorized. The plugin also collects access request details required for auditing.

Once Spark has been told which columns it can access, it then uses LLAP to read from the filesystem. LLAP deals with any filtering or masking, and if the query contains requests for columns that aren't authorized, LLAP stops processing the request and throws an Authorization exception to Spark. If masking is used, the restricted columns are still returned, but they contain only asterisks or a hash of the original value.
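As a rough illustration of the two outcomes described here, the sketch below rejects a read that touches an unauthorized column and masks a restricted one with either asterisks or a hash. It is a plain Python simulation, not LLAP or Ranger code: `AuthorizationError`, `read_rows`, `mask_stars` and `mask_hash` are hypothetical names, and SHA-256 is an assumed choice of hash.

```python
import hashlib


class AuthorizationError(Exception):
    """Raised when a query asks for a column the user may not read."""


def mask_stars(value):
    # Replace every character of the value with an asterisk.
    return "*" * len(str(value))


def mask_hash(value):
    # Replace the value with a SHA-256 digest (assumed hash function).
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def read_rows(rows, requested, allowed, masks=None):
    """Return the requested columns, masking where a mask is configured."""
    masks = masks or {}
    denied = [c for c in requested if c not in allowed]
    if denied:
        # Stop before reading any data, as the article describes.
        raise AuthorizationError("not authorized for columns: " + ", ".join(denied))
    result = []
    for row in rows:
        out = {}
        for col in requested:
            mask = masks.get(col)
            out[col] = mask(row[col]) if mask else row[col]
        result.append(out)
    return result


rows = [{"id": 1, "name": "Ada", "ssn": "123-45-6789"}]
print(read_rows(rows, ["id", "ssn"], {"id", "name", "ssn"}, {"ssn": mask_stars}))
```

Calling `read_rows(rows, ["id", "ssn"], {"id"})` instead raises `AuthorizationError`, since `ssn` is not in the allowed set.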

Ranger can also be used to offer row level security, so a query will return only the rows that a user has permission to see. As Vaks explains, a row level policy from Ranger would instruct Hive to return a query plan that includes a predicate that filters unauthorised rows. Spark receives the modified query plan and initiates processing, reading data through LLAP. LLAP ensures that the predicate is applied and that the restricted rows are not returned.
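The idea of a query plan that carries an extra filtering predicate can be sketched as follows. This is a simplified model under stated assumptions, not Hive's or Ranger's implementation: `QueryPlan`, `apply_row_policy`, `execute` and the sample `sales` data are hypothetical, and predicates are modelled as plain Python functions rather than SQL expressions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QueryPlan:
    """Hypothetical plan: a table, the columns to project, an optional filter."""
    table: str
    columns: list
    predicate: Optional[Callable] = None


def apply_row_policy(plan, user, row_policies):
    """Return a plan whose predicate also enforces the user's row filter."""
    row_filter = row_policies.get((plan.table, user))
    if row_filter is None:
        return plan
    original = plan.predicate
    if original is None:
        combined = row_filter
    else:
        # Both the user's own filter and the policy filter must hold.
        combined = lambda row: original(row) and row_filter(row)
    return QueryPlan(plan.table, plan.columns, combined)


def execute(plan, rows):
    # Apply the predicate first, then project the requested columns.
    keep = plan.predicate or (lambda row: True)
    return [{c: row[c] for c in plan.columns} for row in rows if keep(row)]


row_policies = {("sales", "emea_analyst"): lambda row: row["region"] == "EMEA"}
rows = [
    {"id": 1, "region": "EMEA", "amount": 120},
    {"id": 2, "region": "APAC", "amount": 300},
    {"id": 3, "region": "EMEA", "amount": 40},
]
plan = QueryPlan("sales", ["id", "amount"], lambda row: row["amount"] > 50)
secured = apply_row_policy(plan, "emea_analyst", row_policies)
print(execute(secured, rows))
```

Because the policy filter is folded into the plan before execution, the caller never sees the restricted rows, even though the query itself did not mention the `region` column.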

More Information

HortonWorks Blog Post On LLAP

LLAP On Hive Wiki

Spark LLAP Tutorial

Apache Ranger FAQ
