|Spark BI Gets Fine Grain Security|
|Written by Kay Ewbank|
|Friday, 06 January 2017|
There's a new technique for adding fine grain security when using Apache Hive and Spark to work with large data sets.
Spark lets you use SQL expressions on data in Hive, but authorization has until now required HDFS ACLs, which lack the granularity needed for columnar data. While the ideal solution would be for Spark itself to recognize and enforce fine grain security settings, one alternative is an external daemon that can interact with schema-level security settings.
LLAP (Live Long and Process, groan) is a collection of long-lived daemons that works in tandem with the HDFS DataNode service and provides the ability to interact with schema-based security. LLAP was introduced in Hive 2. It is a hybrid execution model that offers benefits such as caching of columnar data, JIT-friendly operator pipelines, and reduced overhead for multiple queries, including concurrent queries.
Explaining the technique on the Hortonworks blog, Vadim Vaks said that with LLAP enabled, Spark reads from HDFS directly through LLAP, meaning the only other element needed is a centralized authorization system, which can be provided by Apache Ranger. Ranger provides centralized authorization and audit services for many components that run on Yarn or rely on data from HDFS, including HDFS, Yarn, Hive (Spark with LLAP), HBase, Kafka, Storm, Solr, Atlas and Knox. Vaks says:
"Each of the above services integrate with Ranger via a plugin that pulls the latest security policies, caches them, and then applies them at run time."
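That pull-cache-apply cycle can be pictured with a toy sketch. This is not Ranger's actual plugin API; the class and names below are invented purely to illustrate the behaviour Vaks describes, assuming a simple user-to-columns policy model:

```python
import time

# Hypothetical, much-simplified model of a Ranger-style plugin that
# pulls the latest policies, caches them, and applies them at run time.
class PolicyCache:
    def __init__(self, refresh_interval=30):
        self.policies = {}          # user -> set of columns the user may read
        self.last_refresh = 0.0
        self.refresh_interval = refresh_interval

    def refresh(self, fetch_policies):
        """Pull the latest security policies if the cache is stale."""
        now = time.time()
        if now - self.last_refresh >= self.refresh_interval:
            self.policies = fetch_policies()
            self.last_refresh = now

    def allowed_columns(self, user, requested):
        """Apply the cached policy: return only columns the user may read."""
        granted = self.policies.get(user, set())
        return [c for c in requested if c in granted]

# Example policy: an analyst sees non-sensitive columns, an admin sees all.
cache = PolicyCache()
cache.refresh(lambda: {"analyst": {"name", "region"},
                       "admin": {"name", "region", "ssn"}})
print(cache.allowed_columns("analyst", ["name", "ssn", "region"]))
# ['name', 'region']
```

In the real system the fetch step is the plugin polling the Ranger admin service, so authorization decisions are made locally against the cached policy rather than on every request.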
So Spark receives the query statement and communicates with Hive to obtain the relevant schemas and query plan. The Ranger Hive plugin is then used to check the cached security policy and inform Spark which columns it can access.
Apache Ranger provides a centralized security framework to manage fine grained access control over Hadoop and related components (Apache Hive, HBase etc.). The Ranger plugin sits in the path of the user request and is able to make a decision on whether the user request should be authorized. The plugin also collects access request details required for auditing.
Once Spark has been told which columns it can access, it then uses LLAP to read from the filesystem. LLAP deals with any filtering or masking, and if the query contains requests for columns that aren't authorized, LLAP stops processing the request and throws an Authorization exception to Spark. If masking is used, the restricted columns are returned but containing only asterisks or a hash of the original value.
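The two masking behaviours mentioned, asterisks or a hash of the original value, are easy to illustrate. The function names below are invented for the example and are not part of LLAP or Ranger:

```python
import hashlib

def mask_redact(value):
    # Replace every character of the original value with an asterisk.
    return "*" * len(str(value))

def mask_hash(value):
    # Return a SHA-256 hash of the original value instead of the value.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()

row = {"name": "Alice", "ssn": "123-45-6789"}
masked = {**row, "ssn": mask_redact(row["ssn"])}
print(masked["ssn"])   # ***********
```

Either way the column is still present in the result set, so queries keep working, but the restricted values themselves never reach the user.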
Ranger can also be used to offer row level security, so a query will return only the rows that a user has permission to see. As Vaks explains, a row level policy from Ranger would instruct Hive to return a query plan that includes a predicate that filters unauthorized rows. Spark receives the modified query plan and initiates processing, reading data through LLAP. LLAP ensures that the predicate is applied and that the restricted rows are not returned.
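The effect of such an injected predicate can be sketched in miniature. The data and the policy below are invented; the point is simply that the filter, equivalent to an added WHERE clause, runs before any rows are returned:

```python
# Toy illustration of row level security: a row policy adds a predicate
# to the query plan, and only matching rows come back to the caller.
rows = [
    {"region": "EU", "sales": 100},
    {"region": "US", "sales": 250},
    {"region": "EU", "sales": 75},
]

# Hypothetical policy for this user: may only see EU rows
# (behaves like an injected "WHERE region = 'EU'").
row_policy = lambda r: r["region"] == "EU"

visible = [r for r in rows if row_policy(r)]
print(len(visible))  # 2
```

Because the predicate is part of the plan itself, the restricted rows are filtered during execution rather than stripped out afterwards.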