There's a new technique for adding fine-grained security when using Apache Hive and Spark to work with large data sets.
Spark lets you use SQL expressions on data in Hive, but until now authorization has required you to use HDFS ACLs, which apply to whole files and directories and so lack the granularity needed with columnar data. The ideal solution would be for Spark itself to recognize and respond to fine-grained security settings. One alternative is to use an external daemon that can interact with schema-level security settings.
LLAP (Live Long and Process, groan) is a collection of long-lived daemons that work in tandem with the HDFS DataNode service and provide the ability to interact with schema-based security. LLAP was introduced in Hive 2. It is a hybrid execution model that offers benefits such as caching of columnar data, JIT-friendly operator pipelines, and reduced overhead for multiple queries (including concurrent queries).
Explaining the technique on the Hortonworks blog, Vadim Vaks said that with LLAP enabled, Spark reads from HDFS through LLAP rather than accessing the files itself, meaning the only other element needed is a centralized authorization system, which can be provided by Apache Ranger. Ranger provides centralized authorization and audit services for many components that run on YARN or rely on data from HDFS, including HDFS, YARN, Hive (Spark with LLAP), HBase, Kafka, Storm, Solr, Atlas and Knox. Vaks says:
"Each of the above services integrate with Ranger via a plugin that pulls the latest security policies, caches them, and then applies them at run time."
So Spark receives the query statement and communicates with Hive to obtain the relevant schemas and query plan. The Ranger Hive plugin is then used to check the cached security policy and inform Spark which columns it can access.
Apache Ranger provides a centralized security framework to manage fine-grained access control over Hadoop and related components (Apache Hive, HBase etc.). The Ranger plugin sits in the path of the user request and decides whether the request should be authorized. The plugin also collects the access request details required for auditing.
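To make the "pull, cache, apply" behaviour concrete, here is a minimal Python sketch of a plugin that refreshes a local policy cache, answers authorization checks from it, and records each request for auditing. It is an illustration only, not the Ranger plugin API: the class `PolicyPlugin`, the `fetch` callback, the `ttl_seconds` parameter and the policy shape are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical policy shape: (user, table) -> set of readable columns.
Policies = Dict[Tuple[str, str], Set[str]]


class PolicyPlugin:
    """Toy stand-in for an authorization plugin; not the Ranger API."""

    def __init__(self, fetch: Callable[[], Policies], ttl_seconds: float = 30.0):
        self._fetch = fetch              # pulls the latest policies from the central service
        self._ttl = ttl_seconds          # how long a cached copy is trusted
        self._cached: Policies = {}
        self._loaded_at = float("-inf")  # forces a fetch on first use
        self.audit_log: List[dict] = []

    def _policies(self) -> Policies:
        # Refresh the local cache only when it is older than the TTL.
        if time.monotonic() - self._loaded_at > self._ttl:
            self._cached = self._fetch()
            self._loaded_at = time.monotonic()
        return self._cached

    def is_allowed(self, user: str, table: str, columns: Set[str]) -> bool:
        # Allow the request only if every requested column is readable.
        allowed = self._policies().get((user, table), set())
        decision = columns <= allowed
        # Record the request details, whether or not it was allowed.
        self.audit_log.append(
            {"user": user, "table": table,
             "columns": sorted(columns), "allowed": decision}
        )
        return decision
```

Because decisions are answered from the cached copy, a check does not need a round trip to the central service on every query; the trade-off in this sketch is that a policy change only takes effect after the next refresh.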
Once Spark has been told which columns it can access, it uses LLAP to read from the filesystem. LLAP deals with any filtering or masking, and if the query requests columns that aren't authorized, LLAP stops processing the request and throws an authorization exception to Spark. If masking is used, the restricted columns are still returned, but they contain only asterisks or a hash of the original value.
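The two outcomes described here, rejection and masking, can be sketched in a few lines of Python. This is a simplified model of the behaviour, not LLAP or Ranger code: `AuthorizationError`, `mask_value`, `read_rows` and the mask mode names `"redact"` and `"hash"` are hypothetical, and SHA-256 is chosen only for illustration.

```python
import hashlib
from typing import Dict, List, Set


class AuthorizationError(Exception):
    """Raised when a query asks for a column the user may not read."""


def mask_value(value: str, mode: str) -> str:
    # "redact" replaces every character with an asterisk;
    # "hash" returns a SHA-256 hex digest of the original value.
    if mode == "redact":
        return "*" * len(value)
    if mode == "hash":
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    raise ValueError(f"unknown mask mode: {mode}")


def read_rows(rows: List[Dict[str, object]], requested: List[str],
              allowed: Set[str], masks: Dict[str, str]) -> List[Dict[str, object]]:
    # Stop before reading anything if any requested column is not authorized.
    denied = [c for c in requested if c not in allowed]
    if denied:
        raise AuthorizationError(f"no access to columns: {denied}")
    # Otherwise return the requested columns, masking those with a mask policy.
    return [
        {c: mask_value(str(row[c]), masks[c]) if c in masks else row[c]
         for c in requested}
        for row in rows
    ]
```

In this model a masked column counts as authorized: the caller gets a value back for it, just not the original one, whereas an unauthorized column fails the whole request.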
Ranger can also be used to offer row level security, so a query will return only the rows that a user has permission to see. As Vaks explains, a row level policy from Ranger would instruct Hive to return a query plan that includes a predicate that filters unauthorised rows. Spark receives the modified query plan and initiates processing, reading data through LLAP. LLAP ensures that the predicate is applied and that the restricted rows are not returned.
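As a rough illustration of the row-level case, the sketch below models a query plan as a small Python object and shows a row policy being added to it as an extra predicate before execution. The names `QueryPlan`, `apply_row_policy` and `execute` are hypothetical and do not correspond to Hive, Spark or LLAP classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]
Predicate = Callable[[Row], bool]


@dataclass
class QueryPlan:
    table: str
    columns: List[str]
    predicates: List[Predicate] = field(default_factory=list)


def apply_row_policy(plan: QueryPlan, row_filter: Predicate) -> QueryPlan:
    # Return a new plan with the policy's predicate added to the user's own filters.
    return QueryPlan(plan.table, plan.columns, plan.predicates + [row_filter])


def execute(plan: QueryPlan, rows: List[Row]) -> List[Row]:
    # Keep only rows that satisfy every predicate, then project the requested columns.
    return [
        {c: row[c] for c in plan.columns}
        for row in rows
        if all(pred(row) for pred in plan.predicates)
    ]


# Example: a user who may only see rows for the "EU" region.
customers = [{"name": "Ann", "region": "EU"}, {"name": "Bob", "region": "US"}]
secured = apply_row_policy(QueryPlan("customers", ["name"]),
                           lambda row: row["region"] == "EU")
visible = execute(secured, customers)
```

Note that the predicate can reference a column (`region`) that the query itself does not select; the filter is applied before projection, so the user never sees the restricted rows.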