Apache Hive Adds Support For Set Operations
Written by Kay Ewbank   
Wednesday, 26 July 2017

There's a new release of Apache Hive with new features including support for Set operations and a JDBC Storage Handler. 

Hive can be used to read, write and manage large datasets in distributed storage using SQL. The software includes a command line tool and JDBC driver for connecting users to Hive. Tools are provided for data extract/transform/load (ETL). It can be used to query data via MapReduce, Spark and Tez. Query retrieval can make use of Hive LLAP, YARN and Slider. Hive also supports procedural use of HPL-SQL.


The latest version adds a generic JDBC RDBMS Storage Handler, making it possible to import a standard DB table into Hive.

This release also completes the work begun in Hive 2.1 on Set operations. You can now use the Union, Intersect and Except set operations to combine the results of queries using relational algebra.
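To make the semantics of the three operators concrete, here is a minimal Python sketch that models the duplicate-removing (DISTINCT) form of each one over in-memory rows. It is an illustration of the relational algebra only, not of how Hive executes these queries; the function names, the customer data and the table names in the comments are all hypothetical.

```python
# Rows are modelled as tuples, and each "table" is a plain list of rows.
# Everything here is hypothetical illustration data, not a Hive API.

def union_distinct(left, right):
    """Rows that appear in either input, duplicates removed."""
    return set(left) | set(right)

def intersect_distinct(left, right):
    """Rows that appear in both inputs, duplicates removed."""
    return set(left) & set(right)

def except_distinct(left, right):
    """Rows that appear in the left input but not in the right one."""
    return set(left) - set(right)

customers_2016 = [("alice",), ("bob",), ("carol",), ("bob",)]
customers_2017 = [("bob",), ("carol",), ("dave",)]

# Roughly: SELECT name FROM customers_2016 UNION SELECT name FROM customers_2017
print(sorted(union_distinct(customers_2016, customers_2017)))

# Roughly: SELECT name FROM customers_2016 INTERSECT SELECT name FROM customers_2017
print(sorted(intersect_distinct(customers_2016, customers_2017)))

# Roughly: SELECT name FROM customers_2016 EXCEPT SELECT name FROM customers_2017
print(sorted(except_distinct(customers_2016, customers_2017)))
```

Note that Except is not symmetric: swapping the two inputs in the last call would return the customers who are new in 2017 rather than those who were lost.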

Handling of ACID transactions has been improved in two ways. Firstly, the new release enables predicate pushdown to delta files created by ACID transactions. In earlier versions, a delta file created by an ACID transaction didn't allow predicate pushdown if it contained any update or delete events. This was done deliberately to preserve correctness under the multi-version approach, in which an update event overwrites an existing insert event when the events are collapsed at read time. The new approach splits each update into a delete event followed by a new insert event. This means predicate pushdown can be enabled for all delta files without breaking correctness.
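The following Python sketch is one way to see why pushing a predicate into update events can return stale rows, and why splitting an update into a delete plus an insert avoids the problem. It is a deliberately simplified model under stated assumptions: files are plain dictionaries mapping a row id to a single integer value, and the function names are hypothetical. It does not reflect Hive's actual file layout or reader code.

```python
# Simplified model: each "file" is a dict of row_id -> value.
# All names below are hypothetical and exist only for illustration.

def read_merged_updates(base, updates, predicate, pushdown):
    """Old-style layout: an update event carries the new value for an existing row id."""
    if pushdown:
        # The predicate is applied to each file before the versions are merged.
        base = {rid: v for rid, v in base.items() if predicate(v)}
        updates = {rid: v for rid, v in updates.items() if predicate(v)}
    merged = {**base, **updates}  # the latest version of each row wins
    return {rid: v for rid, v in merged.items() if predicate(v)}

def read_split_updates(base, inserts, deleted_ids, predicate):
    """Split layout: row data is always filtered early; delete events hold only row ids."""
    live = {rid: v for rid, v in {**base, **inserts}.items() if predicate(v)}
    return {rid: v for rid, v in live.items() if rid not in deleted_ids}

base = {1: 10, 2: 3}
wanted = lambda v: v > 5  # the query predicate

# Row 1 is updated from 10 to 3, so no row should match the predicate afterwards.
safe = read_merged_updates(base, {1: 3}, wanted, pushdown=False)
stale = read_merged_updates(base, {1: 3}, wanted, pushdown=True)

# The same change expressed as "delete row 1, insert a new row 3 with value 3".
split = read_split_updates(base, {3: 3}, {1}, wanted)

print(safe, stale, split)
```

In this model the pushed-down read of the old layout filters out the update event itself, so the superseded value 10 survives. In the split layout the delete event is keyed only by row id and is never filtered, so the old version is always removed.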

ACID vectorization has also been improved through the elimination of row-by-row stitching. In earlier versions, a vectorized row batch was created by populating the batch one row at a time before it was passed up the operator pipeline. This was necessary because the ACID insert/update/delete events from the various delta files had to be merged before the current version of a given row could be determined. The improvements to delta file handling mean this is no longer necessary. The updated version reads row batches directly from the underlying ORC files and avoids any stitching.

Once a row batch is read from the split, deleted rows are found by cross-referencing the batch against a data structure that keeps track only of delete events. This is expected to lead to a large performance gain when reading ACID files in vectorized fashion.
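The batch-then-filter idea can be sketched in a few lines of Python. This is a rough model that assumes rows are (row_id, payload) tuples and that the delete events have already been collected into a set of row ids; the function names and the batch size are hypothetical and are not taken from Hive's ORC reader.

```python
# Hypothetical sketch: whole batches are read first, then deleted rows are
# dropped by a membership test against a set of deleted row ids.

def read_batches(rows, batch_size):
    """Yield consecutive slices of rows, standing in for batches read from a file."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def filter_deleted(batch, deleted_ids):
    """Build a selection of surviving positions instead of stitching rows one by one."""
    selected = [i for i, (row_id, _) in enumerate(batch) if row_id not in deleted_ids]
    return [batch[i] for i in selected]

def scan(rows, deleted_ids, batch_size=4):
    """Read every batch and yield only the rows that have not been deleted."""
    for batch in read_batches(rows, batch_size):
        surviving = filter_deleted(batch, deleted_ids)
        if surviving:
            yield surviving

rows = [(i, f"r{i}") for i in range(10)]
deleted_ids = {2, 3, 7}

for batch in scan(rows, deleted_ids):
    print([row_id for row_id, _ in batch])
```

The design point is that the per-row work shrinks to a single set lookup, while the row data itself moves through the pipeline a batch at a time.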

Other improvements include the addition of simple materialized views with manual rebuilds; support for listing views similar to "show tables"; and a UDF to allow interrogation of uniontype values.


More Information

Apache Hive

Release Notes

Related Articles

Hive on Hadoop for MongoDB

SQL At Hadoop Scale

Spark BI Gets Fine Grain Security

Hadoop Adds In-Memory Caching


Last Updated ( Wednesday, 26 July 2017 )
 
 

   
Copyright © 2017 i-programmer.info. All Rights Reserved.