What is Apache Hive?
Apache Hive is a data warehouse system for summarizing, analyzing, and querying large datasets on the open-source Hadoop platform. It converts SQL-like queries into MapReduce jobs, which makes it easier to process extremely large volumes of data.
Hive is deployed for three main functions: data summarization, data analysis, and data querying. Its query language is HiveQL. Hive translates SQL-like HiveQL queries into MapReduce jobs that run on Hadoop. HiveQL also supports custom MapReduce scripts that can be plugged into queries. In addition, Hive allows flexibility in schema design and in data serialization and deserialization.
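To make the translation from a SQL-like query to a MapReduce job more concrete, here is a minimal sketch in plain Python. It is not Hive's actual execution plan; it only imitates the map, shuffle, and reduce phases for a query along the lines of SELECT page, COUNT(*) FROM logs GROUP BY page. The table name logs, the sample rows, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical input rows (user, page), standing in for a table named "logs".
ROWS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/cart"),
    ("carol", "/home"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _user, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which corresponds to COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(rows):
    # Conceptually: SELECT page, COUNT(*) FROM logs GROUP BY page
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_page(ROWS))
```

In Hive the user writes only the query; the map and reduce steps that this sketch spells out by hand are generated by the system.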
Hive is best suited for batch jobs over large sets of append-only data, such as web logs. It is not suitable for online transaction processing (OLTP) systems, since it does not provide real-time querying or row-level updates.
Important characteristics of Hive
- In Hive, tables and databases are created first and then data is loaded into these tables.
- Hive is a data warehouse designed for managing and querying only structured data that is stored in tables.
- For structured data, MapReduce on its own lacks optimization and usability features such as user-defined functions (UDFs), whereas the Hive framework provides them. Query optimization here means executing a query in a way that performs well.
- Hive’s SQL-inspired language shields the user from the complexity of MapReduce programming. It reuses familiar concepts from the relational database world, such as tables, rows, columns, and schemas, for ease of learning.
- Hadoop works on flat files, so Hive can use directory structures to “partition” data and improve performance on certain queries.
- An important component of Hive is the Metastore, which stores schema information. The Metastore typically resides in a relational database.
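The partitioning idea from the list above can be sketched with ordinary directories. The example assumes a key=value directory naming scheme with a hypothetical partition column dt, and a query that filters on that column. Only the matching directory is read, which is where the performance gain comes from. The file name data.txt and the sample rows are illustrative assumptions, and a local temporary directory stands in for HDFS.

```python
import os
import tempfile


def write_partitioned(base_dir, rows):
    """Write each row to <base_dir>/dt=<date>/data.txt, one directory per partition value."""
    for dt, line in rows:
        part_dir = os.path.join(base_dir, f"dt={dt}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "data.txt"), "a", encoding="utf-8") as f:
            f.write(line + "\n")


def read_partition(base_dir, dt):
    """Read only the directory for one partition value, skipping all others."""
    path = os.path.join(base_dir, f"dt={dt}", "data.txt")
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def demo():
    # Hypothetical sample data: (partition value, raw line).
    rows = [
        ("2024-01-01", "alice,/home"),
        ("2024-01-02", "bob,/cart"),
        ("2024-01-01", "carol,/home"),
    ]
    with tempfile.TemporaryDirectory() as base:
        write_partitioned(base, rows)
        partitions = sorted(os.listdir(base))
        selected = read_partition(base, "2024-01-01")
    return partitions, selected


if __name__ == "__main__":
    print(demo())
```

A filter on the partition column touches one directory here, while a filter on any other field would still have to scan every file.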
Features of Hive
- It stores schema in a database and processed data in HDFS.
- It is designed for OLAP.
- It provides an SQL-like query language called HiveQL (HQL).
- It is familiar, fast, scalable, and extensible.
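The split between schema kept in a database and data kept in files can be illustrated with a small sketch. It uses an in-memory SQLite database as a stand-in for the Metastore and a list of comma-separated strings as a stand-in for files in HDFS. The table layout, the names table_columns, logs, user, and hits, and the two supported types are assumptions for illustration; the real Metastore schema is considerably richer.

```python
import sqlite3

# Maps the type names stored in the toy metastore to Python converters.
CASTS = {"string": str, "int": int}


def build_metastore():
    """Create an in-memory relational store that holds column definitions only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table_columns (tbl TEXT, position INTEGER, name TEXT, type TEXT)"
    )
    conn.executemany(
        "INSERT INTO table_columns VALUES (?, ?, ?, ?)",
        [("logs", 0, "user", "string"), ("logs", 1, "hits", "int")],
    )
    return conn


def read_table(conn, tbl, raw_lines):
    """Apply the stored schema to raw delimited lines at read time."""
    schema = conn.execute(
        "SELECT name, type FROM table_columns WHERE tbl = ? ORDER BY position",
        (tbl,),
    ).fetchall()
    rows = []
    for line in raw_lines:
        fields = line.split(",")
        rows.append(
            {name: CASTS[typ](value) for (name, typ), value in zip(schema, fields)}
        )
    return rows


if __name__ == "__main__":
    metastore = build_metastore()
    print(read_table(metastore, "logs", ["alice,3", "bob,5"]))
```

The raw lines carry no column names or types of their own; structure is applied only when the data is read, using what the schema store says.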
Hadoop Hive Architecture
Apache Hive supports client applications written in languages such as C++, Java, and Python through JDBC, Thrift, and ODBC drivers. One can therefore write a Hive client application in the language of one's choice.
Hive provides various services, such as a web interface and a command-line interface (CLI), for running queries.
Processing framework and Resource Management
Hive internally uses the Hadoop MapReduce framework to execute the queries.
Because Hive is built on top of Hadoop, it uses the underlying HDFS for distributed storage.