Architecture - Simplified Learning

Hive Architecture

The main Apache Hive components are:

Metastore

The metastore stores metadata for each table, such as its schema and location, and also includes partition metadata. This helps the driver keep track of the data sets distributed over the cluster. The metadata is kept in a traditional RDBMS format. Because this metadata is crucial, a backup server regularly replicates it so that it can be retrieved in case of data loss.

Driver

The driver acts like a controller that receives HiveQL statements. It starts the execution of a statement by creating a session, and it monitors the life cycle and progress of the execution. The driver stores the necessary metadata generated during the execution of a HiveQL statement. It also acts as the collection point for the data or query result obtained after the reduce operation.

Compiler

The compiler compiles the HiveQL query and converts it into an execution plan. The plan contains the tasks and the steps that MapReduce must perform to produce the output the query asks for. The compiler first converts the query to an Abstract Syntax Tree (AST). After checking the AST for compatibility and compile-time errors, it converts the AST to a Directed Acyclic Graph (DAG).

Optimizer

The optimizer performs various transformations on the execution plan to produce an optimized DAG. It can aggregate transformations, such as converting a pipeline of joins into a single join, for better performance. It can also split tasks, such as applying a transformation on the data before a reduce operation, for the same reason.
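The join-merging idea can be illustrated with a toy sketch. This is not Hive's optimizer: the plan representation (a list of `(join_key, tables)` steps), the function name `merge_joins`, and the table names are all hypothetical, and the sketch assumes that adjacent joins can be merged whenever they share the same join key.

```python
def merge_joins(steps):
    """Merge adjacent join steps that share the same join key.

    Each step is a (join_key, tables) pair. This is a toy model of
    "converting a pipeline of joins to a single join".
    """
    merged = []
    for key, tables in steps:
        if merged and merged[-1][0] == key:
            # Same key as the previous join: fold the new tables into it.
            previous_tables = merged[-1][1]
            extra = [t for t in tables if t not in previous_tables]
            merged[-1] = (key, previous_tables + extra)
        else:
            # Different key: this join stays a separate step.
            merged.append((key, list(tables)))
    return merged


# Hypothetical three-join pipeline; the first two joins share a key.
pipeline = [
    ("user_id", ["users", "orders"]),
    ("user_id", ["orders", "payments"]),
    ("region", ["payments", "regions"]),
]
optimized = merge_joins(pipeline)
```

Here the first two steps collapse into one three-way join on `user_id`, while the join on `region` remains a separate step.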

Executor

Once compilation and optimization are complete, the executor executes the tasks. The executor also takes care of pipelining the tasks.

CLI, UI, and Thrift Server

The CLI (command-line interface) and UI provide a user interface for an external user to interact with Hive. The Thrift server in Hive allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols.

Hive Shell

Using the Hive shell, we can interact with Hive by issuing commands or queries in HiveQL (also written HQL). The Hive shell is the command-line interface for Hive and is very similar to the MySQL shell. Like SQL, HiveQL is case-insensitive (except for string comparisons). We can run the Hive shell in two modes: non-interactive mode and interactive mode.

1. Non-Interactive mode

The Hive shell can be run in non-interactive mode. With the -f option, we specify the location of a file that contains HQL queries. For example: hive -f my-script.q
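As a minimal sketch of how a script might be prepared for non-interactive mode, the following Python writes a few statements to a file and builds the `hive -f` command line. The helper names (`write_script`, `hive_file_command`), the `sales` table, and the temporary file location are illustrative assumptions; only the `-f` option itself comes from the text above.

```python
import tempfile
from pathlib import Path


def write_script(path, statements):
    """Write HiveQL statements to a file, one per line, each ending in ';'."""
    text = "".join(s.strip().rstrip(";") + ";\n" for s in statements)
    Path(path).write_text(text)
    return text


def hive_file_command(script_path):
    """Return the argument list that runs a script file non-interactively."""
    return ["hive", "-f", str(script_path)]


# Hypothetical script contents; the table name is made up for illustration.
script = Path(tempfile.gettempdir()) / "my-script.q"
contents = write_script(script, ["SHOW TABLES", "SELECT * FROM sales LIMIT 10;"])
command = hive_file_command(script)
```

On a machine where `hive` is on the PATH, the resulting list could be passed to `subprocess.run(command, check=True)`.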

2. Interactive mode

The Hive shell can also be run in interactive mode. In this mode, we go directly to the Hive shell, submit the required queries manually, and get the results. For example, running bin/hive starts the Hive shell.

Features of Hive

Apache Hive has many features. Let’s discuss them one by one:

Hive makes data summarization, querying, and analysis much easier.
Hive supports external tables, which make it possible to process data where it already resides, without moving it into Hive's own warehouse directory.
Apache Hive fits the low-level interface requirement of Hadoop perfectly.
It also supports partitioning of data at the level of tables to improve performance.
Hive has a rule-based optimizer for optimizing logical plans.
It is scalable, familiar, and extensible.
Using HiveQL doesn’t require knowledge of a programming language; knowledge of basic SQL queries is enough.
We can easily process structured data in Hadoop using Hive.
Querying in Hive is very simple as it is similar to SQL.
We can also run ad-hoc queries for data analysis using Hive.
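To see why table-level partitioning helps performance, here is a small sketch that assumes each partition lives in its own directory named `column=value` under the table location. The warehouse path, the `sales` table, and the helper names are hypothetical. A query that filters on a partition column then only needs to read the matching directories.

```python
from pathlib import PurePosixPath


def partition_path(table_location, spec):
    """Build the directory for one partition, e.g. .../year=2024/month=01."""
    path = PurePosixPath(table_location)
    for column, value in spec.items():
        path = path / f"{column}={value}"
    return str(path)


def prune(partitions, **wanted):
    """Keep only the partitions whose columns match the requested values."""
    return [
        spec
        for spec in partitions
        if all(spec.get(column) == value for column, value in wanted.items())
    ]


# Hypothetical partitions of a sales table, partitioned by year and month.
partitions = [
    {"year": "2023", "month": "12"},
    {"year": "2024", "month": "01"},
    {"year": "2024", "month": "02"},
]

# A filter on year=2024 leaves two of the three directories to scan.
to_scan = [
    partition_path("/warehouse/sales", spec)
    for spec in prune(partitions, year="2024")
]
```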

Limitations of Hive

Hive does not offer real-time queries.
Hive traditionally does not offer row-level updates or deletes; newer releases support them only on transactional (ACID) tables.
It provides acceptable, but not low, latency for interactive data browsing.
Sub-query support is limited: early versions of Hive accept sub-queries only in the FROM clause.
Latency for Apache Hive queries is generally very high.
It is not good for online transaction processing.
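Join support is restricted: join conditions were long limited to equality comparisons, although outer joins themselves (LEFT, RIGHT, and FULL OUTER JOIN) are supported.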
It supports overwriting or appending data, but not updates and deletes.

Hive Architecture and its Components

Hive Architecture can be categorized into the following components.

1. Hive Clients

Apache Hive supports applications written in languages such as C++, Java, and Python through JDBC, Thrift, and ODBC drivers. Thus, one can easily write a Hive client application in a language of their choice. Hive supports different types of client applications for performing queries. These clients are categorized into 3 types:

1. Thrift Clients – Because the Apache Hive server is based on Thrift, it can serve requests from all languages that support Thrift.

2. JDBC Clients – Apache Hive allows Java applications to connect to it using the JDBC driver, which is defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.

3. ODBC Clients – The ODBC driver allows applications that support the ODBC protocol to connect to Hive. Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.

2. Hive Services

Hive provides various services, such as a web interface and the CLI, to perform queries.

1. Hive CLI (Command Line Interface) – This is the default shell that Hive provides, in which you can execute your Hive queries and commands directly.

2. Apache Hive Web Interfaces – Apart from the command line interface, Hive also provides a web-based GUI for executing Hive queries and commands.

3. Hive Server – Hive server is built on Apache Thrift and is therefore also referred to as the Thrift Server. It allows different clients to submit requests to Hive and retrieve the final result.

4. Hive Driver – The driver is responsible for receiving the queries that a Hive client submits through the Thrift, JDBC, ODBC, CLI, or Web UI interface.

Compiler – The Hive driver then passes the query to the compiler, where parsing, type checking, and semantic analysis take place with the help of the schema present in the metastore.
Optimizer – It generates the optimized logical plan in the form of a DAG (Directed Acyclic Graph) of MapReduce and HDFS tasks.
Executor – Once compilation and optimization are complete, the execution engine executes these tasks in the order of their dependencies using Hadoop.
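Executing tasks "in the order of their dependencies" amounts to a topological ordering of the DAG. The sketch below uses Python's standard `graphlib` module (Python 3.9+) on a made-up plan; the stage names are hypothetical and do not reflect how Hive actually names or schedules its stages.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each stage maps to the set of stages it depends on.
plan = {
    "stage-1-map-reduce": set(),
    "stage-2-map-reduce": {"stage-1-map-reduce"},
    "stage-3-move-to-table": {"stage-2-map-reduce"},
    "stage-4-update-metastore": {"stage-3-move-to-table"},
}


def execution_order(dag):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(plan)
```

Because this example plan is a simple chain, only one valid order exists. A plan with independent branches would allow several valid orders, and an engine could run those branches in parallel.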

5. Metastore – The metastore is the central repository of Apache Hive metadata in the Hive architecture. It stores metadata for Hive tables (such as their schema and location) and partitions in a relational database. It provides client access to this information through the metastore service API. The Hive metastore consists of two fundamental units:

A service that provides metastore access to other Apache Hive services.
Disk storage for the Hive metadata which is separate from HDFS storage.
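The idea of keeping table and partition metadata in a relational database can be sketched with Python's built-in `sqlite3`. The schema below (`tbl_meta`, `col_meta`, `part_meta`) and every function name are invented for illustration; they are not the real metastore schema or the metastore service API.

```python
import sqlite3


def create_metastore():
    """Create an in-memory database with a simplified, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE tbl_meta (tbl_name TEXT PRIMARY KEY, location TEXT NOT NULL);
        CREATE TABLE col_meta (tbl_name TEXT, position INTEGER,
                               col_name TEXT, col_type TEXT);
        CREATE TABLE part_meta (tbl_name TEXT, part_spec TEXT, location TEXT);
        """
    )
    return conn


def register_table(conn, name, location, columns):
    """Record a table's location and its ordered (name, type) columns."""
    conn.execute("INSERT INTO tbl_meta VALUES (?, ?)", (name, location))
    conn.executemany(
        "INSERT INTO col_meta VALUES (?, ?, ?, ?)",
        [(name, i, col, typ) for i, (col, typ) in enumerate(columns)],
    )


def add_partition(conn, name, part_spec):
    """Record a partition stored in a sub-directory of the table location."""
    (location,) = conn.execute(
        "SELECT location FROM tbl_meta WHERE tbl_name = ?", (name,)
    ).fetchone()
    conn.execute(
        "INSERT INTO part_meta VALUES (?, ?, ?)",
        (name, part_spec, f"{location}/{part_spec}"),
    )


def get_metadata(conn, name):
    """Return the schema, location, and partition locations of a table."""
    (location,) = conn.execute(
        "SELECT location FROM tbl_meta WHERE tbl_name = ?", (name,)
    ).fetchone()
    columns = conn.execute(
        "SELECT col_name, col_type FROM col_meta WHERE tbl_name = ? ORDER BY position",
        (name,),
    ).fetchall()
    partitions = [
        row[0]
        for row in conn.execute(
            "SELECT location FROM part_meta WHERE tbl_name = ? ORDER BY part_spec",
            (name,),
        )
    ]
    return {"location": location, "columns": columns, "partitions": partitions}


store = create_metastore()
register_table(store, "sales", "/warehouse/sales", [("id", "INT"), ("amount", "DOUBLE")])
add_partition(store, "sales", "year=2024")
metadata = get_metadata(store, "sales")
```

Note that the database holds only descriptions of the data (schema, locations), which is why the text above treats it as storage separate from HDFS.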

3. Processing framework and Resource Management

Hive internally uses the Hadoop MapReduce framework to execute queries.

4. Distributed Storage

Hive is built on top of Hadoop, so it uses the underlying HDFS for distributed storage.

How to process data with Apache Hive?

The User Interface (UI) calls the execute interface of the Driver.
The driver creates a session handle for the query. Then it sends the query to the compiler to generate an execution plan.
The compiler needs the metadata, so it sends a getMetaData request to the metastore and receives the metadata back in the sendMetaData response.
The compiler uses this metadata to type check the expressions in the query. It then generates the plan, which is a DAG of stages, with each stage being either a map/reduce job, a metadata operation, or an operation on HDFS. For map/reduce stages, the plan contains map operator trees and a reduce operator tree.
The execution engine submits these stages to the appropriate components. In each task, the deserializer associated with the table or intermediate outputs reads the rows from HDFS files, and the rows are passed through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializer. These temporary files feed the subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table’s location.
For queries, the execution engine reads the contents of the temporary file directly from HDFS as part of the fetch call from the Driver.
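The per-task flow described above (deserialize rows, pass them through an operator tree, serialize the output) can be sketched in a few lines of Python. Everything here is a simplified assumption: rows are tab-delimited text lines, the "operator tree" is a flat list of functions, and the names `deserialize`, `serialize`, `filter_op`, `project_op`, and `run_task` are hypothetical rather than Hive classes.

```python
def deserialize(line, delimiter="\t"):
    """Turn one stored text line into a row (a list of column values)."""
    return line.rstrip("\n").split(delimiter)


def serialize(row, delimiter="\t"):
    """Turn a row back into a text line for the output file."""
    return delimiter.join(str(value) for value in row) + "\n"


def filter_op(predicate):
    """Operator that keeps a row only if the predicate holds."""
    return lambda row: [row] if predicate(row) else []


def project_op(indexes):
    """Operator that keeps only the selected columns of a row."""
    return lambda row: [[row[i] for i in indexes]]


def run_task(lines, operators):
    """Read rows, push each through the operators in order, write the output."""
    output = []
    for line in lines:
        rows = [deserialize(line)]
        for operator in operators:
            rows = [out for row in rows for out in operator(row)]
        output.extend(serialize(row) for row in rows)
    return output


# Hypothetical input file contents: id, name, age.
input_lines = ["1\talice\t30\n", "2\tbob\t17\n", "3\tcarol\t45\n"]

# Roughly "SELECT name FROM people WHERE age >= 18" as two operators.
temp_file = run_task(
    input_lines,
    [filter_op(lambda row: int(row[2]) >= 18), project_op([1])],
)
```

In this toy model, `temp_file` plays the role of the temporary file that a later stage, or the driver's fetch call, would read.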
