R & hadoop: An Introduction to Apache Hive

Wednesday, 17 December 2014

An Introduction to Apache Hive

Hive is an important member of hadoop ecosystem. It runs on top of hadoop. Hive uses a SQL type query language to process the data in hdfs. Hive is very simple as compared to writing several lines of mapreduce codes using programming languages such as Java. Hive was developed by facebook in a vision to support their SQL experts to handle big data without much difficulty. Hive queries are easy to learn for people who don’t know any programming languages. People having experience in SQL can go straight forward with hive queries. The queries fired into hive will ultimately run as mapreduce.
Hive runs in two execution modes, local and distributed mode.
In local, the hive queries run as a single process and uses the local file system. In distributed mode, the mapper and reducer runs as different process and uses the hadoop distributed file system.
The installation of hive was explained well in my previous post Hive Installation.
Hive stores its contents in hdfs and table details (metadata) in some databases. By default the metadata is stored in derby database, but this is just for play around setups only and cannot be used for multiuser environments. For multiuser environments, we can use databases such as mysql, postgresql , oracle etc for storing the hive metadata. The data are stored in hdfs and it is contained in a location called hive warehouse directory which is defined by the property hive.metastore.warehouse.dir. By default this will be /user/hive/warehouse
We can fire queries into hive using a command line interface or using clients written in different programming languages. Hive server exposes a thrift service making hive accessible from various programming languages .
The simplicity and power of hive can be explained by comparing the word count program written in java program and in hive query.
The word count program written in java is well explained in my previous post A Simple Mapreduce Program – Wordcount . For that have to write a lot of lines of code and it will take time and it needs some good programming knowledge also.
The same word count can be done using hive query in a few lines of hive query.

CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\s')) AS word FROM docs) word
GROUP BY word
ORDER BY word;

3 comments:

Add comment