Apache Hive Hands-on

Avinash Navlani
7 min read · Nov 22, 2022

In this tutorial, we will focus on Hadoop Hive for processing big data.

What is Hive?

Hive is a component of the Hadoop stack. It is an open-source data warehouse tool that runs on top of Hadoop. It was developed by Facebook and later donated to the Apache Software Foundation. It reads, writes, and manages big data tables stored in HDFS or other data sources.

Hive is not designed for frequent row-level insert, update, and delete operations (recent versions support them only in limited form, on transactional tables). Instead, it is used for analytics, mining, and report generation on large data warehouses. Hive uses the Hive Query Language (HiveQL), which is similar to SQL; most of its syntax resembles that of MySQL. It is used for OLAP (Online Analytical Processing) purposes.
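To make the SQL-like flavor of HiveQL concrete, here is a minimal sketch of an analytical query. The table `page_views` and its columns are hypothetical names chosen for illustration, not something from a real dataset, and the ORC storage format is just one common choice.

```sql
-- Hypothetical table for illustration; names and columns are assumptions.
CREATE TABLE IF NOT EXISTS page_views (
  user_id   BIGINT,
  country   STRING,
  view_time TIMESTAMP
)
STORED AS ORC;

-- OLAP-style query: count page views per country and keep the top 10.
SELECT country, COUNT(*) AS views
FROM page_views
GROUP BY country
ORDER BY views DESC
LIMIT 10;
```

Apart from the `STORED AS` clause and the `STRING` type, this reads almost exactly like a MySQL query, which is why Hive is easy to pick up for anyone who already knows SQL.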

Why do we need Hive?

In 2006, Facebook was generating 10 GB of data per day, and in 2007 this grew to 1 TB per day. Before long, it was generating 15 TB of data per day. Initially, Facebook used the Scribe server, an Oracle database, and Python scripts to process large data sets. As the volume of data kept growing, Facebook shifted to Hadoop as its key tool for data analysis and processing.

Facebook used Hadoop to manage its big data but faced problems with ETL operations, because every small operation required writing a Java program. They needed a lot of Java resources that…

