
Apache Pig and Hive Installation Single Node Machine

The Apache Hadoop software library is a framework that allows distributed processing of large data sets across clusters of computers using a simple programming model called MapReduce. It uses a cluster of machines to offer local computation and storage in an efficient way.

Hadoop solutions normally include clusters that are hard to manage and maintain. In many scenarios, they require integration with other tools like MySQL, Mahout, etc. Hadoop works as a series of MapReduce jobs; each of these jobs is high-latency and depends on the others, so no job can start until the previous job has finished successfully.

A Hadoop admin is responsible for the implementation and maintenance of the Hadoop environment. "Hadoop admin" itself may be a title that covers many roles inside the big data world. A Hadoop administrator may additionally perform DBA-like tasks with databases and warehouses such as HBase and Hive, along with security administration and cluster administration. The admin deploys, manages, monitors, and secures the full Hadoop cluster.

For Free, Demo classes Call: 8605110150
Registration Link: Click Here!

Map Reduce:

MapReduce is Hadoop's programming model for processing HDFS data. Apache Hadoop can run MapReduce programs written in different languages such as Java, Ruby, and Python. MapReduce programs execute efficiently in parallel on the cluster. It works in the following phases:

  1. Map phase
  2. Reduce phase
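The two phases can be sketched in plain Python (a toy word count, not Hadoop code) to show what each phase contributes:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map phase: emit a (key, value) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle/sort: group intermediate pairs by key, then sum values per key.
    pairs.sort(key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

lines = ["big data big cluster", "data node"]
intermediate = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(intermediate))  # {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

On a real cluster the map calls run in parallel on many nodes and the framework performs the shuffle/sort between the two phases.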

Tools in Hadoop:

HDFS (Hadoop Distributed File System) – the basic storage for Hadoop.

Apache Pig – an ETL (Extract, Transform, and Load) tool.

MapReduce – a programming-model engine to execute MR jobs.

Apache Hive – a data warehouse tool used to work on historical data using HQL.

Apache Sqoop – a tool to import and export data between an RDBMS and HDFS, and vice versa.

Apache Oozie – a job-scheduling tool to control applications over the cluster.

Apache HBase – a NoSQL database, usually discussed in terms of the CAP (Consistency, Availability, Partition tolerance) theorem.

Master-slave architecture

Hadoop 1.x daemons/services – 5 daemons:

Name Node – the master node; holds the metadata information.

Secondary Name Node – writes logs and other information about cluster activities.

Data Node – the slave node where the actual data resides.

Job Tracker – receives a request to perform a task and divides it into subtasks.

Task Tracker – performs tasks using MR on an individual Data Node and sends heartbeat signals to the Job Tracker.

Hadoop 2.x – YARN (Yet Another Resource Negotiator)

6 daemons:

Name Node – the master node; holds the metadata information.

Secondary Name Node – writes logs and other information about cluster activities.

Data Node – the slave node where the actual data resides.

Resource Manager – receives and runs applications on the cluster.

Job History Server – maintains job status.

Node Manager – manages resources and deployment on a node of the cluster; launches the containers where the actual jobs get executed.

Spark is a framework that does in-memory computation and works with Hadoop; it is written in Scala and runs on the JVM alongside Java. In Hadoop, the input to each phase is a set of key-value pairs. We are going to see a program with steps: how to create a mapper and a reducer to achieve the objective, and how to submit a Hadoop job through the terminal on a Hadoop cluster.
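With Hadoop Streaming, the mapper and reducer can be written as ordinary scripts rather than Java classes. Below is a minimal word-count sketch in Python; the function names and sample input are our own, and on a real cluster the scripts would be submitted through the hadoop-streaming jar rather than run locally:

```python
from collections import defaultdict

def mapper(stream):
    # Mapper: for each input line, emit "word\t1" records (key-value pairs).
    for line in stream:
        for word in line.split():
            yield f"{word}\t1"

def reducer(records):
    # Reducer: after the shuffle/sort phase records arrive grouped by key;
    # accumulate the counts per word.
    counts = defaultdict(int)
    for record in records:
        word, value = record.split("\t")
        counts[word] += int(value)
    return dict(counts)

if __name__ == "__main__":
    # Locally we simulate map -> shuffle/sort -> reduce on sample input.
    mapped = sorted(mapper(["hello hadoop", "hello hive"]))
    print(reducer(mapped))  # {'hadoop': 1, 'hello': 2, 'hive': 1}
```

The `sorted()` call stands in for the shuffle/sort the framework performs between the two phases.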


Let's see some key terms used in the architecture first:

Physical architecture: master-slave architecture

Cluster – a group of computers connected with each other.

5 daemons in Hadoop (services, Hadoop 1.x series framework): Name Node, Secondary Name Node, Data Node, Job Tracker, Task Tracker

Daemons – background running processes, i.e. threads running in the background.

The architecture divides into the following:

Storage/HDFS architecture – Name Node, Secondary Name Node, Data Node, and blocks

Master:

Name Node – the master node, which handles and manages all metadata information and status control. Think of the Name Node as the manager.

Secondary Name Node – the assistant manager: a helper node to the NN that maintains metadata in the FsImage file and generates edit logs. It is not a backup of the NN.

Slave:

Data Node – a slave node where the actual data resides. Data is stored using a block system.

Blocks are memory blocks with a configurable size.

Hadoop 1.x default block size: 64 MB

Hadoop 2.x default block size: 128 MB

Default replication factor: 3

The replication factor is nothing but the number of copies of the data created over the cluster.

Ex: 1 TB of data ⇒ 3 TB on the cluster
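The block and replication arithmetic above can be sketched as a toy calculator (sizes assumed in MB):

```python
import math

def cluster_footprint(file_size_mb, block_size_mb=128, replication=3):
    # Number of HDFS blocks the file is split into (the last block may be partial).
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Raw storage consumed on the cluster once every block is replicated.
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb

# 1 TB file on Hadoop 2.x defaults (128 MB blocks, replication factor 3):
blocks, raw = cluster_footprint(1024 * 1024)
print(blocks, raw)  # 8192 blocks, 3145728 MB (= 3 TB) on the cluster
```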


Process architecture

Job Tracker – master → receives a client request to perform an operation over the cluster. For example, for a file Emp.txt and a query like

Select count(*) from emp;

the Job Tracker generates a Job ID and schedules the work.

Task Tracker – slave → performs a task using the MR phases and sends heartbeats to the JT.

In MR 2.x the Job Tracker is divided into three services:

Resource Manager – a persistent YARN service that receives and runs applications on the cluster. A MapReduce job is an application.

Job History Server – provides information about jobs and their completion.

Application Master – manages each MR job and terminates when the job is completed.

The Task Tracker is replaced with the Node Manager, which manages resources and deployment on a node. It is also responsible for launching the containers in which MR tasks run.

Speculative execution mechanism → if a Task Tracker fails or is not responding, the JT assigns the same task to another available TT.

Rack awareness algorithm – governs how HDFS distributes a file's blocks over the network.
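The default HDFS placement policy (first replica on the writer's node, second and third on two different nodes of one remote rack) can be sketched roughly as follows; the node and rack names are made up for illustration:

```python
import random

def place_replicas(local_node, racks, replication=3):
    # Sketch of the default HDFS placement policy for replication factor 3:
    # first replica on the writer's node, second on a node in a different
    # rack, third on another node in that same remote rack.
    local_rack = next(r for r, nodes in racks.items() if local_node in nodes)
    remote_rack = random.choice([r for r in racks if r != local_rack])
    second, third = random.sample(racks[remote_rack], 2)
    return [local_node, second, third]

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4", "dn5"]}
print(place_replicas("dn1", racks))  # e.g. ['dn1', 'dn4', 'dn3']
```

This spread survives the loss of a whole rack while keeping two of the three replicas on one rack to limit cross-rack write traffic.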

We now have Apache Pig and Hive; both tools are abstractions over MapReduce and are used for ETL (Extract, Transform, and Load) and data warehousing. Let's see how to install them on a single-node machine.


  1. Go to https://pig.apache.org/releases.html

Or download directly from the mirror:

http://mirrors.fibergrid.in/apache/pig/pig-0.17.0/

This will download pig-0.17.0.tar.gz

  2. Untar/extract this file into the Hadoop installation directory.

In our case at location > /home/sachin/hadoop-2.7.7/pig-0.17.0

  3. Now we need to configure Pig by adding Pig entries to the ".bashrc" file. To edit this file, execute the command below:

> sudo gedit ~/.bashrc

And in this file we need to add the following:

#Pig Setting

export PATH=$PATH:/home/sachin/hadoop-2.7.7/pig-0.17.0/bin

export PIG_HOME=/home/sachin/hadoop-2.7.7/pig-0.17.0

export PIG_CLASSPATH=$HADOOP_PREFIX/conf

  4. Check the version: Terminal > pig -version

Output: Apache Pig version 0.17.0 (r1797386)

compiled Jun 02 2017, 15:41:58

If you see the error "java home not set", then:

Terminal > sudo gedit ~/.bashrc and add this line:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle/jre/

  5. Start Pig: Local mode: Terminal > pig -x local

MapReduce mode: Terminal > pig   OR   pig -x mapreduce
——————————————————————————————


Hive Installation:

  1. Download Hive from: https://hive.apache.org/downloads.html → Download a release now

Or directly from: https://www-eu.apache.org/dist/hive/hive-2.3.4/

  2. Unzip it into our Hadoop directory location /home/sachin/hadoop-2.7.7
  3. Edit the ".bashrc" file to update the environment variables for the user.

Terminal > sudo gedit ~/.bashrc

Add the following at the end:

#Hive Setting

export PATH=$PATH:/home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin/bin

export HIVE_HOME=/home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin

export HIVE_CLASSPATH=$HADOOP_PREFIX/conf

  4. Check the version: Terminal > hive --version

 

  5. Create Hive directories within HDFS. The 'warehouse' directory is the location to store tables and data related to Hive. Before this, make sure all services are running.

Terminal > hdfs dfs -mkdir -p /user/hive/warehouse

Terminal > hdfs dfs -mkdir -p /tmp

Terminal > hdfs dfs -chmod g+w /user/hive/warehouse

Terminal > hdfs dfs -chmod g+w /tmp

  6. Set the Hadoop path in hive-env.sh

Terminal > cd /home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin/conf

> sudo cp hive-env.sh.template hive-env.sh

> sudo gedit hive-env.sh

 

Add the following lines at the end:

export HADOOP_HEAPSIZE=512

export HADOOP_HOME=/home/sachin/hadoop-2.7.7

export HIVE_CONF_DIR=/home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin/conf

  7. Set hive-site.xml to configure the metastore

** Before the following step, make sure MySQL is installed. We are not going to use Derby as it does not support multiple sessions.

Or install it: Terminal > sudo apt-get update

    > sudo apt-get install mysql-server

If you get the error "Could not get lock on apt/dpkg":

Terminal > ps aux | grep apt

Terminal > sudo kill -9 <process IDs>

Terminal > cd /home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin/conf

> sudo cp hive-default.xml.template hive-site.xml

In real-time environments MySQL is used as the metastore, as it supports multiple users.

MySQL work: go to your MySQL shell:

> sudo mysql -u root -p

> create database hiveMetaStore;    -- any name

> use hiveMetaStore;

> SOURCE /home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin/scripts/metastore/upgrade/mysql/hive-schema-2.3.0.mysql.sql;    -- this is as per the installed Hive version

> create user 'hiveuser'@'%' identified by 'hive123';

> GRANT all on *.* to 'hiveuser'@'localhost' identified by 'hive123';

> flush privileges;

> show tables;
——————————————————————————————


In a multi-node setup with Sqoop, we did the below because we were getting an error:

Terminal > sudo gedit /etc/mysql/mysql.conf.d/mysqld.cnf

Replace the bind address with the master's address:

bind-address = 192.168.60.176

Save file.

Terminal > sudo service mysql restart

Terminal > mysql -u root -p

mysql > GRANT ALL ON *.* to root@'%' IDENTIFIED BY 'root';

mysql > GRANT ALL ON *.* to hiveuser@'%' IDENTIFIED BY 'hiveuser123';
——————————————————————————————

Now in hive-site.xml:

Terminal > sudo cp hive-default.xml.template hive-site.xml

Terminal > sudo gedit hive-site.xml

Add these lines. Note: remove/replace the existing Derby properties with the properties below.

<property>

  <name>javax.jdo.option.ConnectionURL</name>

  <value>jdbc:mysql://localhost/hiveMetaStore?createDatabaseIfNotExist=true</value>

  <description>JDBC connect string for a JDBC metastore</description>

</property>

<property>

  <name>javax.jdo.option.ConnectionDriverName</name>

  <value>com.mysql.jdbc.Driver</value>

  <description>Driver class name for a JDBC metastore</description>

</property>

<property>

  <name>javax.jdo.option.ConnectionUserName</name>

  <value>hiveuser</value>

  <description>Username for connecting to mysql server</description>

</property>

<property>

  <name>javax.jdo.option.ConnectionPassword</name>

  <value>hive123</value>

  <description>Password for connecting to mysql server</description>

</property>

<property>

   <name>hive.querylog.location</name>

   <value>/tmp/hive</value>

   <description>Location of Hive run time structured log file</description>

</property>

<property>

   <name>hive.exec.local.scratchdir</name>

   <value>/tmp/hive</value>

   <description>Local scratch space for Hive jobs</description>

</property>

<property>

   <name>hive.downloaded.resources.dir</name>

   <value>/tmp/hive</value>

   <description>Temporary local directory for added resources in the remote file system.</description>

</property>
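As a quick sanity check, the finished hive-site.xml can be parsed with Python's standard library to confirm the metastore properties are present. This sketch assumes the standard <configuration> root element; the embedded fragment mirrors the example above:

```python
import xml.etree.ElementTree as ET

# A fragment like the one above, wrapped in the <configuration> root element
# that hive-site.xml requires (the names are the standard Hive/JDO keys).
hive_site = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/hiveMetaStore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive123</value>
  </property>
</configuration>"""

def read_properties(xml_text):
    # Collect <name>/<value> pairs from every <property> element.
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}

props = read_properties(hive_site)
# Fail fast if any metastore setting is missing before starting Hive.
required = ["javax.jdo.option.ConnectionURL", "javax.jdo.option.ConnectionDriverName",
            "javax.jdo.option.ConnectionUserName", "javax.jdo.option.ConnectionPassword"]
missing = [name for name in required if name not in props]
print("missing:", missing)  # missing: []
```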


Last step: Terminal > sudo cp /home/sachin/Desktop/mysql-connector-java-5.1.22-bin.jar /home/sachin/hadoop-2.7.7/apache-hive-2.3.4-bin/lib

Check the permissions of the connector jar after copying; it should be readable and not locked,

i.e. mysql-connector-java-5.1.28-bin.jar OR 5.1.22, else we get an "Error code 1 Fail".

Terminal > hive

hive > show databases;

Author:

Mr. Sachin Patil (Hadoop Trainer and Coordinator Exp: 12+ years)

At Sevenmentor Pvt. Ltd.


© Copyright 2019 | Sevenmentor Pvt Ltd.

 

