Prerequisites

Before installing HBase or Hive:

  • Java 8 or 11 installed ($JAVA_HOME configured)
  • Hadoop installed and configured (HDFS working)
  • start-dfs.sh runs successfully
  • Linux environment (Ubuntu/Debian preferred)
  • Basic familiarity with HDFS commands and XML config files

Optional for Hive:

  • MySQL server installed and running
  • JDBC connector for MySQL placed in Hive’s lib/ directory
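
A quick way to sanity-check these prerequisites (the last command applies only if you plan to use MySQL for the Hive metastore):

java -version                 # expect Java 8 or 11
echo $JAVA_HOME               # should print the JDK path
hadoop version                # Hadoop is on the PATH
hdfs dfs -ls /                # HDFS is reachable (start-dfs.sh has run)
sudo systemctl status mysql   # optional: MySQL is running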

HBase Core Concepts

  • HBase is a distributed, column-oriented NoSQL database built on HDFS.
  • Data is organized as:
    Tables → Rows → Column Families → Columns → Cells (timestamped values)
  • HBase uses ZooKeeper for coordination.
  • Tables must be created with at least one column family.
  • Supports random, real-time read/write access to large-scale data.
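
To make the model concrete: a single cell is addressed by row key, column family:qualifier, and timestamp. The shell lookup below is illustrative (the 'users' table and its names are hypothetical):

# fetch the newest version of one cell: row 'u42', column 'info:email'
get 'users', 'u42', {COLUMN => 'info:email', VERSIONS => 1}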

HBase Installation (Standalone / Pseudo-Distributed)

1. Download and Configure

wget https://archive.apache.org/dist/hbase/1.4.13/hbase-1.4.13-bin.tar.gz
tar xvf hbase-1.4.13-bin.tar.gz
mv hbase-1.4.13 ~/hbase

2. Add to ~/.bashrc or conf/hbase-env.sh

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HBASE_HOME=~/hbase
export PATH=$PATH:$HBASE_HOME/bin
export CLASSPATH=$CLASSPATH:$HBASE_HOME/lib/*
export HBASE_MANAGES_ZK=true   # HBase starts/stops its own bundled ZooKeeper
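
If the variables go in ~/.bashrc, reload it so they take effect in the current shell:

source ~/.bashrc
echo $HBASE_HOME   # should print the install path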

3. hbase-site.xml

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase_data</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hadoop/zookeeper</value>
  </property>
</configuration>
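
The hbase.rootdir host and port must match fs.defaultFS in Hadoop's core-site.xml; for the localhost:9000 value above, that entry would look like:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>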

4. Run HDFS and HBase

start-dfs.sh
start-hbase.sh
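
jps should now list both sets of daemons (exact names can vary by version and setup):

jps
# expect: NameNode, DataNode, HMaster, HRegionServer,
# and HQuorumPeer (since HBASE_MANAGES_ZK=true)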

HBase Shell Basics

hbase shell                            # open the interactive shell
create 'test', 'cf'                    # table 'test' with one column family 'cf'
put 'test', 'row1', 'cf:a', 'value1'   # write one cell
get 'test', 'row1'                     # read a single row
scan 'test'                            # read all rows
disable 'test'                         # a table must be disabled before dropping
drop 'test'
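
A few more shell commands that are handy for checking state:

list      # all tables
status    # cluster summary (servers, regions)
version   # HBase version
exit      # leave the shell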

Hive Core Concepts

  • Hive is a data warehouse built on Hadoop, with a SQL-like query interface (HiveQL).
  • Executes queries as MapReduce, Tez, or Spark jobs.
  • Uses a metastore (backed by MySQL, Derby, etc.) to store table metadata.
  • Best for batch-oriented analysis, not low-latency lookups.

Hive Installation

1. Download and Extract

wget https://archive.apache.org/dist/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
tar -xzf apache-hive-2.3.9-bin.tar.gz
mv apache-hive-2.3.9 ~/hive

2. Add to ~/.bashrc

export HIVE_HOME=~/hive
export PATH=$HIVE_HOME/bin:$PATH
export HADOOP_USER_CLASSPATH_FIRST=true   # put user classpath entries ahead of Hadoop's own jars
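
Reload the environment and confirm Hive resolves (the banner should report 2.3.9):

source ~/.bashrc
hive --version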

3. Setup HDFS directories

hdfs dfs -mkdir -p /tmp /user/hive/warehouse
hdfs dfs -chmod g+w /tmp /user/hive/warehouse

Hive Metastore with MySQL

sudo apt install mysql-server
sudo systemctl start mysql
-- Run in the mysql shell; the password must match ConnectionPassword in hive-site.xml below
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'password';

Add JDBC Connector

wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-5.1.48.tar.gz
tar -xvzf mysql-connector-java-5.1.48.tar.gz
cp mysql-connector-java-5.1.48/mysql-connector-java-5.1.48.jar ~/hive/lib
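
Confirm the driver jar landed where Hive will pick it up:

ls ~/hive/lib | grep mysql-connector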

Configure $HIVE_HOME/conf/hive-site.xml

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
  </property>
</configuration>

Initialize the Metastore

schematool -initSchema -dbType mysql
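
If initialization succeeds, you can verify the schema version and start the CLI:

schematool -info -dbType mysql   # prints the metastore schema version
hive                             # launch the Hive CLI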

HiveQL Commands

CREATE TABLE demo1 (id INT, name STRING);
INSERT INTO demo1 VALUES (1, 'joy');
SELECT * FROM demo1;
DROP TABLE demo1;

Load Data from Local or HDFS

CREATE TABLE demo2 (name STRING);
LOAD DATA INPATH '/demo.txt' INTO TABLE demo2;   -- without LOCAL, the path is in HDFS and the file is moved
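
For the local-filesystem case from the heading, add LOCAL, which copies the file instead of moving it (the path below is illustrative):

LOAD DATA LOCAL INPATH '/home/hadoop/demo.txt' INTO TABLE demo2;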

Hive Table Types

  • Managed Table: Data is stored in Hive’s warehouse directory. Dropping table deletes data.
  • External Table: Data is managed externally (e.g., already on HDFS). Hive only manages metadata.

CREATE EXTERNAL TABLE logs (
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 'hdfs://localhost:9000/data/logs';
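
Dropping an external table removes only the metastore entry; the files under /data/logs stay in HDFS. Dropping it here also frees the name logs for the partitioned example below:

DROP TABLE logs;   -- metadata only; the HDFS files remain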

Partitioning & Bucketing

Partitioned Table

CREATE TABLE logs (
  id INT,
  msg STRING
)
PARTITIONED BY (`date` STRING);   -- backticks needed: date is a reserved keyword in Hive
 
SET hive.exec.dynamic.partition=true;             -- enable dynamic partitioning
SET hive.exec.dynamic.partition.mode=nonstrict;   -- allow all partition values to be dynamic
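
With those settings, an INSERT can route rows to partitions by value; a minimal sketch, where staging_logs is a hypothetical source table with matching columns:

-- the partition column must come last in the SELECT
INSERT INTO TABLE logs PARTITION (`date`)
SELECT id, msg, `date` FROM staging_logs;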

Bucketed Table

CREATE TABLE users (
  id INT,
  name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS;
 
SET hive.enforce.bucketing=true;   -- Hive 1.x only; removed in Hive 2.0, where bucketing is always enforced
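
One payoff of bucketing is cheap sampling; this query reads only the first of the four buckets:

SELECT * FROM users TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);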