Prerequisites
Before installing HBase or Hive:
- Java 8 or 11 installed ($JAVA_HOME configured)
- Hadoop installed and configured (HDFS working; start-dfs.sh runs successfully)
- Linux environment (Ubuntu/Debian preferred)
- Basic familiarity with HDFS commands and XML config files
Optional for Hive:
- MySQL server installed and running
- JDBC connector for MySQL placed in Hive’s lib/ directory
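A quick sanity check before proceeding (assumes Hadoop's binaries are on the PATH):
java -version          # should report Java 8 or 11
hadoop version         # confirms the Hadoop install
hdfs dfs -ls /         # confirms HDFS is reachable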
HBase Core Concepts
- HBase is a distributed, column-oriented NoSQL database built on HDFS.
- Data is organized as: Tables → Rows → Column Families → Columns → Cells (timestamped values).
- HBase uses ZooKeeper for coordination.
- Tables must be created with at least one column family.
- Supports random, real-time read/write access to large-scale data.
HBase Installation (Standalone / Pseudo-Distributed)
1. Download and Configure
wget https://archive.apache.org/dist/hbase/1.4.13/hbase-1.4.13-bin.tar.gz
tar xvf hbase-1.4.13-bin.tar.gz
mv hbase-1.4.13 ~/hbase
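Optionally verify the unpacked release before configuring anything:
~/hbase/bin/hbase version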
2. Set environment variables (.bashrc or hbase-env.sh)
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HBASE_HOME=~/hbase
export PATH=$PATH:$HBASE_HOME/bin
export CLASSPATH=$CLASSPATH:$HBASE_HOME/lib/*
export HBASE_MANAGES_ZK=true
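Reload the shell profile so the variables take effect in the current session:
source ~/.bashrc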
3. Configure hbase-site.xml (in $HBASE_HOME/conf)
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase_data</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hadoop/zookeeper</value>
</property>
</configuration>
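The hdfs://localhost:9000 authority in hbase.rootdir must match fs.defaultFS in Hadoop's core-site.xml; a typical pseudo-distributed entry looks like this (shown for reference; your port may differ):
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>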
4. Run HDFS and HBase
start-dfs.sh
start-hbase.sh
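To verify that everything came up, jps should list the HDFS and HBase daemons (HQuorumPeer appears because HBASE_MANAGES_ZK=true):
jps
# NameNode, DataNode, SecondaryNameNode   (HDFS)
# HMaster, HRegionServer, HQuorumPeer     (HBase)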
HBase Shell Basics
hbase shell
create 'test', 'cf'                    # table 'test' with one column family 'cf'
put 'test', 'row1', 'cf:a', 'value1'   # write a cell
get 'test', 'row1'                     # read a single row
scan 'test'                            # read all rows
disable 'test'                         # a table must be disabled before dropping
drop 'test'
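A few other shell commands worth knowing:
status               # cluster status (servers, regions)
list                 # show all tables
exit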
Hive Core Concepts
- Hive is a data warehouse built on Hadoop, with a SQL-like interface (HiveQL).
- Executes queries as MapReduce or Spark jobs.
- Uses a metastore (backed by MySQL, Derby, etc.) to store metadata.
- Best for batch-oriented data analysis.
Hive Installation
1. Download and Extract
wget https://archive.apache.org/dist/hive/hive-2.3.9/apache-hive-2.3.9-bin.tar.gz
tar -xzf apache-hive-2.3.9-bin.tar.gz
mv apache-hive-2.3.9 ~/hive
2. Set environment variables (.bashrc)
export HIVE_HOME=~/hive
export PATH=$HIVE_HOME/bin:$PATH
export HADOOP_USER_CLASSPATH_FIRST=true
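Apply the changes and confirm the install:
source ~/.bashrc
hive --version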
3. Set up HDFS directories
hdfs dfs -mkdir -p /tmp
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /tmp /user/hive/warehouse
Hive Metastore with MySQL
sudo apt install mysql-server
sudo systemctl start mysql
-- Run in mysql shell
ALTER USER 'root'@'localhost' IDENTIFIED WITH mysql_native_password BY 'password';
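Using root with a plaintext password is fine for a local sandbox only; a common alternative is a dedicated metastore user (names and password here are illustrative — adjust hive-site.xml to match):
CREATE DATABASE metastore;
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hivepass';
GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;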
Add JDBC Connector
wget https://downloads.mysql.com/archives/get/p/3/file/mysql-connector-java-5.1.48.tar.gz
tar -xvzf mysql-connector-java-5.1.48.tar.gz
cp mysql-connector-java-5.1.48/mysql-connector-java-5.1.48.jar ~/hive/lib
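Confirm the driver jar landed where Hive can see it:
ls ~/hive/lib/mysql-connector-java-5.1.48.jar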
Configure hive-site.xml
Create hive-site.xml in $HIVE_HOME/conf, wrapping the properties in a <configuration> element:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
</configuration>
Initialize the Metastore
schematool -initSchema -dbType mysql
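If initialization succeeded, the schema version can be checked and the CLI started:
schematool -info -dbType mysql   # prints the metastore schema version
hive                             # launch the Hive CLI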
HiveQL Commands
CREATE TABLE demo1 (id INT, name STRING);
INSERT INTO demo1 VALUES (1, 'joy');
SELECT * FROM demo1;
DROP TABLE demo1;
Load Data from Local or HDFS
CREATE TABLE demo2 (name STRING);
LOAD DATA INPATH '/demo.txt' INTO TABLE demo2;
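LOAD DATA INPATH moves a file that is already on HDFS into the table's directory. For a file on the local filesystem, add the LOCAL keyword, which copies instead (the path below is illustrative):
LOAD DATA LOCAL INPATH '/home/hadoop/demo.txt' INTO TABLE demo2;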
Hive Table Types
- Managed Table: Data is stored in Hive’s warehouse directory. Dropping table deletes data.
- External Table: Data is managed externally (e.g., already on HDFS). Hive only manages metadata.
CREATE EXTERNAL TABLE logs (
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 'hdfs://localhost:9000/data/logs';
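Because the table is external, dropping it removes only the metadata; the files under /data/logs stay on HDFS:
DROP TABLE logs;   -- metadata gone; data files remain in /data/logs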
Partitioning & Bucketing
Partitioned Table
-- named logs_part to avoid clashing with the external table above;
-- 'date' is a reserved word in HiveQL, so the partition column is dt
CREATE TABLE logs_part (
id INT,
msg STRING
)
PARTITIONED BY (dt STRING);
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
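With dynamic partitioning enabled, partitions can be created on the fly from a query; the staging table raw_logs here is hypothetical:
-- assumes a staging table raw_logs(id INT, msg STRING, dt STRING)
INSERT INTO TABLE logs_part PARTITION (dt)
SELECT id, msg, dt FROM raw_logs;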
Bucketed Table
CREATE TABLE users (
id INT,
name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS;
SET hive.enforce.bucketing=true;   -- needed on Hive 1.x only; Hive 2.0+ always enforces bucketing
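One payoff of bucketing is efficient sampling — Hive can read a single bucket instead of the whole table:
SELECT * FROM users TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);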