Hadoop Core Concepts

HDFS Architecture

  • NameNode: Central metadata server, manages file system namespace and block locations.
  • DataNode: Stores actual data blocks; sends heartbeats and block reports to the NameNode and serves read/write requests from clients.
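
To see these roles in action on a running cluster, fsck asks the NameNode for the namespace metadata and shows which DataNodes hold each block replica (assumes HDFS is already up):

```shell
# List files, their block IDs, and the DataNode locations of each replica.
hdfs fsck / -files -blocks -locations
```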

Key Config Files

  • core-site.xml: General settings (e.g., default FS URI).
  • hdfs-site.xml: HDFS-specific settings such as the replication factor and the NameNode/DataNode storage directories.

1. Prerequisites

  • OS: Ubuntu 20.04+
  • Java: OpenJDK 11+
Install Java plus the SSH and rsync tools Hadoop's scripts rely on:

sudo apt update && sudo apt install -y openjdk-11-jdk ssh rsync

Add to ~/.bashrc:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Then reload the shell configuration:

source ~/.bashrc

2. Install Hadoop

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz
mv hadoop-3.3.6 ~/hadoop

Add to ~/.bashrc:

export HADOOP_HOME=~/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Then reload the shell configuration:

source ~/.bashrc

3. Configure Hadoop

core-site.xml (this and the files below live in $HADOOP_CONF_DIR):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/<your-user>/hadoop_tmp/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/<your-user>/hadoop_tmp/hdfs/datanode</value>
  </property>
</configuration>
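
The directories referenced above should exist and be writable by the Hadoop user before the first start; creating them up front avoids permission surprises (~ expands to /home/<your-user>, matching the config):

```shell
# Create the NameNode metadata and DataNode block storage directories
# named in hdfs-site.xml.
mkdir -p ~/hadoop_tmp/hdfs/namenode ~/hadoop_tmp/hdfs/datanode
```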

hadoop-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
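
start-dfs.sh launches the daemons over SSH, so passwordless SSH to localhost is typically required before the next step. A common setup, assuming no existing key pair:

```shell
# Generate a key pair (skip if ~/.ssh/id_rsa already exists),
# authorize it for logins to this machine, then test the loop-back login.
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost exit
```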

4. Run HDFS

hdfs namenode -format    # first run only: initializes (and erases) NameNode metadata
start-dfs.sh

Verify:

jps

Should show: NameNode, DataNode, SecondaryNameNode

HDFS Web UI: http://localhost:9870


HDFS Commands

Basic Operations

hdfs dfs -ls /               # List files
hdfs dfs -mkdir /dir         # Create directory
hdfs dfs -put local.txt /    # Upload file
hdfs dfs -get /file.txt .    # Download file
hdfs dfs -rm /file.txt       # Delete file
hdfs dfs -copyFromLocal f /  # Alternate upload
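
A quick round trip through the commands above (file and directory names are illustrative):

```shell
echo "hello hdfs" > local.txt      # create a small local file
hdfs dfs -mkdir -p /demo           # -p creates parent directories, like mkdir -p
hdfs dfs -put local.txt /demo/     # upload
hdfs dfs -cat /demo/local.txt      # print the file's contents from HDFS
hdfs dfs -rm /demo/local.txt       # clean up
```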

File Properties

  • Replication: Controlled by dfs.replication (default: 3; local: use 1)
  • Block Size: Default 128 MB (configurable via dfs.blocksize; dfs.block.size is the deprecated name)
  • Safe Mode: Read-only state for HDFS maintenance
hdfs dfsadmin -safemode get|enter|leave
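
Both properties can also be overridden per file rather than cluster-wide; for example (values illustrative):

```shell
# Upload one file with a 64 MB block size, leaving the cluster default alone.
hdfs dfs -D dfs.blocksize=67108864 -put big.dat /
# Change an existing file's replication factor to 2 and wait for completion.
hdfs dfs -setrep -w 2 /big.dat
```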

5. Stop HDFS

stop-dfs.sh