Hadoop Core Concepts
HDFS Architecture
- NameNode: Central metadata server, manages file system namespace and block locations.
- DataNode: Stores the actual data blocks, serves client reads and writes, and creates, deletes, or replicates blocks on instruction from the NameNode.
Key Config Files
- core-site.xml: General settings (e.g., default FS URI).
- hdfs-site.xml: HDFS-specific settings such as the replication factor and the NameNode/DataNode storage directories.
1. Prerequisites
- OS: Ubuntu 20.04+
- Java: OpenJDK 8 or 11 (the Java versions Hadoop 3.3.x supports at runtime; this guide uses 11)
sudo apt update && sudo apt install -y openjdk-11-jdk ssh rsync
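start-dfs.sh launches the daemons over SSH, even on a single machine, so the current user generally needs passphrase-less SSH access to localhost. The following is a minimal sketch of that setup, not a hardened configuration; `ensure_local_ssh_key` is a hypothetical helper name, and the RSA key type and the `~/.ssh` default are assumptions you may want to change.

```shell
# ensure_local_ssh_key [SSH_DIR]
# Hypothetical helper: creates a passphrase-less key in SSH_DIR (default ~/.ssh)
# if none exists, and authorizes it for logins to this machine.
ensure_local_ssh_key() {
  local ssh_dir="${1:-$HOME/.ssh}"
  mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"

  # Generate a key only if none exists, so an existing key is never overwritten.
  if [ ! -f "$ssh_dir/id_rsa" ]; then
    ssh-keygen -q -t rsa -P '' -f "$ssh_dir/id_rsa"
  fi

  # Authorize the public key once; the exact-line match avoids duplicates on re-runs.
  touch "$ssh_dir/authorized_keys"
  if ! grep -qxF "$(cat "$ssh_dir/id_rsa.pub")" "$ssh_dir/authorized_keys"; then
    cat "$ssh_dir/id_rsa.pub" >> "$ssh_dir/authorized_keys"
  fi
  chmod 600 "$ssh_dir/authorized_keys"
}
```

After calling `ensure_local_ssh_key`, `ssh localhost` should log in without asking for a password (the first connection may still ask you to confirm the host key).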
Add to ~/.bashrc:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Then reload the shell configuration:
source ~/.bashrc
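The JAVA_HOME path above is specific to the amd64 package; on other architectures (for example arm64) the directory name differs. As a hedged alternative, the JDK home can be derived from the `java` binary itself. `java_home_for` is a hypothetical helper name, and the sketch assumes GNU `readlink -f` as shipped with Ubuntu.

```shell
# java_home_for [JAVA_BINARY]
# Hypothetical helper: prints the JDK home for a java binary
# (default: the java found on PATH).
java_home_for() {
  local java_bin="${1:-$(command -v java)}"
  local real_bin

  # No usable binary: fail instead of printing a bogus path.
  if [ -z "$java_bin" ] || [ ! -e "$java_bin" ]; then
    echo "java not found" >&2
    return 1
  fi

  # Resolve symlinks (e.g. /usr/bin/java -> /etc/alternatives/java -> JDK),
  # then strip the trailing /bin/java to get the JDK home.
  real_bin="$(readlink -f "$java_bin")"
  dirname "$(dirname "$real_bin")"
}
```

With this in place, the export line could read `export JAVA_HOME="$(java_home_for)"`.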
2. Install Hadoop
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xzf hadoop-3.3.6.tar.gz
mv hadoop-3.3.6 ~/hadoop
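Apache publishes a `.sha512` digest next to each release tarball, so it may be worth checking the download before extracting it. The sketch below compares a file against a digest you supply; `verify_sha512` is a hypothetical helper, and it assumes you copy the expected hex string by hand from the official download page.

```shell
# verify_sha512 FILE EXPECTED_HEX
# Hypothetical helper: succeeds only if FILE's SHA-512 digest equals EXPECTED_HEX.
verify_sha512() {
  local file="$1" expected actual

  if [ ! -f "$file" ]; then
    echo "no such file: $file" >&2
    return 1
  fi

  # Normalize to lowercase, since published digests are sometimes uppercase.
  expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
  actual="$(sha512sum "$file" | awk '{print $1}')"

  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file" >&2
    return 1
  fi
}
```

Example call: `verify_sha512 hadoop-3.3.6.tar.gz "<digest from the .sha512 file>"`.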
Add to ~/.bashrc:
export HADOOP_HOME=~/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
Then reload the shell configuration:
source ~/.bashrc
3. Configure Hadoop
core-site.xml (in $HADOOP_CONF_DIR):
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml (replace <your-user> with your login name):
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/<your-user>/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/<your-user>/hadoop_tmp/hdfs/datanode</value>
</property>
</configuration>
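Because the `<your-user>` placeholder is not valid XML if pasted literally, one option is to generate hdfs-site.xml from the shell so the real home directory is filled in. This is a sketch under assumptions: `write_hdfs_site` is a hypothetical helper, the default data root mirrors the paths used above, and the function overwrites any existing hdfs-site.xml in the target directory.

```shell
# write_hdfs_site CONF_DIR [DATA_ROOT]
# Hypothetical helper: writes a single-node hdfs-site.xml into CONF_DIR.
# DATA_ROOT must be an absolute path; it defaults to ~/hadoop_tmp/hdfs.
# WARNING: replaces CONF_DIR/hdfs-site.xml if it already exists.
write_hdfs_site() {
  local conf_dir="$1"
  local data_root="${2:-$HOME/hadoop_tmp/hdfs}"

  mkdir -p "$conf_dir" "$data_root/namenode" "$data_root/datanode"

  # "file://" plus an absolute path yields the file:///... form used above.
  cat > "$conf_dir/hdfs-site.xml" <<EOF
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file://$data_root/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file://$data_root/datanode</value>
  </property>
</configuration>
EOF
}
```

A typical call would be `write_hdfs_site "$HADOOP_CONF_DIR"`.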
hadoop-env.sh (in $HADOOP_CONF_DIR):
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
4. Run HDFS
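Format the NameNode (first run only; reformatting wipes existing HDFS metadata):
hdfs namenode -format
Start the HDFS daemons:
start-dfs.sh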
Verify:
jps
Should show: NameNode, DataNode, SecondaryNameNode
HDFS Web UI: http://localhost:9870
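The jps check can also be scripted. The sketch below reads `jps` output on stdin and reports which of the three expected daemons are absent; `missing_hdfs_daemons` is a hypothetical helper name. It compares the class-name column exactly, because a plain substring search for "NameNode" would also match "SecondaryNameNode".

```shell
# missing_hdfs_daemons
# Hypothetical helper: reads `jps` output ("<pid> <MainClass>" per line) on stdin,
# prints each expected HDFS daemon that is NOT listed, and returns non-zero
# if any daemon is missing.
missing_hdfs_daemons() {
  local jps_output daemon status=0
  jps_output="$(cat)"

  for daemon in NameNode DataNode SecondaryNameNode; do
    # Match the second field exactly rather than by substring.
    if ! printf '%s\n' "$jps_output" |
        awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      echo "$daemon"
      status=1
    fi
  done

  return "$status"
}
```

Usage: `jps | missing_hdfs_daemons && echo "all HDFS daemons are running"`.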
HDFS Commands
Basic Operations
hdfs dfs -ls / # List files
hdfs dfs -mkdir /dir # Create directory
hdfs dfs -put local.txt / # Upload file
hdfs dfs -get /file.txt . # Download file
hdfs dfs -rm /file.txt # Delete file
hdfs dfs -copyFromLocal f / # Upload; like -put, but the source must be a local file
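The commands above can be combined into a quick round-trip check: upload a small file, download it again, and compare the two copies. This is only a sketch; it assumes HDFS is running and `hdfs` is on the PATH, and both the `hdfs_roundtrip` name and the `/tmp/roundtrip-<pid>` scratch directory in HDFS are hypothetical choices.

```shell
# hdfs_roundtrip
# Hypothetical smoke test: put a file into HDFS, get it back, compare byte for byte.
# Returns 0 only if every step succeeds and the downloaded copy is identical.
hdfs_roundtrip() {
  local workdir remote status
  workdir="$(mktemp -d)"
  remote="/tmp/roundtrip-$$"   # scratch directory inside HDFS (assumed to be writable)

  printf 'hello hdfs\n' > "$workdir/in.txt"

  hdfs dfs -mkdir -p "$remote" &&
    hdfs dfs -put "$workdir/in.txt" "$remote/" &&
    hdfs dfs -get "$remote/in.txt" "$workdir/out.txt" &&
    cmp -s "$workdir/in.txt" "$workdir/out.txt"
  status=$?

  # Clean up both the HDFS scratch directory and the local temp directory.
  hdfs dfs -rm -r -f "$remote" >/dev/null 2>&1
  rm -rf "$workdir"
  return "$status"
}
```

Run `hdfs_roundtrip && echo "HDFS round trip OK"` once the daemons are up.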
File Properties
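- Replication: Controlled by dfs.replication (default: 3; use 1 on a single-node setup).
- Block Size: Default 128 MB (set via dfs.blocksize; dfs.block.size is the older, deprecated name).
- Safe Mode: Read-only state in which HDFS accepts no changes to files or blocks; the NameNode enters it at startup, and it can be toggled manually for maintenance: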
hdfs dfsadmin -safemode get|enter|leave
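To make the block size and replication settings concrete, the sketch below estimates how many blocks a file occupies and how much raw disk space its replicas use. Both function names are hypothetical, and the arithmetic assumes the 128 MB default; HDFS does not pad the final block, so raw usage is simply file size times replication.

```shell
# hdfs_block_count FILE_BYTES [BLOCK_BYTES]
# Hypothetical helper: number of HDFS blocks needed for a file.
hdfs_block_count() {
  local file_bytes="$1"
  local block_bytes="${2:-134217728}"   # 128 MB = 128 * 1024 * 1024 bytes
  # Ceiling division: the last block may be only partially filled.
  echo $(( (file_bytes + block_bytes - 1) / block_bytes ))
}

# hdfs_raw_bytes FILE_BYTES REPLICATION
# Hypothetical helper: total bytes stored across all replicas.
hdfs_raw_bytes() {
  echo $(( $1 * $2 ))
}
```

For a 300 MB file (314572800 bytes), `hdfs_block_count 314572800` prints 3 (two full blocks plus one 44 MB block), and `hdfs_raw_bytes 314572800 3` prints 943718400, i.e. 900 MB of raw storage at the default replication factor of 3.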
5. Stop HDFS
stop-dfs.sh