Hadoop and Spark are distributed data processing frameworks used for big data analytics.
Hadoop: Hadoop is based on the MapReduce processing model. It processes data in two phases: Map phase, where data is split into key-value pairs and processed in parallel across multiple nodes, and Reduce phase, where the results from the Map phase are aggregated and combined to produce the final output.
Spark: Spark also supports the MapReduce model, but it introduces the concept of Resilient Distributed Datasets (RDDs). RDDs are immutable distributed collections of objects that can be processed in parallel. Spark allows for in-memory data processing, reducing the need for costly disk I/O between stages, which can make it significantly faster than Hadoop's MapReduce for certain use cases.
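To make the RDD idea concrete, here is a minimal PySpark sketch of word count (illustrative only; it assumes pyspark is installed and a local word_count_data.txt, and it is separate from the Hadoop Streaming setup described below):

# word_count_spark.py -- illustrative PySpark sketch
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")
counts = (sc.textFile("word_count_data.txt")          # load lines
            .flatMap(lambda line: line.split())       # Map: split lines into words
            .map(lambda word: (word, 1))              # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))         # Reduce: sum counts per word
print(counts.collect())
sc.stop()

The intermediate RDDs stay in memory between these transformations, which is what avoids the disk I/O between stages mentioned above.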
How Hadoop Streaming works
Mapper Phase:
Hadoop reads data from the input files and divides them into chunks called InputSplits, typically one per HDFS block.
For each InputSplit, Hadoop launches an instance of the external program (e.g., a Python script) as the Mapper. The external Mapper reads data from stdin, processes it, and writes intermediate key-value pairs to stdout.
Shuffling and Sorting:
Hadoop collects the intermediate key-value pairs from all Mappers and performs a shuffling and sorting phase. The output of the Mapper is sorted by keys so that all values for the same key are grouped together.
Reducer Phase:
Hadoop launches instances of the external program as Reducers, one per reduce task.
Each external Reducer reads its sorted key-value pairs from stdin, aggregates the values for each key, and writes the final key-value pairs to stdout.
Output:
The final output from the Reducers is collected and stored in the Hadoop Distributed File System (HDFS) or other output destinations.
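For reference, a minimal sketch of what a streaming Mapper and Reducer can look like for word count is shown below. This is only an illustration of the stdin/stdout contract; the actual mapper.py and reducer.py used later may differ.

mapper.py (sketch):
#!/usr/bin/env python3
# Read lines from stdin and emit one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

reducer.py (sketch):
#!/usr/bin/env python3
# Read sorted "word<TAB>count" lines from stdin and sum the counts per word.
import sys

current_word = None
current_count = 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Because Hadoop sorts the Mapper output by key before it reaches the Reducer, the Reducer only has to detect when the key changes.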
To install Hadoop on Ubuntu and start all Hadoop daemons, follow these steps:
1. Update the system: Open a terminal and run the following command to update the package list and upgrade installed packages:
sudo apt update && sudo apt upgrade -y
2. Install Java: Hadoop requires Java to run. Install OpenJDK 8 using the following command:
sudo apt install openjdk-8-jdk
The following additional packages will be installed:
openjdk-8-jdk-headless openjdk-8-jre openjdk-8-jre-headless
...
...
...
3. Download Hadoop: the latest stable version is 3.3.6 (a 730 MB file).
4. Extract the downloaded archive:
tar xfz hadoop-3.3.6-aarch64.tar.gz
5. Move the extracted directory to /usr/local/hadoop:
sudo mv hadoop-3.3.6 /usr/local/hadoop
6. Add the following environment variables to ~/.bashrc and source it:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
7. cd /usr/local/hadoop/etc/hadoop
8. Add the following environment variable to hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
9. Start all Hadoop daemons:
start-all.sh
10. After a successful installation, when you type
jps
you should see the following processes running:
141399 ResourceManager
143044 NodeManager
184580 DataNode
184820 SecondaryNameNode
189725 NameNode
To run the Python word count application using Hadoop Streaming, follow these steps:
0. The files mapper.py, reducer.py, and word_count_data.txt are in the folder /shared/TRANZ1/CONTENT-TRAINING-TRANZ1/HADOOP-MAP-REDUCE. Make the scripts executable:
chmod 777 mapper.py reducer.py
(A local test of these scripts is sketched after step 5 below.)
1. Before a run, make sure word_count_data.txt has been copied into HDFS (see the -put command in step 5) and remove the output folder left over from any previous run (hdfs dfs -rmdir only removes empty directories, so use -rm -r):
hdfs dfs -rm -r /word_count_in_python/output
2. Run Hadoop Streaming for the Python word count application:
hadoop jar /shared/TRANZ1/CONTENT-TRAINING-TRANZ1/HADOOP-MAP-REDUCE/hadoop-streaming-2.7.3.jar -input /word_count_in_python/word_count_data.txt -output /word_count_in_python/output -mapper /shared/TRANZ1/CONTENT-TRAINING-TRANZ1/HADOOP-MAP-REDUCE/mapper.py -reducer /shared/TRANZ1/CONTENT-TRAINING-TRANZ1/HADOOP-MAP-REDUCE/reducer1.py
If this fails with "ERROR namenode.SecondaryNameNode: Failed to start secondary namenode: java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.defaultFS): file:/// has no authority", see TS3 in the Troubleshooting section below.
3. Check whether the output files were written to HDFS:
hdfs dfs -ls /word_count_in_python/output
4. Display the output file from HDFS:
hdfs dfs -cat /word_count_in_python/output/part-00000
5. Other HDFS commands: to transfer a file from the local Unix filesystem to the HDFS filesystem, use -put:
hdfs dfs -put ~/example.txt /word_count_in_python/output4
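Before submitting the job in step 2, the mapper and reducer from step 0 can be tested locally by simulating the shuffle with a plain sort. The helper below is a hypothetical sketch (it assumes mapper.py and reducer.py are in the current directory):

# local_test.py -- hypothetical local test of the streaming scripts
import subprocess

sample = "hello world hello hadoop world\n"

# Run the mapper on the sample text
mapped = subprocess.run(["python3", "mapper.py"],
                        input=sample, capture_output=True, text=True).stdout

# Simulate Hadoop's shuffle/sort phase by sorting the intermediate pairs
shuffled = "".join(sorted(mapped.splitlines(keepends=True)))

# Run the reducer on the sorted pairs and print the final word counts
reduced = subprocess.run(["python3", "reducer.py"],
                         input=shuffled, capture_output=True, text=True).stdout
print(reduced)

If this prints sensible word counts, the same scripts should behave correctly under Hadoop Streaming.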
Troubleshooting
The logs provided by Hadoop are rich sources of information for troubleshooting. They are kept in /usr/local/hadoop/logs.
TS1:
jps showed that the NameNode was not running, and
netstat -tuln | grep 9000
returned no output.
Check for NameNode Formatting: If this is the first time setting up Hadoop or if you suspect that the NameNode metadata might be corrupted, you can try formatting the NameNode using the following command (This will delete all HDFS data, so use with caution):
hdfs namenode -format
After formatting, try starting Hadoop again using start-dfs.sh.
start-dfs.sh
Starting namenodes on [dhruv]
netstat -tuln | grep 9000
tcp 0 0 192.168.0.102:9000 0.0.0.0:* LISTEN
Now the jps command lists all the running Hadoop processes
141399 ResourceManager
143044 NodeManager
184580 DataNode
184820 SecondaryNameNode
189725 NameNode
TS2:
If you get a connection refused problem, refer to:
https://cwiki.apache.org/confluence/display/HADOOP2/ConnectionRefused
TS3
ERROR
ERROR namenode.SecondaryNameNode: Failed to start secondary namenode java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.defaultFS): file:/// has no authority.
Solution
Check and modify the fs.defaultFS property: open core-site.xml in $HADOOP_HOME/etc/hadoop and look for the fs.defaultFS property. It should be set to the HDFS URI of your active NameNode. For a pseudo-distributed setup on a single node, it would typically be:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
Replace localhost with the hostname or IP address of your NameNode if you are running a distributed setup; this setup uses port 9000 for the NameNode.
Retry starting the secondary namenode: After updating the core-site.xml file, try starting the secondary namenode again:
hdfs secondarynamenode
TS4
ERROR
Call From dhruv/192.168.0.102 to dhruv:9000 failed on connection exception: java.net.ConnectException: Connection refused;
Solution
Connection refused means nothing is listening on dhruv:9000, i.e. the NameNode is not running or not bound to that address. Check that the NameNode is up (see TS1 and TS6), check that fs.defaultFS in core-site.xml points to the correct host and port (see TS3), and consult the ConnectionRefused wiki page linked under TS2 for further causes.
TS6
Problem
The jps command does not show the NameNode; only the ResourceManager, NodeManager, DataNode, and SecondaryNameNode are running.
Solution
Start Hadoop NameNode: If the NameNode is not running, you can manually start it using the following command:
start-dfs.sh
Further, if the problem persists:
Check for NameNode Formatting: If this is the first time setting up Hadoop or if you suspect that the NameNode metadata might be corrupted, you can try formatting the NameNode using the following command (This will delete all HDFS data, so use with caution):
hdfs namenode -format
After formatting, try starting Hadoop again using start-dfs.sh.
TS7
Error
Starting secondary namenodes [dhruv] dhruv: ERROR: Cannot set priority of secondarynamenode process 181356
Solution:
The error "Cannot set priority of secondarynamenode process" during the startup of the secondary namenode in Hadoop indicates that the process is unable to set its priority using the nice command due to insufficient permissions. Like the previous cases, this issue is usually caused by incorrect configurations or missing permissions for the user running Hadoop.
To resolve this error, follow the steps below:
Ensure you are running Hadoop as a non-root user: Running Hadoop as the root user is not recommended for security reasons. Make sure you are logged in as a regular user with sufficient permissions to execute Hadoop processes.
Grant permissions to set process priorities (using nice): you can grant the necessary permissions by adding the user to the sudo group and modifying the sudoers file to allow nice to be run without a password prompt. Open the sudoers file:
sudo visudo
and add the following line (replace <username> with the user running Hadoop):
<username> ALL=(ALL) NOPASSWD: /usr/bin/nice
Then try starting the secondary namenode again:
hdfs secondarynamenode
TS8
Cannot run program "/shared/TRANZ1/CONTENT-TRAINING-TRANZ1/HADOOP-MAP-REDUCE/reducer1.py": error=13, Permission denied
Solution
chmod 777 /shared/TRANZ1/CONTENT-TRAINING-TRANZ1/HADOOP-MAP-REDUCE/reducer1.py