Introduction:-
Hadoop is a popular open-source framework for distributed storage and processing of large datasets. Many organizations and data professionals use it for big data analytics and processing.
In this blog, we will discuss the essential Hadoop commands and their usage. Whether you're new to Hadoop or a seasoned expert, this blog gives you a practical reference for the commands you are most likely to use and what each one does.
We will cover a range of topics, including Hadoop file system commands, job submission commands, and configuration commands. You'll learn how to manage files and directories in Hadoop, submit jobs for MapReduce processing, and configure Hadoop settings.
How Hadoop commands interact with HDFS:-
Hadoop Distributed File System (HDFS) commands are used to interact with the HDFS file system. When you run an HDFS command, it communicates with the NameNode to perform the requested action. The NameNode is responsible for managing the file system namespace, maintaining the metadata about files and directories, and tracking the location of data blocks in the cluster.
Once the NameNode receives the request, it checks the namespace and, for reads and writes, returns to the client the list of DataNodes that hold (or should receive) the data blocks of the file. The NameNode does not move the data itself: the client then connects directly to those DataNodes to read or write the blocks.
So, HDFS commands talk to the NameNode for metadata and to the DataNodes for the actual data. Operations that touch only metadata, such as listing a directory, creating a directory, or renaming a file, are handled by the NameNode alone. This division is what makes HDFS a distributed file system: the data is spread across multiple DataNodes, and the NameNode keeps track of where it is.
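As a rough illustration of that split, the sketch below first asks the client configuration which NameNode it will contact, then runs one metadata-only command and one command that also transfers block data. The function name and the path /user/demo/sample.txt are made up for this example, and the script assumes the hadoop and hdfs client tools are on your PATH.

```shell
# Sketch: compare a metadata-only request with one that also reads block data.
# show_namenode_interaction and the sample path are hypothetical names.

show_namenode_interaction() {
  local file="$1"

  # fs.defaultFS is the NameNode URI that the client contacts first.
  hdfs getconf -confKey fs.defaultFS

  # Metadata only: the listing is answered from the NameNode's namespace.
  hadoop fs -ls "$file"

  # Metadata plus data: the client gets block locations from the NameNode,
  # then reads the bytes from the DataNodes.
  hadoop fs -cat "$file"
}

# Run only where the Hadoop client tools are installed.
if command -v hadoop >/dev/null 2>&1 && command -v hdfs >/dev/null 2>&1; then
  show_namenode_interaction /user/demo/sample.txt
fi
```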
Commands:-
Hadoop File System Commands:-
hadoop fs -ls [path]: List the files and directories in the specified HDFS directory.
hadoop fs -ls -R [path]: Recursively list all the files and directories in the specified HDFS directory and its subdirectories.
hadoop fs -mkdir [path]: Create a new directory in HDFS.
hadoop fs -mkdir -p [path]: Create a new directory in HDFS along with any missing parent directories.
hadoop fs -touchz [path]: Create an empty file in HDFS.
hadoop fs -rm [path]: Delete a file in HDFS. Deleting a directory requires -r (or -rmdir if the directory is empty).
hadoop fs -rm -r [path]: Delete a directory and its contents in HDFS.
hadoop fs -put [source] [destination]: Copy a file from the local file system to HDFS.
hadoop fs -get [source] [destination]: Copy a file from HDFS to the local file system.
hadoop fs -cat [path]: Display the contents of a file in HDFS.
hadoop fs -tail [path]: Display the last 1KB of a file in HDFS.
hadoop fs -cp [source] [destination]: Copy a file or directory from the source path to the destination path in HDFS.
hadoop fs -mv [source] [destination]: Move a file or directory from the source path to the destination path in HDFS.
hadoop fs -chown [owner][:group] [path]: Change the owner, and optionally the group, of a file or directory in HDFS.
hadoop fs -chmod [mode] [path]: Change the permissions of a file or directory in HDFS.
hadoop fs -du [path]: Display the disk usage of the specified file or directory in HDFS.
hadoop fs -count [path]: Count the number of directories, files, and bytes under the specified path in HDFS.
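hadoop fs -setrep [-R] [-w] [replication] [path]: Set the replication factor for a file, or for every file under a directory, in HDFS. The -w flag waits until the replication completes; -R is accepted for backward compatibility and has no effect.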
hadoop fs -getmerge [src] [localdst]: Merge the files in the specified HDFS directory into a single file on the local file system.
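To show how these commands fit together, here is a minimal sketch of a round trip: create a directory, upload a file, inspect it, download a copy, and clean up. The function name, the file names, the /user/demo/staging directory, and the 644 mode are all example values chosen for illustration. The steps are chained with && so that a failure stops the sequence before the cleanup step.

```shell
# Sketch of a typical upload/inspect/download cycle.
# hdfs_round_trip and every path below are hypothetical example values.

hdfs_round_trip() {
  local local_file="$1" hdfs_dir="$2" local_copy="$3"
  local name
  name="$(basename "$local_file")"

  # Create the target directory along with any missing parents.
  hadoop fs -mkdir -p "$hdfs_dir" &&
  # Upload from the local file system.
  hadoop fs -put "$local_file" "$hdfs_dir/" &&
  # Confirm the file arrived.
  hadoop fs -ls "$hdfs_dir" &&
  # Owner can read and write; everyone else can only read.
  hadoop fs -chmod 644 "$hdfs_dir/$name" &&
  # Show the space used under the directory.
  hadoop fs -du "$hdfs_dir" &&
  # Download a copy back to the local file system.
  hadoop fs -get "$hdfs_dir/$name" "$local_copy" &&
  # Remove the directory and everything in it.
  hadoop fs -rm -r "$hdfs_dir"
}

# Run only where the Hadoop client is installed.
if command -v hadoop >/dev/null 2>&1; then
  hdfs_round_trip ./sales.csv /user/demo/staging ./sales_copy.csv
fi
```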
Hadoop Job Submission Commands:-
hadoop jar [jar_file] [main_class] [args]: Submit a MapReduce job to Hadoop using the specified jar file and main class.
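hadoop job -list: List the jobs in Hadoop that have not yet completed. Use -list all to include finished jobs.
hadoop job -kill [job_id]: Kill the specified running job in Hadoop.
hadoop job -history [job_id]: View the history of a completed job in Hadoop.
Note: In Hadoop 2 and later, hadoop job is deprecated in favor of mapred job, which accepts the same options.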
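The job commands above could be wrapped as in the sketch below. The jar name, the class com.example.WordCount, the HDFS paths, and the function names are placeholders invented for this example. It uses mapred job, the non-deprecated form of hadoop job on Hadoop 2 and later; on older releases, substitute hadoop job.

```shell
# Sketch: submit a MapReduce job and manage it afterwards.
# The jar, class name, paths, and function names are hypothetical.

submit_job() {
  local jar="$1" main_class="$2"
  shift 2
  # Remaining arguments are passed to the job's main class unchanged.
  # Jobs that use the standard file output formats fail at startup
  # if the output directory already exists.
  hadoop jar "$jar" "$main_class" "$@"
}

list_jobs() {
  # Shows jobs that have not yet completed.
  mapred job -list
}

kill_job() {
  # Expects a job ID as printed by list_jobs.
  mapred job -kill "$1"
}

# Run only where the Hadoop client tools are installed.
if command -v hadoop >/dev/null 2>&1 && command -v mapred >/dev/null 2>&1; then
  submit_job wordcount.jar com.example.WordCount /user/demo/input /user/demo/output
  list_jobs
fi
```

To stop a job, pass the ID shown by list_jobs to kill_job.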
Hadoop fsck command:-
The hadoop fsck command checks the integrity of files and directories stored in the Hadoop Distributed File System (HDFS). It reports on their health, including block locations, replication factor, and overall status. In Hadoop 2 and later the same tool is invoked as hdfs fsck; hadoop fsck still works but prints a deprecation warning.
hadoop fsck [path] [-list-corruptfileblocks] [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
path: The HDFS path to start checking from. Use / to check the entire filesystem.
-list-corruptfileblocks: Lists the missing or corrupt blocks and the files they belong to.
-move: Moves corrupted files to /lost+found.
-delete: Deletes corrupted files.
-openforwrite: Includes files that are currently open for writing in the report.
-files: Prints the files being checked. Without this option, fsck prints only a summary.
-blocks: Prints the block report for each file (used together with -files).
-locations: Prints the location of each block replica, that is, the DataNodes where it is stored (used together with -files -blocks).
-racks: Prints the network topology (rack) of the DataNodes where each block replica is stored.
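The reporting options are usually combined. The sketch below shows two common invocations, one that prints every file with its blocks and DataNode locations and one that lists only the files with missing or corrupt blocks. It uses hdfs fsck, the current name for hadoop fsck; the function names and the /user/demo path are example values.

```shell
# Sketch: two common fsck invocations.
# Function names and /user/demo are hypothetical example values.

show_block_locations() {
  # Per-file detail: each file, its blocks, and the DataNodes holding them.
  hdfs fsck "$1" -files -blocks -locations
}

list_corrupt_files() {
  # Only the files that have missing or corrupt blocks.
  hdfs fsck "$1" -list-corruptfileblocks
}

# Run only where the hdfs client is installed.
if command -v hdfs >/dev/null 2>&1; then
  show_block_locations /user/demo
  list_corrupt_files /user/demo
fi
```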
hadoop fs vs hadoop fsck:-
While the hadoop fs command provides a way to interact with files and directories in HDFS, it doesn't provide all of the functionality of hadoop fsck. The hadoop fsck command is a more specialized tool that is used specifically for checking the health of files and directories in HDFS, and it is the proper tool for more comprehensive HDFS health checks.
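One way to see the difference in practice: hadoop fs -ls can tell you that a path exists and how large it is, while fsck tells you whether its blocks are intact. The sketch below reduces an fsck report to a single word by looking for the summary line that fsck prints at the end ("The filesystem under path ... is HEALTHY" or "... is CORRUPT"). The function names and the path are made up for this example, and the exact report wording is an assumption that may differ between Hadoop versions.

```shell
# Sketch: turn an fsck report into a one-word status.
# fsck_status, check_health, and /user/demo are hypothetical names.

fsck_status() {
  # Reads an fsck report on stdin and prints HEALTHY, CORRUPT, or UNKNOWN.
  local report
  report="$(cat)"
  case "$report" in
    *"is CORRUPT"*) echo CORRUPT ;;
    *"is HEALTHY"*) echo HEALTHY ;;
    *) echo UNKNOWN ;;
  esac
}

check_health() {
  # Runs fsck on the given path and summarizes the result.
  hdfs fsck "$1" | fsck_status
}

# Run only where the Hadoop client tools are installed.
if command -v hadoop >/dev/null 2>&1 && command -v hdfs >/dev/null 2>&1; then
  hadoop fs -ls /user/demo      # what is there and how big it is
  check_health /user/demo       # whether its blocks are intact
fi
```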