HDFS Guide (File System Shell) Commands
The Hadoop Distributed File System (HDFS) is the heart of Hadoop's storage. There are many ways to interact with HDFS, including Ambari Views, the HDFS Web UI, WebHDFS, and the command line. The first way most people interact with HDFS is via the command-line tool called hdfs. This is a runner that runs other commands, including dfs, which replaces the old hadoop fs command in newer Hadoop releases. This guide is for Hadoop 2.7.3 and newer, including HDP 2.5. The HDFS client can be installed on Linux, Windows, and macOS and used to access your remote or local Hadoop clusters. The easiest way to install it is onto a jump box, using Ambari to install the Hadoop client. I also recommend installing all the clients it suggests, including Pig and Hive. There is a detailed list of every command and option for each version of Hadoop.
Every day I look at Hadoop clusters of various sizes, and there are many tools for interacting with HDFS files: web interfaces, UIs, IDEs, SQL tools, and more. The one universal and fastest way to check things is the shell, or CLI. The following commands are always helpful and are usually hard or slow to do in a graphical interface.
The first command I type every single day is to get a list of directories from the root. This gives you the lay of the land.
To List All the Files in the HDFS Root Directory
You can choose any path from the root down, just as in a regular Linux file system. The -h flag shows sizes in human-readable form and is recommended. -R is another great option for drilling into subdirectories. Often you won't realize how many files and directories you actually have in HDFS; many tools, including Hive, the Spark history server, and BI tools, create directories and files for logs or indexing.
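For example, to list the root with human-readable sizes, and to recurse into a subdirectory (the /dir path here is just a placeholder):

hdfs dfs -ls -h /
hdfs dfs -ls -R /dir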
Create an empty file in an HDFS Directory
This works the same as the Linux touch command and is useful for initializing a file. Sometimes you want to test a user's permissions and quickly do a write; this is the quickest path for you. You can also bulk-upload a chunk of files via: hdfs dfs -put *.txt /test1/. The reason I want to do this is so I can show you a very interesting command called getmerge. Both writes are sketched below.
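A quick sketch of both operations, using the /test1 directory from above (the file name is just an example):

hdfs dfs -touchz /test1/empty.txt
hdfs dfs -put *.txt /test1/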
Concatenate all the files in a directory into a single file
This creates a new file in your local directory that contains all the files from an HDFS directory, concatenated together. The -nl option adds newlines between files. This is nice when you wish to consolidate a lot of small files into an extract for another system. It is quick and easy and doesn't require a tool like Apache Flume or Apache NiFi; of course, for regular production jobs and for larger numbers of files, you will want a more powerful tool like the two mentioned. For a quick extract that someone wants to see in Excel, concatenating a few dozen CSVs from a directory into one file is helpful.
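For example, merging everything in /test1 into one local file, with newlines between files (the output file name is just an example):

hdfs dfs -getmerge -nl /test1 merged.csv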
Change the Permissions of a /new-dir
The chmod patterns follow the standard Linux patterns, where 777 grants read-write-execute to user, group, and other.
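For example, to open /new-dir to everyone, or to apply the same mode recursively:

hdfs dfs -chmod 777 /new-dir
hdfs dfs -chmod -R 777 /new-dir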
Change the Owner and Group of a New Directory: /new-dir
This changes the ownership of a directory to the admin user and the hadoop group. You must have permission to give ownership to that user and that group, and the user and group must exist. For changing permissions, it is best to sudo to the hdfs user, which is the superuser for HDFS; the Linux root user is not the root owner of the HDFS file system.
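A sketch of the command, run via sudo as the hdfs superuser:

sudo -u hdfs hdfs dfs -chown admin:hadoop /new-dir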
Delete all the ORC files forever, skipping the temporary trash holding.
We use -skipTrash to destroy the files immediately and free up our space; otherwise, they go to a trash directory and wait a configured period of time before being deleted. I use -f to force the deletion. I want these files gone!
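For example, using the /dir placeholder path:

hdfs dfs -rm -f -skipTrash /dir/*.orc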
Move A Directory From Local To HDFS and Delete Local
If you want to move a local directory up to HDFS and remove the local copy, the command is moveFromLocal.
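A sketch with placeholder paths for the local source and the HDFS destination:

hdfs dfs -moveFromLocal /tmp/local-dir /test1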
Show Disk Usage in Megabytes for the Directory: /dir
The -h flag gives you a human-readable output of sizes, for example in gigabytes.
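For example:

hdfs dfs -du -h /dir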
When in doubt about what command to use or what to do next, just ask for help. You will also get a detailed listing for each individual command.
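For example, to see help for every command or for one specific command:

hdfs dfs -help
hdfs dfs -help getmerge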
You can also use the older format, hadoop fs, which works on older Hadoop installations as well.
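For example:

hadoop fs -ls -h /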
Since you are logged in as the hdfs superuser, you can also use the HDFS administration commands.
HDFS DFS Administration Overview
There are a number of commands you may need for administering your cluster, if you are one of its administrators. If you are running your own personal cluster or a Sandbox, these are also good to know and try. Do not try these in production unless you are the owner and fully understand the dire consequences of these actions, as they affect the entire Hadoop cluster's distributed file system. You can shut down data nodes, add quotas to directories for various users, and use other administrative features.
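Two examples: a cluster health report, and a name quota limiting the number of files and directories under a path (the path and quota value here are just placeholders):

hdfs dfsadmin -report
hdfs dfsadmin -setQuota 1000 /user/admin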
Do not enter safe mode unless you need to do cluster maintenance, such as adding nodes: you will be putting HDFS into read-only mode, and you need to run safemode leave to get out of it. These commands may take time, as they wait for pending writes to finish and for jobs to stop accessing the servers.
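The three safe mode operations look like this:

hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave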
For additional administration commands, see the references below. The list of commands above will cover most of the uses and analysis you will need to do.
Resources: