Upload List of Directories to Hdfs Python

Chapter 1. Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a Java-based distributed, scalable, and portable filesystem designed to span large clusters of commodity servers. The design of HDFS is based on GFS, the Google File System, which is described in a paper published by Google. Like many other distributed filesystems, HDFS holds a large amount of data and provides transparent access to many clients distributed across a network. Where HDFS excels is in its ability to store very large files in a reliable and scalable manner.

HDFS is designed to store a lot of information, typically petabytes (for very large files), gigabytes, and terabytes. This is accomplished by using a block-structured filesystem. Individual files are split into fixed-size blocks that are stored on machines across the cluster. Files made of several blocks generally do not have all of their blocks stored on a single machine.
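The block layout can be illustrated with a little arithmetic. The following is only a sketch in plain Python, not HDFS code: split_into_blocks is a hypothetical helper, and the 128 MB block size is an assumption (the actual value is set in the cluster configuration).

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return (offset, length) pairs for a file of file_size bytes.

    The 128 MB default is an assumption for illustration; the real
    block size comes from the cluster configuration.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-size except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)
print(len(blocks))                              # 3
print([length // MB for _, length in blocks])   # [128, 128, 44]
```

Only the final block is shorter than the block size, so a 300 MB file occupies two full blocks and one 44 MB block in this sketch.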

HDFS ensures reliability by replicating blocks and distributing the replicas across the cluster. The default replication factor is three, meaning that each block exists three times on the cluster. Block-level replication enables data availability even when machines fail.
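To make the idea of distributed replicas concrete, here is a toy placement routine. It is an illustration only: place_replicas and the node names are hypothetical, and the round-robin policy is a stand-in for whatever placement policy a real cluster applies.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Toy policy for illustration; it is not how HDFS chooses nodes.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        # Start at a different node for each block so load is spread out.
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(["blk_1", "blk_2"], nodes)
print(placement["blk_1"])   # ['dn1', 'dn2', 'dn3']
print(placement["blk_2"])   # ['dn2', 'dn3', 'dn4']
```

Because every block lives on three different machines, losing any single machine still leaves two copies of each block it held.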

This chapter begins by introducing the core concepts of HDFS and explains how to interact with the filesystem using the native built-in commands. After a few examples, a Python client library is introduced that enables HDFS to be accessed programmatically from within Python applications.

Overview of HDFS

The architectural design of HDFS is composed of two processes: a process known as the NameNode holds the metadata for the filesystem, and one or more DataNode processes store the blocks that make up the files. The NameNode and DataNode processes can run on a single machine, but HDFS clusters commonly consist of a dedicated server running the NameNode process and possibly thousands of machines running the DataNode process.

The NameNode is the most important machine in HDFS. It stores metadata for the entire filesystem: filenames, file permissions, and the location of each block of each file. To allow fast access to this information, the NameNode stores the entire metadata structure in memory. The NameNode also tracks the replication factor of blocks, ensuring that machine failures do not result in data loss. Because the NameNode is a single point of failure, a secondary NameNode can be used to generate snapshots of the primary NameNode's memory structures, thereby reducing the risk of data loss if the NameNode fails.

The machines that store the blocks within HDFS are referred to as DataNodes. DataNodes are typically commodity machines with large storage capacities. Unlike a NameNode failure, the failure of a DataNode does not stop HDFS from operating normally. When a DataNode fails, the NameNode will replicate the lost blocks to ensure each block meets the minimum replication factor.
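The bookkeeping behind re-replication can be sketched as a lookup over the block-to-node mapping that the NameNode holds. This is a simplified model under stated assumptions: under_replicated and the block and node names are hypothetical, and a real NameNode tracks far more state.

```python
def under_replicated(block_map, failed_node, replication=3):
    """Return {block: number_of_missing_replicas} after a node failure.

    block_map maps a block ID to the set of nodes holding a replica.
    """
    result = {}
    for block, holders in block_map.items():
        live = set(holders) - {failed_node}
        missing = replication - len(live)
        if missing > 0:
            result[block] = missing
    return result


block_map = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
print(under_replicated(block_map, "dn1"))   # {'blk_1': 1}
```

In this model only blk_1 had a replica on dn1, so it is the only block that needs one new copy.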

The example in Figure 1-1 illustrates the mapping of files to blocks in the NameNode, and the storage of blocks and their replicas within the DataNodes.

The following section describes how to interact with HDFS using the built-in commands.

Interacting with HDFS

Interacting with HDFS is primarily performed from the command line using the script named hdfs. The hdfs script has the following usage:

$ hdfs COMMAND [-option <arg>]

The COMMAND argument instructs which functionality of HDFS will be used. The -option argument is the name of a specific option for the specified command, and <arg> is one or more arguments that are specified for this option.
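When the hdfs script is driven from Python rather than typed at a prompt, the same COMMAND, option, and argument structure maps naturally onto an argument list. The helpers below are a hypothetical sketch (hdfs_argv and run_hdfs are not part of Hadoop); run_hdfs assumes the hdfs script is on the PATH and is not called here.

```python
import subprocess


def hdfs_argv(command, *args):
    """Build the argument list for `hdfs COMMAND [-option <arg>]`."""
    return ["hdfs", command] + [str(a) for a in args]


def run_hdfs(command, *args):
    """Run the command and return its stdout as text.

    Assumes the hdfs script is installed and on the PATH.
    """
    return subprocess.check_output(hdfs_argv(command, *args)).decode("utf-8")


print(hdfs_argv("dfs", "-ls", "/"))   # ['hdfs', 'dfs', '-ls', '/']
```

Passing a list instead of a single shell string avoids quoting problems when paths contain spaces.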

Common File Operations

To perform basic file manipulation operations on HDFS, use the dfs command with the hdfs script. The dfs command supports many of the same file operations found in the Linux shell.

It is important to note that the hdfs command runs with the permissions of the system user running the command. The following examples are run from a user named "hduser."

List Directory Contents

To list the contents of a directory in HDFS, use the -ls command:

$ hdfs dfs -ls
$

Running the -ls command on a new cluster will not return any results. This is because the -ls command, without any arguments, will attempt to display the contents of the user's home directory on HDFS. This is not the same home directory on the host machine (e.g., /home/$USER), but is a directory within HDFS.

Providing -ls with the forward slash (/) as an argument displays the contents of the root of HDFS:

$ hdfs dfs -ls /
Found 2 items
drwxr-xr-x   - hadoop supergroup    0 2015-09-20 14:36 /hadoop
drwx------   - hadoop supergroup    0 2015-09-20 14:36 /tmp

The output provided by the hdfs dfs command is similar to the output on a Unix filesystem. By default, -ls displays the file and folder permissions, owners, and groups. The two folders displayed in this example are automatically created when HDFS is formatted. The hadoop user is the name of the user under which the Hadoop daemons were started (e.g., NameNode and DataNode), and the supergroup is the name of the group of superusers in HDFS (e.g., hadoop).
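Because the listing is column-oriented, it can be split into fields when a script needs to act on it. The parser below is a minimal sketch that assumes the eight-column layout shown in the example above; parse_ls and Entry are hypothetical names, and other Hadoop versions may format the listing differently.

```python
from collections import namedtuple

# Column names are descriptive labels chosen for this sketch.
Entry = namedtuple("Entry", "permissions replication owner group size date time path")


def parse_ls(output):
    """Parse `hdfs dfs -ls` style text into Entry tuples (all fields are strings)."""
    entries = []
    for line in output.splitlines():
        # Split into at most 8 fields so a path containing spaces stays whole.
        parts = line.split(None, 7)
        # Skip the "Found N items" header and anything that is not an entry.
        if len(parts) != 8 or parts[0][0] not in "-d":
            continue
        entries.append(Entry(*parts))
    return entries


sample = """Found 2 items
drwxr-xr-x   - hadoop supergroup    0 2015-09-20 14:36 /hadoop
drwx------   - hadoop supergroup    0 2015-09-20 14:36 /tmp
"""
for entry in parse_ls(sample):
    print(entry.path + " " + entry.owner + " " + entry.permissions)
```

Parsing command output is inherently fragile; a client library that returns structured data is usually the sturdier choice when one is available.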

Creating a Directory

Home directories within HDFS are stored in /user/$USER. From the previous example with -ls, it can be seen that the /user directory does not currently exist. To create the /user directory within HDFS, use the -mkdir command:

$ hdfs dfs -mkdir /user

To make a home directory for the current user, hduser, use the -mkdir command again:

$ hdfs dfs -mkdir /user/hduser

Use the -ls command to verify that the previous directories were created:

$ hdfs dfs -ls -R /user
drwxr-xr-x   - hduser supergroup    0 2015-09-22 18:01 /user/hduser

Copy Data onto HDFS

After a directory has been created for the current user, data can be uploaded to the user's HDFS home directory with the -put command:

$ hdfs dfs -put /home/hduser/input.txt /user/hduser

This command copies the file /home/hduser/input.txt from the local filesystem to /user/hduser/input.txt on HDFS.
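The -put command also accepts directories, so uploading a list of local directories from Python can be reduced to issuing one -put per directory. The following is a sketch under stated assumptions: put_commands and upload are hypothetical helpers, the example paths are made up, and upload assumes the hdfs script is on the PATH and that the destination directory already exists. Only the command construction runs here.

```python
import subprocess


def put_commands(local_dirs, hdfs_dest):
    """Return one `hdfs dfs -put` argument list per local directory."""
    return [["hdfs", "dfs", "-put", local_dir, hdfs_dest] for local_dir in local_dirs]


def upload(local_dirs, hdfs_dest):
    """Run each -put and return the directories whose upload failed.

    Assumes the hdfs script is on the PATH; not called in this sketch.
    """
    failed = []
    for argv in put_commands(local_dirs, hdfs_dest):
        if subprocess.call(argv) != 0:
            failed.append(argv[3])
    return failed


# Hypothetical local directories, for illustration only.
for argv in put_commands(["/home/hduser/logs", "/home/hduser/data"], "/user/hduser"):
    print(" ".join(argv))
```

Issuing one command per directory makes it possible to report exactly which uploads failed instead of losing that detail in a single combined call.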

Use the -ls command to verify that input.txt was copied to HDFS:

$ hdfs dfs -ls
Found 1 items
-rw-r--r--   1 hduser supergroup         52 2015-09-20 13:20 input.txt

Retrieving Data from HDFS

Multiple commands allow data to be retrieved from HDFS. To simply view the contents of a file, use the -cat command. -cat reads a file on HDFS and displays its contents to stdout. The following command uses -cat to display the contents of /user/hduser/input.txt:

$ hdfs dfs -cat input.txt
jack be nimble
jack be quick
jack jumped over the candlestick

Data can also be copied from HDFS to the local filesystem using the -get command. The -get command is the opposite of the -put command:

$ hdfs dfs -get input.txt /home/hduser

This command copies input.txt from /user/hduser on HDFS to /home/hduser on the local filesystem.

HDFS Command Reference

The commands demonstrated in this section are the basic file operations needed to begin using HDFS. Below is a full listing of file manipulation commands possible with hdfs dfs. This listing can also be displayed from the command line by specifying hdfs dfs without any arguments. To get help with a specific option, use either hdfs dfs -usage <option> or hdfs dfs -help <option>.

Usage: hadoop fs [generic options]
    [-appendToFile <localsrc> ... <dst>]
    [-cat [-ignoreCrc] <src> ...]
    [-checksum <src> ...]
    [-chgrp [-R] GROUP PATH...]
    [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
    [-chown [-R] [OWNER][:[GROUP]] PATH...]
    [-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
    [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-count [-q] [-h] <path> ...]
    [-cp [-f] [-p | -p[topax]] <src> ... <dst>]
    [-createSnapshot <snapshotDir> [<snapshotName>]]
    [-deleteSnapshot <snapshotDir> <snapshotName>]
    [-df [-h] [<path> ...]]
    [-du [-s] [-h] <path> ...]
    [-expunge]
    [-find <path> ... <expression> ...]
    [-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-getfacl [-R] <path>]
    [-getfattr [-R] {-n name | -d} [-e en] <path>]
    [-getmerge [-nl] <src> <localdst>]
    [-help [cmd ...]]
    [-ls [-d] [-h] [-R] [<path> ...]]
    [-mkdir [-p] <path> ...]
    [-moveFromLocal <localsrc> ... <dst>]
    [-moveToLocal <src> <localdst>]
    [-mv <src> ... <dst>]
    [-put [-f] [-p] [-l] <localsrc> ... <dst>]
    [-renameSnapshot <snapshotDir> <oldName> <newName>]
    [-rm [-f] [-r|-R] [-skipTrash] <src> ...]
    [-rmdir [--ignore-fail-on-not-empty] <dir> ...]
    [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
    [-setfattr {-n name [-v value] | -x name} <path>]
    [-setrep [-R] [-w] <rep> <path> ...]
    [-stat [format] <path> ...]
    [-tail [-f] <file>]
    [-test -[defsz] <path>]
    [-text [-ignoreCrc] <src> ...]
    [-touchz <path> ...]
    [-truncate [-w] <length> <path> ...]
    [-usage [cmd ...]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

The next section introduces a Python library that allows HDFS to be accessed from within Python applications.

Snakebite

Snakebite is a Python package, created by Spotify, that provides a Python client library, allowing HDFS to be accessed programmatically from Python applications. The client library uses protobuf messages to communicate directly with the NameNode. The Snakebite package also includes a command-line interface for HDFS that is based on the client library.

This section describes how to install and configure the Snakebite package. Snakebite's client library is explained in detail with multiple examples, and Snakebite's built-in CLI is introduced as a Python alternative to the hdfs dfs command.

Installation

Snakebite requires Python 2 and python-protobuf 2.4.1 or higher. Python 3 is currently not supported.

Snakebite is distributed through PyPI and can be installed using pip:

$ pip install snakebite

Client Library

The client library is written in Python, uses protobuf messages, and implements the Hadoop RPC protocol for talking to the NameNode. This enables Python applications to communicate directly with HDFS without having to make a system call to hdfs dfs.

List Directory Contents

Example 1-1 uses the Snakebite client library to list the contents of the root directory in HDFS.

Example 1-1. python/HDFS/list_directory.py

from snakebite.client import Client

# Connect to the NameNode's RPC endpoint
client = Client('localhost', 9000)

# ls() is lazy; iterating over it performs the listing
for x in client.ls(['/']):
    print x

The most important line of this program, and every program that uses the client library, is the line that creates a client connection to the HDFS NameNode:

client = Client('localhost', 9000)

The Client() method accepts the following parameters:

host (string): Hostname or IP address of the NameNode
port (int): RPC port of the NameNode
hadoop_version (int): The Hadoop protocol version to be used (default: 9)
use_trash (boolean): Use trash when removing files
effective_user (string): Effective user for the HDFS operations (default: None or current user)

The host and port parameters are required and their values are dependent upon the HDFS configuration. The values for these parameters can be found in the hadoop/conf/core-site.xml configuration file under the property fs.defaultFS:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>

For the examples in this section, the values used for host and port are localhost and 9000, respectively.
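Rather than copying the host and port by hand, a script can read them from the configuration text. The sketch below uses only the standard library; namenode_address is a hypothetical helper, and the sample string assumes the usual layout in which property elements sit inside a configuration root element.

```python
import xml.etree.ElementTree as ET

try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2


def namenode_address(core_site_xml):
    """Return (host, port) taken from the fs.defaultFS property."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "fs.defaultFS":
            uri = urlparse(prop.findtext("value").strip())
            return uri.hostname, uri.port
    raise KeyError("fs.defaultFS not found")


sample = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>"""
print(namenode_address(sample))   # ('localhost', 9000)
```

If the configured URI omits the port, the second element of the result is None and the caller has to supply a port itself.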

After the client connection is created, the HDFS filesystem can be accessed. The rest of the previous application used the ls command to list the contents of the root directory in HDFS:

for x in client.ls(['/']):
    print x

It is important to note that many of the methods in Snakebite return generators. Therefore they must be consumed to execute. The ls method takes a list of paths and returns a generator that yields one map (a Python dictionary) of file information per entry.
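The practical consequence of generator-returning methods is easy to miss: calling the method does nothing until the result is iterated. The stand-in below demonstrates this without a cluster; fake_mkdir is a hypothetical function written for this sketch and is not Snakebite code.

```python
calls = []


def fake_mkdir(paths):
    """Stand-in for a generator-returning client method."""
    for path in paths:
        calls.append(path)   # the side effect happens only during iteration
        yield {"path": path, "result": True}


pending = fake_mkdir(["/foo", "/bar"])
print(calls)               # [] because nothing has run yet
results = list(pending)    # consuming the generator performs the work
print(calls)               # ['/foo', '/bar']
```

This is why the examples in this section always loop over the return value, even when the printed result is not needed.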

Executing the list_directory.py application yields the following results:

$ python list_directory.py
{'group': u'supergroup', 'permission': 448, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1442752574936L, 'length': 0L, 'blocksize': 0L, 'owner': u'hduser', 'path': '/tmp'}
{'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1442742056276L, 'length': 0L, 'blocksize': 0L, 'owner': u'hduser', 'path': '/user'}
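The permission and modification_time values in this output are raw numbers: the permission is a decimal integer holding the usual mode bits, and the timestamp is in milliseconds since the Unix epoch. The helpers below are a hypothetical sketch for turning them into the familiar forms; they are not part of Snakebite.

```python
import time


def permission_string(mode, file_type="-"):
    """Render mode bits such as 493 (octal 755) as a string like drwxr-xr-x."""
    letters = "rwxrwxrwx"
    out = ""
    for i, letter in enumerate(letters):
        mask = 1 << (8 - i)          # walk from the owner-read bit down
        out += letter if mode & mask else "-"
    return file_type + out


def format_millis(millis):
    """Format a millisecond epoch timestamp as a UTC date and time."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(millis / 1000.0))


print(permission_string(448, "d"))    # drwx------
print(permission_string(493, "d"))    # drwxr-xr-x
print(format_millis(1442752574936))   # 2015-09-20 12:36:14
```

The decoded values line up with the hdfs dfs -ls listing of the same directories; the timestamp is printed in UTC, so it can differ from a listing shown in local time.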

Create a Directory

Use the mkdir() method to create directories on HDFS. Example 1-2 creates the directories /foo/bar and /input on HDFS.

Example 1-2. python/HDFS/mkdir.py

from snakebite.client import Client

client = Client('localhost', 9000)

# create_parent=True also creates missing parent directories
for p in client.mkdir(['/foo/bar', '/input'], create_parent=True):
    print p

Executing the mkdir.py application produces the following results:

$ python mkdir.py
{'path': '/foo/bar', 'result': True}
{'path': '/input', 'result': True}

The mkdir() method takes a list of paths and creates the specified paths in HDFS. This example used the create_parent parameter to ensure that parent directories were created if they did not already exist. Setting create_parent to True is analogous to the mkdir -p Unix command.
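What "creating parents" means can be spelled out as a walk up the path. The function below is only a model of the mkdir -p behavior, using the standard posixpath module; paths_created is a hypothetical helper and does not touch HDFS.

```python
import posixpath


def paths_created(path, existing):
    """List the directories a `mkdir -p` style call would add, parents first."""
    missing = []
    current = posixpath.normpath(path)
    # Walk upward until an existing directory (or the root) is reached.
    while current not in existing and current != "/":
        missing.append(current)
        current = posixpath.dirname(current)
    return list(reversed(missing))


print(paths_created("/foo/bar", existing={"/"}))   # ['/foo', '/foo/bar']
```

Without parent creation, only the last element could be created, and the call would fail whenever /foo did not already exist.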

Deleting Files and Directories

Deleting files and directories from HDFS can be accomplished with the delete() method. Example 1-3 recursively deletes the /foo and /input directories, created in the previous example.

Example 1-3. python/HDFS/delete.py

from snakebite.client import Client

client = Client('localhost', 9000)

# recurse=True removes each directory together with its contents
for p in client.delete(['/foo', '/input'], recurse=True):
    print p

Executing the delete.py application produces the following results:

$ python delete.py
{'path': '/foo', 'result': True}
{'path': '/input', 'result': True}

Performing a recursive delete will delete any subdirectories and files that a directory contains. If a specified path cannot be found, the delete method throws a FileNotFoundException. If recurse is not specified and a subdirectory or file exists, DirectoryException is thrown.

The recurse parameter is equivalent to rm -rf and should be used with care.

Retrieving Data from HDFS

Like the hdfs dfs command, the client library contains multiple methods that allow data to be retrieved from HDFS. To copy files from HDFS to the local filesystem, use the copyToLocal() method. Example 1-4 copies the file /input/input.txt from HDFS and places it under the /tmp directory on the local filesystem.

Example 1-4. python/HDFS/copy_to_local.py

from snakebite.client import Client

client = Client('localhost', 9000)

# Copy the HDFS file into the local /tmp directory
for f in client.copyToLocal(['/input/input.txt'], '/tmp'):
    print f

Executing the copy_to_local.py application produces the following result:

$ python copy_to_local.py
{'path': '/tmp/input.txt', 'source_path': '/input/input.txt', 'result': True, 'error': ''}

To simply read the contents of a file that resides on HDFS, the text() method can be used. Example 1-5 displays the content of /input/input.txt.

Example 1-5. python/HDFS/text.py

from snakebite.client import Client

client = Client('localhost', 9000)

# text() yields the contents of each requested file
for l in client.text(['/input/input.txt']):
    print l

Executing the text.py application produces the following results:

$ python text.py
jack be nimble
jack be quick
jack jumped over the candlestick

The text() method will automatically uncompress and display gzip and bzip2 files.
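Transparent decompression of this kind is commonly done by inspecting the first bytes of the data. The sketch below shows that general technique with the standard gzip and bz2 modules; read_text is a hypothetical helper, and nothing here is taken from Snakebite's actual implementation.

```python
import bz2
import gzip
import io


def read_text(raw):
    """Decode raw bytes, decompressing gzip or bzip2 data when detected."""
    if raw[:2] == b"\x1f\x8b":      # gzip magic number
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    elif raw[:3] == b"BZh":         # bzip2 magic number
        raw = bz2.decompress(raw)
    return raw.decode("utf-8")


plain = b"jack be nimble\n"
print(read_text(plain) == read_text(bz2.compress(plain)))   # True
```

Detecting the format from the content rather than the file extension means a misnamed file is still read correctly.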

CLI Client

The CLI client included with Snakebite is a Python command-line HDFS client based on the client library. To execute the Snakebite CLI, the hostname or IP address of the NameNode and RPC port of the NameNode must be specified. While there are many ways to specify these values, the easiest is to create a ~/.snakebiterc configuration file. Example 1-6 contains a sample config with the NameNode hostname of localhost and RPC port of 9000.

Example 1-6. ~/.snakebiterc

{
  "config_version": 2,
  "skiptrash": true,
  "namenodes": [
    {"host": "localhost", "port": 9000, "version": 9}
  ]
}

The values for host and port can be found in the hadoop/conf/core-site.xml configuration file under the property fs.defaultFS.
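The configuration file can also be generated instead of typed by hand, which guards against small syntax slips such as a stray trailing comma. This sketch uses the standard json module and only the keys shown in Example 1-6; snakebiterc is a hypothetical helper, and writing the result to disk is left to the caller.

```python
import json


def snakebiterc(host, port, version=9, skiptrash=True):
    """Return the text of a version 2 ~/.snakebiterc for one NameNode."""
    config = {
        "config_version": 2,
        "skiptrash": skiptrash,
        "namenodes": [{"host": host, "port": port, "version": version}],
    }
    return json.dumps(config, indent=2, sort_keys=True)


text = snakebiterc("localhost", 9000)
print(text)
# To install it, write `text` to os.path.expanduser("~/.snakebiterc").
```

Because the text comes from json.dumps, it is guaranteed to parse as JSON.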

For more information on configuring the CLI, see the Snakebite CLI documentation online.

Usage

To use the Snakebite CLI client from the command line, simply use the command snakebite. Use the ls option to display the contents of a directory:

$ snakebite ls /
Found 2 items
drwx------   - hadoop    supergroup    0 2015-09-20 14:36 /tmp
drwxr-xr-x   - hadoop    supergroup    0 2015-09-20 11:40 /user

Like the hdfs dfs command, the CLI client supports many familiar file manipulation commands (e.g., ls, mkdir, df, du, etc.).

The major difference between snakebite and hdfs dfs is that snakebite is a pure Python client and does not need to load any Java libraries to communicate with HDFS. This results in quicker interactions with HDFS from the command line.

CLI Command Reference

The following is a full listing of file manipulation commands possible with the snakebite CLI client. This listing can be displayed from the command line by specifying snakebite without any arguments. To view help with a specific command, use snakebite [cmd] --help, where cmd is a valid snakebite command.

snakebite [general options] cmd [arguments]
general options:
  -D --debug               Show debug information
  -V --version             Hadoop protocol version (default:9)
  -h --help                show help
  -j --json                JSON output
  -n --namenode            namenode host
  -p --port                namenode RPC port (default: 8020)
  -v --ver                 Display snakebite version

commands:
  cat [paths]                  copy source paths to stdout
  chgrp <grp> [paths]          change group
  chmod <mode> [paths]         change file mode (octal)
  chown <owner:grp> [paths]    change owner
  copyToLocal [paths] dst      copy paths to local
                                 file system destination
  count [paths]                display stats for paths
  df                           display fs stats
  du [paths]                   display disk usage statistics
  get file dst                 copy files to local
                                 file system destination
  getmerge dir dst             concatenates files in source dir
                                 into destination local file
  ls [paths]                   list a path
  mkdir [paths]                create directories
  mkdirp [paths]               create directories and their
                                 parents
  mv [paths] dst               move paths to destination
  rm [paths]                   remove paths
  rmdir [dirs]                 delete a directory
  serverdefaults               show server information
  setrep <rep> [paths]         set replication factor
  stat [paths]                 stat information
  tail path                    display last kilobyte of the
                                 file to stdout
  test path                    test a path
  text path [paths]            output file in text format
  touchz [paths]               creates a file of zero length
  usage <cmd>                  show cmd usage

to see command-specific options use: snakebite [cmd] --help

Chapter Summary

This chapter introduced and described the core concepts of HDFS. It explained how to interact with the filesystem using the built-in hdfs dfs command. It also introduced the Python library, Snakebite. Snakebite's client library was explained in detail with multiple examples. The snakebite CLI was also introduced as a Python alternative to the hdfs dfs command.


Source: https://www.oreilly.com/library/view/hadoop-with-python/9781492048435/ch01.html