Hadoop Directory Browser

Overview

This Snap browses a given directory path in the Hadoop file system (using the HDFS protocol) and generates a list of all the files in the directory and subdirectories. Use this Snap to identify the contents of a directory before you run any command that uses this information.

Note: The Hadoop Directory Browser Snap supports URIs that use the HDFS and ABFS (Azure Data Lake Storage Gen2) protocols.

For example, if you need to iteratively run a specific command on a list of files, this Snap can help you view the list of all available files.

Path (string): The path to the directory being browsed.
Type (string): The type of file.
Owner (string): The name of the owner of the file.
Creation date (datetime): The date the file was created. In the Hadoop file system, this can often show up as 'null' due to limited API functionality.
Size (in bytes) (int): The size of the file.
Permissions (string): Read, Write, Execute.
Update date (datetime): Date of update.
Name (string): Name of the file.
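
The field list above can be illustrated with a minimal Python sketch that uses the local file system as a stand-in for HDFS. It is not the Snap's implementation: the function name `browse`, the `FILE`/`DIRECTORY` type values, and the permission string format are assumptions made for illustration, and the sketch lists only the direct children of the directory.

```python
import fnmatch
import stat
from datetime import datetime, timezone
from pathlib import Path


def browse(directory: str, file_filter: str = "*") -> list[dict]:
    """List entries directly under `directory` whose names match a glob filter.

    Hypothetical helper: the local file system stands in for HDFS here.
    """
    documents = []
    for entry in sorted(Path(directory).iterdir()):
        # fnmatchcase keeps the glob match case-sensitive on every platform.
        if not fnmatch.fnmatchcase(entry.name, file_filter):
            continue
        info = entry.stat()
        try:
            owner = entry.owner()
        except (NotImplementedError, KeyError, OSError):
            owner = None  # owner lookup is not available on every platform
        documents.append({
            "Path": str(entry.parent),
            # "FILE"/"DIRECTORY" are assumed values, not confirmed Snap output.
            "Type": "DIRECTORY" if entry.is_dir() else "FILE",
            "Owner": owner,
            "Creation date": None,  # often null in HDFS, as noted in the field list
            "Size (in bytes)": info.st_size,
            "Permissions": stat.filemode(info.st_mode),
            "Update date": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
            "Name": entry.name,
        })
    return documents
```

Calling `browse("/some/dir", "*.conf")` on a hypothetical local directory returns one dictionary per matching entry, keyed by the field names listed above.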

This is a Read-type Snap.
Works in Ultra Tasks.

Prerequisites

A Groundplex needs to be configured as a Hadoop client.
The user executing the Snap must have at least Read permissions in the concerned directory.

Snap views


Type	Description	Examples of upstream and downstream Snaps
Input	This Snap has at most one document input view. It contains values for the directory path to be browsed and the glob filter (File Filter pattern) to be applied to select the contents. For example: Directory Path: hdfs://hadoopcluster.domain.com:8020/<user>/<folder_details>; File Filter: *.conf.	Mapper, or any other Snap that offers a directory URI. This can even be a CSV Generator with a collection of file names and their URIs.
Output	This Snap has exactly one output view that provides the attributes (such as Name, Type, Size, Owner, and Last Modification Time) of the contents of the given directory path. Only contents that match the given glob filter are selected.	Mapper. The output is a document listing the attributes of the files contained in the specified directory.
Learn more about Error handling.
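
The Directory Path supplied through the input view is an ordinary URI, so an upstream step can inspect its parts before passing it on. The following Python sketch uses the standard `urllib.parse` module for that. The function name `describe_directory_uri`, the path segments `user/data` and `raw/logs`, and the ABFSS filesystem and account names are hypothetical placeholders, not values from this documentation.

```python
from urllib.parse import urlsplit


def describe_directory_uri(uri: str) -> dict:
    """Split a directory URI into the parts relevant for HDFS and ABFS(S)."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        # In abfs(s)://<filesystem>@<accountname>.<endpoint>/<path>,
        # the filesystem name is the part before the "@".
        "filesystem": parts.username,
        "path": parts.path,
    }


# Hypothetical example values.
print(describe_directory_uri("hdfs://hadoopcluster.domain.com:8020/user/data"))
print(describe_directory_uri("abfss://myfs@myaccount.dfs.core.windows.net/raw/logs"))
```

For the HDFS URI, the host and port identify the NameNode endpoint; for the ABFSS URI, the host carries the account name and endpoint.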

Supported Accounts

Note: The security model configured for the Groundplex (SIMPLE or KERBEROS authentication) must match the security model of the remote server. Due to limitations of the Hadoop library, the Snap can create the necessary internal credentials only for the configuration of the Groundplex.

Snap settings

Note: Learn about the common controls in the Snap settings dialog.


Field/Field set	Description
Label `String`	Required. Specify a unique name for the Snap. Change the default to something more descriptive, especially if the pipeline contains more than one Snap of the same type. Default value: Hadoop Directory Browser Example: Browse HDFS directory
Directory `String/Expression/Suggestion`	The URL for the data source (directory). The Snap supports both HDFS and ABFS(S) protocols. Syntax for a typical HDFS URL: `hdfs://hadoopcluster.domain.com:8020/<user>/<folder_details>` Syntax for typical ABFS and ABFSS URLs: `abfs:///<filesystem>/<path>/`, `abfs://<filesystem>@<accountname>.<endpoint>/<path>`, `abfss:///<filesystem>/<path>/`, `abfss://<filesystem>@<accountname>.<endpoint>/<path>` When you use the ABFS protocol to connect to an endpoint, the account name and endpoint details provided in the URL override the corresponding values in the Account Settings fields. Default value: [None]
File filter `String/Expression`	Required. The glob pattern to be applied to select the contents (files/sub-folders) of the directory. You cannot recursively navigate the directory structures. The File filter property can be a JavaScript expression, which is evaluated with the values from the input view document. Examples: `*.txt`, `ab????xx.x`, `*.[jJ][sS][oO][nN]` (as of the May 29th, 2015 release) Default value: None
User Impersonation `Checkbox`	Select this check box to enable user impersonation. For more information on working with user impersonation, see the HDFS Reader Snap documentation. Default status: Deselected
Ignore empty result `Checkbox`	If selected, no document will be written to the output view when the result is empty. If this property is not selected and the Snap receives an input document, the input document is passed to the output view. If this property is not selected and there is no input document, an empty document is written to the output view. Default status: Selected
Snap execution `Dropdown list`	Choose one of the three modes in which the Snap executes. Available options are: `Validate & Execute`: Performs limited execution of the Snap and generates a data preview during pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during pipeline runtime. `Execute only`: Performs full execution of the Snap during pipeline execution without generating preview data. `Disabled`: Disables the Snap and all Snaps that are downstream from it. Default value: Execute only Example: Validate & Execute
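
The three cases described for Ignore empty result can be summarized in a short Python sketch. It models only the documented behavior when no directory entry matches the filter; the function name `documents_for_empty_result` and the use of dictionaries for documents are assumptions made for illustration.

```python
from typing import Optional


def documents_for_empty_result(ignore_empty_result: bool,
                               input_document: Optional[dict]) -> list[dict]:
    """Return what reaches the output view when the browse result is empty."""
    if ignore_empty_result:
        return []                    # selected: nothing is written
    if input_document is not None:
        return [input_document]      # deselected, with input: pass it through
    return [{}]                      # deselected, no input: one empty document
```

With the default (selected) setting, downstream Snaps receive nothing for an empty directory, so a pipeline that must react to "no files found" needs the check box deselected.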

Troubleshooting

Writing to S3 files with HDFS version CDH 5.8 or later

When running HDFS version CDH 5.8 or later, the Hadoop Snap Pack may fail to write to S3 files. To overcome this, make the following changes in Cloudera Manager:

Go to HDFS configuration.
In Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml, add an entry with the following details:
- Name: fs.s3a.threads.max
- Value: 15
Click Save.
Restart all the nodes.
Under Restart Stale Services, select Re-deploy client configuration.
Click Restart Now.
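
For reference, the name/value entry from the steps above corresponds to a standard Hadoop `<property>` element in core-site.xml. The Python sketch below only renders that element with the standard library so it can be compared against the deployed client configuration; the helper name `core_site_property` is hypothetical, and the sketch does not interact with Cloudera Manager.

```python
import xml.etree.ElementTree as ET


def core_site_property(name: str, value: str) -> str:
    """Render one Hadoop configuration property as an XML fragment."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


print(core_site_property("fs.s3a.threads.max", "15"))
```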