Hadoop Directory Browser

Overview

This Snap browses a given directory path in the Hadoop file system (using the HDFS protocol) and generates a list of all the files in the directory and subdirectories. Use this Snap to identify the contents of a directory before you run any command that uses this information.



Note: The Hadoop Directory Browser Snap supports URIs that use the HDFS and ABFS (Azure Data Lake Gen 2) protocols.

For example, if you need to run a specific command iteratively on a list of files, this Snap can help you view the list of all available files.

The Snap reports the following attributes for the contents of the directory:

  • Path (string): The path to the directory being browsed.
  • Type (string): The type of file.
  • Owner (string): The name of the owner of the file.
  • Creation date (datetime): The date the file was created. In the Hadoop file system, this value is often 'null' because of limited API functionality.
  • Size (in bytes) (int): The size of the file.
  • Permissions (string): The Read, Write, and Execute permissions on the file.
  • Update date (datetime): The date the file was last updated.
  • Name (string): The name of the file.
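
As a rough illustration of how a later step might consume these attributes, the sketch below filters a small, made-up list of output documents. The key names (`Name`, `Type`, `Size`, `Path`) mirror the attribute labels above, but the exact keys in the Snap's real output, the `FILE` and `DIRECTORY` type values, and the sample data are all assumptions.

```python
# Hypothetical sample of output documents. Keys mirror the attribute labels
# listed above; they are not a confirmed output schema.
sample_docs = [
    {"Name": "app.conf", "Type": "FILE", "Size": 2048, "Path": "/user/demo/conf"},
    {"Name": "tiny.conf", "Type": "FILE", "Size": 12, "Path": "/user/demo/conf"},
    {"Name": "archive", "Type": "DIRECTORY", "Size": 0, "Path": "/user/demo/conf"},
]


def files_at_least(docs, min_bytes):
    """Return the names of file entries whose size is at least min_bytes."""
    return [
        doc["Name"]
        for doc in docs
        if doc["Type"] == "FILE" and doc["Size"] >= min_bytes
    ]


if __name__ == "__main__":
    print(files_at_least(sample_docs, 1024))
```

A filter like this is one way to narrow the listing before running a command on each remaining file.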


Prerequisites

  • A Groundplex must be configured as a Hadoop client.
  • The user executing the Snap must have at least Read permission on the directory being browsed.

Snap views

Input

This Snap has at most one document input view. The input document contains values for the directory path to be browsed and the glob filter to be applied to select the contents. For example: Directory Path: hdfs://hadoopcluster.domain.com:8020/<user>/<folder_details>; File Filter: *.conf.

Examples of upstream Snaps: Mapper, or any Snap that offers a directory URI. This can even be a CSV Generator with a collection of file names and their URIs.

Output

This Snap has exactly one output view. It provides the attributes (such as Name, Type, Size, Owner, and Last Modification Time) of the contents of the given directory path. Only the contents that match the given glob filter are selected.

Examples of downstream Snaps: Mapper, or any Snap that takes a document listing the attributes of the files contained in the specified directory.

Learn more about Error handling.

Snap settings

Note: Learn about the common controls in the Snap settings dialog.
Field/Field set Description
Label

String

Required. Specify a unique name for the Snap. Change the default to something more descriptive, especially if the pipeline contains more than one Snap of the same type.

Default value: Hadoop Directory Browser

Example: Browse HDFS directory

Directory

String/Expression/Suggestion

The URL for the data source (directory). The Snap supports both HDFS and ABFS(S) protocols.

Syntax for a typical HDFS URL:

hdfs://hadoopcluster.domain.com:8020/<user>/<folder_details>

Syntax for a typical ABFS and an ABFSS URL:

abfs:///<filesystem>/<path>/
abfs://<filesystem>@<accountname>.<endpoint>/<path>
abfss:///<filesystem>/<path>/
abfss://<filesystem>@<accountname>.<endpoint>/<path>

When you use the ABFS protocol to connect to an endpoint, the account name and endpoint details provided in the URL override the corresponding values in the Account Settings fields.

Default value: [None]
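
To make the URL shapes above concrete, here is a minimal sketch that splits a directory URL into its parts with Python's standard `urllib.parse`. It is not how the Snap itself parses URLs. The function name `describe_directory_url`, the account name `myaccount`, and the endpoint `dfs.core.windows.net` in the usage lines are illustrative assumptions.

```python
from urllib.parse import urlsplit


def describe_directory_url(url):
    """Split an hdfs://, abfs:// or abfss:// directory URL into named parts."""
    parts = urlsplit(url)
    if parts.scheme not in ("hdfs", "abfs", "abfss"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")

    if parts.scheme == "hdfs":
        # hdfs://<host>:<port>/<path>
        return {"scheme": "hdfs", "host": parts.hostname,
                "port": parts.port, "path": parts.path}

    if parts.netloc:
        # abfs://<filesystem>@<accountname>.<endpoint>/<path>
        account, _, endpoint = (parts.hostname or "").partition(".")
        return {"scheme": parts.scheme, "filesystem": parts.username,
                "account": account, "endpoint": endpoint, "path": parts.path}

    # abfs:///<filesystem>/<path>/ (account details come from elsewhere)
    filesystem, _, rest = parts.path.lstrip("/").partition("/")
    return {"scheme": parts.scheme, "filesystem": filesystem,
            "account": None, "endpoint": None, "path": "/" + rest}


if __name__ == "__main__":
    # Illustrative values only.
    print(describe_directory_url("hdfs://hadoopcluster.domain.com:8020/user/conf"))
    print(describe_directory_url("abfss://data@myaccount.dfs.core.windows.net/logs"))
    print(describe_directory_url("abfs:///data/logs/"))
```

In the short form (`abfs:///...`) the URL carries no account name or endpoint, so the sketch returns `None` for both.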

File filter

String/Expression

Required. The glob pattern to be applied to select the contents (files and subfolders) of the directory. The pattern cannot be used to navigate directory structures recursively.

The File filter property can be a JavaScript expression, which is evaluated with the values from the input view document.

Example:

  • *.txt
  • ab????xx.*x
  • *.[jJ][sS][oO][nN] (as of the May 29th, 2015 release)

Default value: [None]
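
The example patterns above can be tried out locally with Python's `fnmatch` module, which supports the same `*`, `?`, and `[...]` wildcards. This is only an approximation of the Snap's glob matching, which may differ in details, and the file names are made up for illustration.

```python
from fnmatch import fnmatchcase


def select(names, pattern):
    """Return the names that match the glob pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical directory contents.
names = ["notes.txt", "ab1234xx.tax", "data.JSON", "data.jsonl", "readme.md"]

if __name__ == "__main__":
    for pattern in ("*.txt", "ab????xx.*x", "*.[jJ][sS][oO][nN]"):
        print(pattern, "->", select(names, pattern))
```

Each `?` stands for exactly one character, and each bracket group such as `[jJ]` matches one character from the set, which is how the last pattern accepts any letter case of the `.json` extension.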

User Impersonation

Checkbox

Select this check box to enable user impersonation. For more information on working with user impersonation, see the HDFS Reader Snap documentation.

Default status: Deselected

Ignore empty result

Checkbox

If selected, no document will be written to the output view when the result is empty. If this property is not selected and the Snap receives an input document, the input document is passed to the output view. If this property is not selected and there is no input document, an empty document is written to the output view.

Default status: Selected
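
The three cases described for Ignore empty result can be summarized in a small decision function. This is a sketch of the documented behavior, not the Snap's code. The function name `emit_documents` and the list-of-dicts representation of documents are assumptions.

```python
def emit_documents(matches, input_doc, ignore_empty):
    """Model what reaches the output view.

    matches      -- list of documents for entries that matched the filter
    input_doc    -- the input document, or None if there is no input
    ignore_empty -- state of the Ignore empty result check box
    """
    if matches:
        # Non-empty result: the setting has no effect.
        return matches
    if ignore_empty:
        # Empty result, check box selected: nothing is written.
        return []
    if input_doc is not None:
        # Empty result, check box deselected: pass the input document through.
        return [input_doc]
    # Empty result, check box deselected, no input: write an empty document.
    return [{}]


if __name__ == "__main__":
    print(emit_documents([], {"directory": "/user/demo"}, ignore_empty=False))
    print(emit_documents([], None, ignore_empty=False))
    print(emit_documents([], None, ignore_empty=True))
```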

Snap execution
Choose one of the three modes in which the Snap executes. Available options are:
  • Validate & Execute: Performs limited execution of the Snap and generates a data preview during pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during pipeline runtime.
  • Execute only: Performs full execution of the Snap during pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Default value: Execute only

Example: Validate & Execute

Troubleshooting

Writing to S3 files with HDFS version CDH 5.8 or later

When running HDFS version CDH 5.8 or later, the Hadoop Snap Pack may fail to write to S3 files. To resolve this, make the following changes in Cloudera Manager:

  1. Go to HDFS configuration.
  2. In Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml, add an entry with the following details:
    • Name: fs.s3a.threads.max
    • Value: 15
  3. Click Save.
  4. Restart all the nodes.
  5. Under Restart Stale Services, select Re-deploy client configuration.
  6. Click Restart Now.
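
For reference, the name and value entered in step 2 correspond to a standard Hadoop `<property>` element in core-site.xml. The sketch below renders that element with Python's standard library. The helper name `core_site_property` is hypothetical, and pasting XML into the Safety Valve field instead of using the Name and Value fields is an assumption to verify in your Cloudera Manager version.

```python
import xml.etree.ElementTree as ET


def core_site_property(name, value):
    """Render one Hadoop configuration entry as a <property> XML element."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # Values taken from step 2 above.
    print(core_site_property("fs.s3a.threads.max", 15))
```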