Sequence Formatter

The Sequence Formatter Snap formats incoming documents from upstream Snaps to Hadoop sequence file format.

Overview

This Snap formats incoming documents from upstream Snaps into the Hadoop sequence file format, a native binary format used to persist intermediate data between the stages of MapReduce jobs.
Note: To enable Snappy compression for sequence files, a cluster-level setting must be configured. Unlike the Parquet or ORC writer Snaps, the Sequence Formatter Snap does not list "Snappy" among its "compression" options.

Snap views

Input/Output Type of View Examples of Upstream and Downstream Snaps
Input The upstream Snap for Sequence Formatter should output map/table/key-value formatted data. Valid data types include String, Integer, Number and Boolean. This Snap has at most one document input view.
Output The Sequence Formatter Snap outputs binary data, so the downstream Snap must be a data store output Snap, such as File Writer or HDFS Writer. This Snap has at most one binary output view.
Error This Snap has at most one document error view and produces zero or more documents in the view.

Supported Accounts

Accounts are not used with this Snap.

Snap settings

Note: Learn about the common controls in the Snap settings dialog.
Field Name Description
Label*

String

Required. Specify a unique name for the Snap. Change the label to something more descriptive, especially if the pipeline contains more than one Sequence Formatter Snap.

Default value: Sequence Formatter

Example: Sequence Formatter

Key*

String/Expression

Required. JSON path for the key.

Default value: [None]

Example: $input_column_name

Value*

String/Expression

Required. JSON path for the value.

Default value: [None]

Example: $input_column_name
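
The following is a minimal Python sketch of how the Key and Value paths can be thought of as selecting one field each from every incoming document. It is an illustration only, not the Snap's actual expression engine: the helper names resolve_path and to_key_value_pairs and the sample field names id and name are hypothetical, and only simple dot-separated paths are handled.

```python
from typing import Any, Dict, List, Tuple


def resolve_path(document: Dict[str, Any], path: str) -> Any:
    """Resolve a simple '$a.b' style path against a nested dict (hypothetical helper)."""
    if not path.startswith("$"):
        raise ValueError(f"path must start with '$': {path!r}")
    current: Any = document
    # Drop the leading '$' and walk each dot-separated field name.
    for part in filter(None, path[1:].split(".")):
        current = current[part]
    return current


def to_key_value_pairs(
    documents: List[Dict[str, Any]], key_path: str, value_path: str
) -> List[Tuple[Any, Any]]:
    """Build one (key, value) pair per document, as a sequence file record would hold."""
    return [(resolve_path(d, key_path), resolve_path(d, value_path)) for d in documents]


# Sample documents with assumed field names.
docs = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
]
pairs = to_key_value_pairs(docs, "$id", "$name")
print(pairs)
```

In this sketch, setting Key to $id and Value to $name yields one key-value pair per input document.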

Compression type

Dropdown list

Sequence file compression type. The options available include:

  • Record: Only values are compressed.
  • Block: Both keys and values are compressed.
  • [None]: Records are left uncompressed.

Default value: [None]
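
The difference between the Record and Block options can be sketched in Python with the standard zlib module. This is a simplified illustration under stated assumptions, not the Hadoop on-disk layout: the function names and sample records are hypothetical, zlib stands in for whichever codec is configured, and a real sequence file also stores lengths and sync markers that are omitted here.

```python
import zlib
from typing import List, Tuple

Record = Tuple[bytes, bytes]


def record_compress(records: List[Record]) -> List[Record]:
    """Record style: keep each key as-is and compress each value on its own."""
    return [(key, zlib.compress(value)) for key, value in records]


def block_compress(records: List[Record]) -> Tuple[bytes, bytes]:
    """Block style: batch the keys and the values, then compress each batch."""
    keys = b"\n".join(key for key, _ in records)
    values = b"\n".join(value for _, value in records)
    return zlib.compress(keys), zlib.compress(values)


# Hypothetical sample data: distinct keys, highly repetitive values.
records = [(f"key-{i:04d}".encode(), b"status=OK;region=us-east") for i in range(200)]

per_record = record_compress(records)
keys_block, values_block = block_compress(records)

record_size = sum(len(k) + len(v) for k, v in per_record)
block_size = len(keys_block) + len(values_block)
print("record-style bytes:", record_size)
print("block-style bytes:", block_size)
```

For small, repetitive records like these, the block-style total is smaller because the compressor sees many records at once and the keys are compressed as well.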

Compression codec

String/Expression

Fully qualified compression codec class name.

Default value: [None]

Example: org.apache.hadoop.io.compress.GzipCodec

Snap Execution

Dropdown list

Select one of the following three modes in which the Snap executes:

  • Validate & Execute: Performs limited execution of the Snap, and generates a data preview during Pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during Pipeline runtime.
  • Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Default value: Execute only

Example: Validate & Execute

Troubleshooting

Writing to S3 files with HDFS version CDH 5.8 or later

When running HDFS version CDH 5.8 or later, the Hadoop Snap Pack may fail to write to S3 files. To overcome this, make the following changes in Cloudera Manager:

  1. Go to HDFS configuration.
  2. In Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml, add an entry with the following details:
    • Name: fs.s3a.threads.max
    • Value: 15
  3. Click Save.
  4. Restart all the nodes.
  5. Under Restart Stale Services, select Re-deploy client configuration.
  6. Click Restart Now.
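
For reference, the entry added in step 2 corresponds to a standard Hadoop configuration property element in core-site.xml. The short Python sketch below builds that element so its shape is unambiguous. The helper name build_property is hypothetical, and Cloudera Manager may accept the name and value through form fields rather than raw XML.

```python
import xml.etree.ElementTree as ET


def build_property(name: str, value: str) -> str:
    """Return a Hadoop-style <property> element as a string (hypothetical helper)."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# Name and value taken from step 2 above.
snippet = build_property("fs.s3a.threads.max", "15")
print(snippet)
```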