RC File Formatter

The RC File Formatter Snap formats incoming documents from upstream Snaps into the RC (Record Columnar) file format.

Overview

This Snap converts documents from upstream Snaps into the RC file format. The format stores data column by column within groups of rows, so aggregate queries can read only the columns they need and return results faster.
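To illustrate why a columnar layout helps aggregate queries, here is a minimal conceptual sketch in Python. It is not the Snap's implementation and it does not produce the real binary RC file layout (no headers, sync markers, or compression). The function names, the sample rows, and the group size are all hypothetical.

```python
from typing import Any


def to_row_groups(
    rows: list[dict[str, Any]], columns: list[str], group_size: int
) -> list[dict[str, list[Any]]]:
    """Split rows into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Within a group, all values of one column sit next to each other.
        groups.append({col: [row.get(col) for row in chunk] for col in columns})
    return groups


def column_sum(groups: list[dict[str, list[Any]]], column: str) -> int:
    """Aggregate one column while touching only that column's values."""
    return sum(v for group in groups for v in group[column] if v is not None)


# Hypothetical table-oriented input documents.
rows = [
    {"id": 1, "region": "east", "amount": 10},
    {"id": 2, "region": "west", "amount": 25},
    {"id": 3, "region": "east", "amount": 5},
]

groups = to_row_groups(rows, ["id", "region", "amount"], group_size=2)
total = column_sum(groups, "amount")
print(total)
```

Summing `amount` reads only the `amount` lists in each group and never touches `id` or `region`, which is the property a columnar format exploits.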

Snap views

Input/Output | Type of View | Examples of Upstream and Downstream Snaps
Input | Document | This Snap has at most one document input view. The upstream Snap should output table-oriented data with columns and rows.
Output | Document | The RC File Formatter Snap outputs binary data, so the downstream Snap must be a data output Snap, for example, HDFS Writer.
Error | Document | This Snap has at most one document error view and produces zero or more documents in the view.

Supported Accounts

Accounts are not used with this Snap.

Snap settings

Note: Learn about the common controls in the Snap settings dialog.
Field Name Description
Label*

Default value: RC File Formatter

Example: RC File Formatter

Type: String

Required. Specify a unique name for the Snap. Choose a more descriptive name, especially if the pipeline contains more than one RC File Formatter Snap.
Hive Metastore URL

Default value: [None]

Example: thrift://hive.metastore.com:9083

Type: String/Expression

Specify the URI of the Hive Metastore, for example, thrift://localhost:9083.

Database

Default value: [None]

Example: hive_db

Type: String/Expression/Suggestion

Specify the database that holds the schema for the outgoing RC file data.

Table

Default value: [None]

Example: hive_tbl

Type: String/Expression/Suggestion

Specify the table whose schema should be used for formatting the outgoing RC file data.

Column paths*

Default value: [None]

Example:

Column Name: Fun

Column Path: $column_from_input_data

Column Type: string

Type: Table

Required. Specify the paths where the column values appear in the input document.

  • Column Name: Name of the column.
  • Column Path: JSONPath to the column value in the input document.
  • Column Type: Data type of the column (string, int, etc.).
Snap Execution

Default value: Execute only

Example: Validate & Execute

Type: Dropdown list

Select one of the following three modes in which the Snap executes:

  • Validate & Execute: Performs limited execution of the Snap, and generates a data preview during Pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during Pipeline runtime.
  • Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps that are downstream from it.
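
The Hive Metastore URL and Column paths settings above can be illustrated with a short Python sketch. It only mimics the idea of checking a thrift:// URI and resolving simple paths such as `$a.b` against an input document. It is not how the Snap is implemented, it does not support full JSONPath, and returning `None` for a missing value is an assumption. The helper names, the `city` column, and the sample document are hypothetical.

```python
from typing import Any
from urllib.parse import urlparse


def check_metastore_uri(uri: str) -> tuple[str, int]:
    """Return (host, port) for a thrift://host:port URI, or raise ValueError."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift" or not parsed.hostname or parsed.port is None:
        raise ValueError(f"expected thrift://host:port, got {uri!r}")
    return parsed.hostname, parsed.port


def resolve_path(document: dict[str, Any], path: str) -> Any:
    """Resolve a simple dotted path such as $a.b; None if a key is missing."""
    if not path.startswith("$"):
        raise ValueError(f"column path must start with '$': {path!r}")
    current: Any = document
    for key in path[1:].split("."):
        if key == "":
            continue  # tolerate the "$.a" spelling
        if not isinstance(current, dict) or key not in current:
            return None  # assumption: missing values become None
        current = current[key]
    return current


def to_row(document: dict[str, Any], column_paths: list[dict[str, str]]) -> dict[str, Any]:
    """Build one output row by looking up each configured column path."""
    return {c["name"]: resolve_path(document, c["path"]) for c in column_paths}


# First entry mirrors the example in the settings; the second is hypothetical.
column_paths = [
    {"name": "Fun", "path": "$column_from_input_data", "type": "string"},
    {"name": "city", "path": "$address.city", "type": "string"},
]
document = {"column_from_input_data": "hiking", "address": {"city": "Boulder"}}

host, port = check_metastore_uri("thrift://hive.metastore.com:9083")
row = to_row(document, column_paths)
print(host, port, row)
```

Each Column Name becomes a key in the output row, and each Column Path decides where in the input document its value is read from.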

Troubleshooting

Writing to S3 files with CDH 5.8 or later

When running CDH 5.8 or later, the Hadoop Snap Pack may fail to write to S3 files. To resolve this, make the following changes in Cloudera Manager:

  1. Go to HDFS configuration.
  2. In Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml, add an entry with the following details:
    • Name: fs.s3a.threads.max
    • Value: 15
  3. Click Save.
  4. Restart all the nodes.
  5. Under Restart Stale Services, select Re-deploy client configuration.
  6. Click Restart Now.
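
If the safety valve is edited as XML rather than through name and value fields, the entry from step 2 takes the standard Hadoop `<property>` form. The Python sketch below generates that fragment with the standard library. The `make_property` helper is hypothetical, and whether your Cloudera Manager version expects XML or separate fields is an assumption to verify in your environment.

```python
import xml.etree.ElementTree as ET


def make_property(name: str, value: str) -> str:
    """Render one Hadoop configuration property as an XML fragment."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(prop, encoding="unicode")


snippet = make_property("fs.s3a.threads.max", "15")
print(snippet)
```

The printed fragment can be pasted into the core-site.xml safety valve before you save and restart as described in the steps above.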