ORC Reader

Overview

This Snap reads ORC files from SLDB, HDFS, S3, and WASB, and converts the data into documents.

Note: This Snap supports the HDFS (non-Kerberos), ABFS (Azure Data Lake Storage Gen 2), WASB (Azure storage), and S3 protocols.

This is a Read-type Snap.
Works in Ultra Tasks.

Prerequisites

None

Support

Works with SLDB, HDFS, S3, and WASB.
Works in Ultra Tasks.

Known Issue

The upgrade of the Azure Storage library from v3.0.0 to v8.3.0 has caused the following issue when using the WASB protocol:

When you use invalid credentials for the WASB protocol in Hadoop Snaps (HDFS Reader, HDFS Writer, ORC Reader, Parquet Reader, Parquet Writer), the pipeline does not fail immediately. Instead, it takes 13-14 minutes to display the following error:

reason=The request failed with error code null and HTTP code 0. , status_code=error

SnapLogic® is actively working with Microsoft® Support to resolve the issue.

Learn more about the Azure Storage library upgrade.

Snap views


Type	Format	Number of Views	Examples of Upstream and Downstream Snaps	Description
Input	Document	Min: 0 Max: 1	Any data transformation or formatting Snap, such as Mapper, JSON Formatter, or Filter.	Documents containing directory and file information for the ORC files to be read from SLDB, HDFS, S3, or WASB.
Output	Document	Min: 1 Max: 1	Mapper, JSON Formatter, CSV Formatter	Documents with the columns and data from the ORC files.
Error	Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter while running the Pipeline by choosing one of the following options from the When errors occur list under the Views tab: Stop Pipeline Execution: Stops the current pipeline execution if the Snap encounters an error. Discard Error Data and Continue: Ignores the error, discards that record, and continues with the remaining records. Route Error Data to Error View: Routes the error data to an error view without stopping the Snap execution. Learn more about Error handling in Pipelines.
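
The three error-handling options can be pictured with a small Python sketch. This is an illustration of the behavior described in the table only, not SnapLogic code: the function name `run_with_error_policy` and the policy strings "stop", "discard", and "route" are hypothetical stand-ins for Stop Pipeline Execution, Discard Error Data and Continue, and Route Error Data to Error View.

```python
def run_with_error_policy(records, process, policy):
    """Apply `process` to each record under one of three hypothetical policies.

    "stop"    : re-raise the first error (Stop Pipeline Execution)
    "discard" : drop the failing record and continue
    "route"   : collect the failing record for an error view and continue
    """
    if policy not in ("stop", "discard", "route"):
        raise ValueError(f"unknown policy: {policy}")

    output, errors = [], []
    for record in records:
        try:
            output.append(process(record))
        except Exception as exc:
            if policy == "stop":
                raise  # the whole run fails at the first bad record
            if policy == "route":
                # keep the bad record and the reason so no data is lost
                errors.append({"record": record, "error": str(exc)})
            # "discard": nothing is kept for this record
    return output, errors
```

With "route", the failing records remain available for inspection in the second return value, which mirrors the idea of handling errors without losing data.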

Supported Accounts

Snap settings

Note: Learn about the common controls in the Snap settings dialog.


Field/Field set	Description
Label `String`	Required. Specify a unique name for the Snap. Modify this to be more appropriate, especially if more than one of the same Snaps is in the pipeline. Default value: ORC Reader Example: ORC Reader
Directory `String/Expression/ Suggestion`	The path to the directory from which you want the ORC Reader Snap to read data. All files within the directory must be in ORC format. Basic directory URI structure: HDFS: hdfs://<hostname>:<port>/; S3: s3:///<S3_bucket_name>/<file_path>; WASB: wasb:///<WASB_directory>/<file_name>; ABFS: abfs:///<filesystem>/<path>/ or abfs://<filesystem>@<accountname>.<endpoint>/<path>; ABFSS: abfss:///<filesystem>/<path>/ or abfss://<filesystem>@<accountname>.<endpoint>/<path>. When you use the ABFS protocol to connect to an endpoint, the account name and endpoint details provided in the URL override the corresponding values in the Account Settings fields. Note: With the ABFS protocol, SnapLogic creates a temporary file to store the incoming data. Therefore, the hard drive where the JCC is running should have enough space to temporarily store all the account data coming in from ABFS. The Directory property is not used in the pipeline execution or preview, and is used only in the Suggest operation. When you press the Suggest icon, the Snap displays a list of subdirectories under the given directory. It generates the list by applying the value of the Filter property. Example: wasb:///snaplogic/srikanth_test123/RedWoodcity Default value: hdfs://<hostname>:<port>/
Filter `String`	The glob pattern used to select the files. Example: `*.txt` `*.orc` Default value: *
File `String/Expression/ Suggestion`	Required for standard mode. The filename or a relative path to a file under the directory given in the Directory property. It must not start with the URL separator "/". The File property can be a JavaScript expression, which is evaluated with values from the input view document. When you press the Suggest icon, the Snap displays a list of regular files under the directory in the Directory property. It generates the list by applying the value of the Filter property. Example: sample.orc, tmp/another.orc, _filename Default value: [None]
Snap Execution `Dropdown list`	Choose one of the three modes in which the Snap executes. Available options are: `Validate & Execute`. Performs limited execution of the Snap and generates a data preview during pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during pipeline runtime. `Execute only`. Performs full execution of the Snap during pipeline execution without generating preview data. `Disabled`. Disables the Snap and all Snaps that are downstream from it. Default value: Execute only Example: Validate & Execute
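
As a rough illustration of how the Directory, Filter, and File settings relate, the following Python sketch filters a directory listing with a glob pattern and checks that a File value is relative. It is a sketch under stated assumptions, not the Snap's implementation: the function names, the `listing` argument, and the host `namenode:8020` are hypothetical, and the scheme list covers only the URI forms shown in the Directory row.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# URI schemes taken from the Directory row above.
SUPPORTED_SCHEMES = {"hdfs", "s3", "wasb", "abfs", "abfss"}


def suggest_files(directory, filter_pattern, listing):
    """Return the names in `listing` that match the glob in `filter_pattern`.

    `listing` stands in for the file names found under `directory`;
    a real lookup against the file system is out of scope here.
    """
    scheme = urlsplit(directory).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return [name for name in listing if fnmatchcase(name, filter_pattern)]


def check_file_value(file_value):
    """Reject File values that start with the URL separator "/"."""
    if file_value.startswith("/"):
        raise ValueError('File must not start with "/"')
    return file_value


# Hypothetical listing under a hypothetical HDFS directory.
listing = ["sample.orc", "notes.txt", "tmp/another.orc"]
orc_only = suggest_files("hdfs://namenode:8020/data/", "*.orc", listing)
everything = suggest_files("wasb:///snaplogic/", "*", listing)
```

Here `orc_only` keeps the two `.orc` entries, while the default pattern `*` keeps every entry.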

Troubleshooting

Writing to S3 files with HDFS version CDH 5.8 or later

When running HDFS version CDH 5.8 or later, the Hadoop Snap Pack may fail to write to S3 files. To overcome this, make the following changes in Cloudera Manager:

Go to HDFS configuration.
In Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml, add an entry with the following details:
- Name: fs.s3a.threads.max
- Value: 15
Click Save.
Restart all the nodes.
Under Restart Stale Services, select Re-deploy client configuration.
Click Restart Now.
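
For reference, the name and value entered in the steps above correspond to a standard Hadoop `<property>` element in core-site.xml. The Python sketch below only renders that element as text, for example to paste into the safety valve when it is edited as XML; the helper name `hadoop_property` is hypothetical and the script does not talk to Cloudera Manager.

```python
import xml.etree.ElementTree as ET


def hadoop_property(name, value):
    """Render one Hadoop configuration entry as a <property> element string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = str(value)
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(prop, encoding="unicode")


# Name and value from the troubleshooting steps above.
snippet = hadoop_property("fs.s3a.threads.max", 15)
print(snippet)
```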

Temporary Files

During execution, when larger datasets are processed that exceed the available compute memory, the Snap writes pipeline data to local storage as temporary files to optimize performance. These temporary files are deleted when the Snap/Pipeline execution completes.
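
The general spill-to-disk idea can be sketched in Python as follows. This is a conceptual illustration only, not how the Snap is implemented: the class `SpillBuffer`, its `max_bytes` threshold, and the JSON-lines file layout are all assumptions made for the example.

```python
import json
import os
import tempfile


class SpillBuffer:
    """Keep records in memory until `max_bytes` is exceeded, then use a temp file."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self._memory = []   # serialized records held in memory
        self._size = 0      # characters currently held in memory
        self._file = None   # temp file, created only after spilling

    @property
    def spilled_path(self):
        """Path of the temporary file, or None if nothing has spilled yet."""
        return self._file.name if self._file is not None else None

    def append(self, record):
        line = json.dumps(record)
        if self._file is None and self._size + len(line) > self.max_bytes:
            # Threshold exceeded: move everything held so far to local storage.
            self._file = tempfile.NamedTemporaryFile(
                "w+", suffix=".tmp", delete=False
            )
            for kept in self._memory:
                self._file.write(kept + "\n")
            self._memory.clear()
            self._size = 0
        if self._file is not None:
            self._file.write(line + "\n")
        else:
            self._memory.append(line)
            self._size += len(line)

    def close(self):
        """Delete the temporary file, as happens when execution completes."""
        if self._file is not None:
            self._file.close()
            os.remove(self._file.name)
            self._file = None
        self._memory.clear()
        self._size = 0
```

Small inputs never touch the disk in this sketch; only once the threshold is crossed does a temporary file appear, and `close()` removes it again.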