Read ORC files from HDFS and S3

This example demonstrates how to configure the ORC Reader Snap to read ORC files from a local HDFS instance and from S3.

  1. Configure the ORC Reader Snap to read from a local HDFS instance.
    • Directory: Enter the HDFS directory path containing the ORC file (for example, /tmp).
    • Filename: Specify the ORC file to read (for example, file.orc).

    The Snap reads the ORC file from the specified HDFS directory and outputs the file contents.

  2. Configure the ORC Reader Snap to read from S3.
    • Directory: Enter the S3 path containing the ORC file (for example, s3://bucket-name/file-path).
    • Filename: Specify the ORC file to read (for example, file.orc).

    The Snap reads the ORC file from the specified S3 location and outputs the file contents.

With either configuration, the ORC Reader Snap reads the ORC file and outputs its data.
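
The two configurations above differ only in the Directory value. The following Python sketch illustrates how the example values fit together. It is not part of the Snap: the helper names orc_location and storage_kind are hypothetical. It assumes that Directory and Filename are joined with a single slash, and that a Directory without a URI scheme refers to HDFS.

```python
from urllib.parse import urlsplit


def orc_location(directory: str, filename: str) -> str:
    """Join a Directory value and a Filename value into one location string.

    Assumption: the two fields are combined with a single "/" separator.
    """
    if not filename.lower().endswith(".orc"):
        raise ValueError(f"expected an .orc file, got {filename!r}")
    return directory.rstrip("/") + "/" + filename.lstrip("/")


def storage_kind(directory: str) -> str:
    """Classify a Directory value as "S3" or "HDFS" from its URI scheme.

    Assumption: s3:// and s3a:// mean S3; hdfs:// or no scheme means HDFS.
    """
    scheme = urlsplit(directory).scheme.lower()
    if scheme in ("s3", "s3a"):
        return "S3"
    if scheme in ("", "hdfs"):
        return "HDFS"
    raise ValueError(f"unsupported scheme: {scheme!r}")


# The example values from steps 1 and 2.
for directory in ("/tmp", "s3://bucket-name/file-path"):
    print(storage_kind(directory), orc_location(directory, "file.orc"))
```

Checking the scheme up front is a convenient way to catch a mistyped Directory value before running the pipeline.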

Troubleshooting:

If you encounter issues reading ORC files from S3, add the following setting to your HDFS configuration:

  1. Go to HDFS configuration.
  2. In Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml, add an entry with the following details:
    • Name: fs.s3a.threads.max
    • Value: An appropriate thread count (for example, 10).
  3. Restart all the nodes.
  4. Under Restart Stale Services, select Re-deploy client configuration.
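
If the safety valve field is edited as XML rather than as a name and value pair, the entry from step 2 takes the standard Hadoop property form. The Python sketch below builds that element with the standard library. The helper name core_site_property is hypothetical, the value 10 is only the example from step 2, and ET.indent requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def core_site_property(name: str, value: str) -> str:
    """Return a Hadoop-style <property> element as an indented XML string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    ET.indent(prop, space="  ")  # pretty-print; available in Python 3.9+
    return ET.tostring(prop, encoding="unicode")


# Example value only; choose a thread count that suits your cluster.
snippet = core_site_property("fs.s3a.threads.max", "10")
print(snippet)
```

Paste the printed element into the core-site.xml snippet field, then continue with steps 3 and 4.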