Read Parquet files from HDFS, S3, and Kerberos-secured clusters

This example shows how to configure the Parquet Reader Snap to read Parquet files from HDFS, S3, and Kerberos-secured clusters, and how to use the Catalog Query Snap to supply schema information.

Download the Catalog Query example pipeline.

Display files in a directory using a filter.
- Directory: Enter the directory path containing Parquet files.
- Filter: Use *.parquet to display all Parquet files in the directory.
- Leave the Filename field empty to list all matching files.
The Snap lists every file in that directory that matches the filter.
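As a rough illustration of how a glob-style filter such as *.parquet selects files, the following Python sketch applies the pattern to a hypothetical directory listing. The file names and the filter_files helper are assumptions made for illustration; the Snap's own matching rules may differ in details such as case sensitivity.

```python
from fnmatch import fnmatchcase


def filter_files(names, pattern="*.parquet"):
    """Return the names that match a glob-style pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical directory listing, used only to show the effect of the filter.
listing = ["part-0000.parquet", "part-0001.parquet", "_SUCCESS", "notes.txt"]
matched = filter_files(listing)
print(matched)
```

Only the two names ending in .parquet pass the filter; the other entries are dropped.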

Read from a local HDFS instance.
- Directory: Enter the HDFS path (for example, /tmp/test.parquet).
The Snap reads the Parquet file from the local HDFS instance.

Read from an S3 instance.
- Create an S3 account or use an existing one.
- For a regular S3 account: Name the account and supply the Access-key ID and Secret key.
- For an IAM role-enabled account:
  - Select the IAM role checkbox.
  - Leave the Access-key ID and Secret key blank.
  - The IAM role properties are optional and can be left blank.
- Directory: Use a valid S3 path in the format s3://<bucket name>/<key name prefix>.
The Snap reads the Parquet file from the S3 bucket.
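To make the s3://<bucket name>/<key name prefix> format concrete, the sketch below splits such a path into its bucket and key prefix using Python's standard library. The split_s3_path helper and the example-bucket path are hypothetical; this is not how the Snap itself parses the Directory field, only a way to check that a path has the expected shape.

```python
from urllib.parse import urlparse


def split_s3_path(path):
    """Split s3://<bucket name>/<key name prefix> into (bucket, prefix)."""
    parsed = urlparse(path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not a valid S3 path: {path!r}")
    # The leading slash belongs to the URL syntax, not to the key prefix.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket and prefix, for illustration only.
bucket, prefix = split_s3_path("s3://example-bucket/data/sales/")
print(bucket, prefix)
```

A path without the s3:// scheme, such as a plain HDFS path, raises ValueError in this sketch.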

Read from a Kerberos-secured cluster.
- Configure a Kerberos account with the appropriate authentication settings.
- Associate the Kerberos account with the Parquet Reader Snap.
- Directory: Enter the HDFS path on the Kerberos-secured cluster.
The Snap authenticates using Kerberos and reads the Parquet file from the secured cluster.

Read schema information from the Catalog Query Snap.
This configuration uses the Catalog Query Snap to supply schema information and supports partitioned data.
- Configure a Catalog Query Snap to query the catalog for table metadata and schema information.
- Connect the Catalog Query Snap to the Parquet Reader Snap.
- The Parquet Reader Snap uses the schema information from the Catalog Query Snap to read partitioned Parquet files.
This approach is useful when working with partitioned tables where the schema is managed in a catalog.
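Partitioned tables often encode partition values in directory names. The sketch below assumes a key=value directory layout, which this example does not confirm for the catalog in use, and shows how partition values could be recovered from a file path. The partition_values helper and the example path are hypothetical.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Collect key=value directory segments from a file path (assumed layout)."""
    values = {}
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


# Hypothetical partitioned file path, for illustration only.
example = "/warehouse/sales/year=2023/month=07/part-0000.parquet"
partitions = partition_values(example)
print(partitions)
```

Values come back as strings; a catalog-managed schema is what tells a reader which types those partition columns should have.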

With these configurations, the Parquet Reader Snap reads Parquet files from HDFS, S3, and Kerberos-secured clusters, and it can use the Catalog Query Snap to retrieve schema information dynamically.

To successfully reuse pipelines:

- Download and import the pipeline into SnapLogic.
- Configure Snap accounts as applicable.
- Provide pipeline parameters as applicable.