Partition Parquet files by specific fields

This example demonstrates how to use the Partition By functionality in the Parquet Writer Snap to organize output files into subdirectories based on field values.

Download this pipeline.

Sample input data for this example:

[
  { "month" : "MAR", "day" : "01", "msg" : "Hello, World", "num" : 1 },
  { "month" : "FEB", "day" : "07", "msg" : "Hello, World", "num" : 3 },
  { "month" : "MAR", "day" : "01", "msg" : "Hello, World", "num" : 2 },
  { "month" : "FEB", "day" : "07", "msg" : "Hello, World", "num" : 4 }
]
  1. Configure the Parquet Writer Snap with partition settings.
    • Directory: Enter the base output directory (for example, hdfs://localhost:8080/tmp).
    • Filename: Specify the Parquet filename (for example, sample.parquet).
    • Partition By: Add the fields to partition by:
      • Add month as the first partition field.
      • Add day as the second partition field.

    The Snap creates a subdirectory for each unique combination of partition field values and writes a separate Parquet file into each, as illustrated in the sketch below.
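
    If you want to prototype the same layout outside SnapLogic, the following sketch uses pyarrow (an assumption; the Snap itself does not use pyarrow) to write the sample records with value-only directory partitioning. The local path tmp stands in for the HDFS directory above:

    # Minimal sketch, assuming pyarrow; all paths are illustrative.
    import pyarrow as pa
    import pyarrow.dataset as ds

    records = [
        {"month": "MAR", "day": "01", "msg": "Hello, World", "num": 1},
        {"month": "FEB", "day": "07", "msg": "Hello, World", "num": 3},
        {"month": "MAR", "day": "01", "msg": "Hello, World", "num": 2},
        {"month": "FEB", "day": "07", "msg": "Hello, World", "num": 4},
    ]
    table = pa.Table.from_pylist(records)

    # Directory partitioning writes value-only subdirectories (FEB/07/...),
    # matching the layout in this example; Hive-style partitioning would
    # write month=FEB/day=07 instead.
    ds.write_dataset(
        table,
        base_dir="tmp",  # stands in for hdfs://localhost:8080/tmp
        format="parquet",
        partitioning=ds.partitioning(
            pa.schema([("month", pa.string()), ("day", pa.string())])
        ),
    )
    # Resulting layout:
    #   tmp/FEB/07/part-0.parquet
    #   tmp/MAR/01/part-0.parquet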

  2. For S3 destinations, configure IAM role settings if needed.
    • Create an S3 account or use an existing one.
    • For a regular S3 account: Name the account and supply the Access-key ID and Secret key.
    • For an IAM role-enabled account:
      • Select the IAM role checkbox.
      • Leave the Access-key ID and Secret key blank.
      • The IAM role properties are optional and can be left blank.
    • Use a valid S3 path in the format: s3://<bucket name>/<folder name>/.../<filename>
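
    As a hedged illustration of why the key fields can stay blank with an IAM role: when the node runs under an instance profile, the AWS SDK resolves credentials from the role automatically. The sketch below assumes boto3 (not part of the Snap) and a hypothetical bucket name:

    # Minimal sketch, assuming boto3 and a node running under an IAM
    # instance profile: no access key or secret key appears anywhere.
    import boto3

    # STS resolves credentials from the instance profile automatically.
    identity = boto3.client("sts").get_caller_identity()
    print("Running as:", identity["Arn"])  # e.g. an assumed-role ARN

    # The same implicit credentials grant bucket access; "my-bucket"
    # is a hypothetical placeholder.
    s3 = boto3.client("s3")
    s3.head_bucket(Bucket="my-bucket")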

Executing the pipeline writes separate Parquet files, organized into subdirectories by the partition field values:

  • hdfs://localhost:8080/tmp/FEB/07/sample.parquet (contains records where month=FEB and day=07)
  • hdfs://localhost:8080/tmp/MAR/01/sample.parquet (contains records where month=MAR and day=01)

This partitioning strategy improves query performance because readers can prune partitions, scanning only the subdirectories that match a query instead of the entire dataset (see the sketch below).
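
To see the pruning benefit concretely, here is a sketch that again assumes pyarrow and a local copy of the partitioned output; only files under the FEB/07 partition are opened, while MAR/01 is skipped entirely:

# Minimal sketch, assuming pyarrow and the value-only layout above.
import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset(
    "tmp",  # stands in for the base output directory
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema([("month", pa.string()), ("day", pa.string())])
    ),
)

# Partition pruning: only files under tmp/FEB/07/ are read.
feb_07 = dataset.to_table(
    filter=(ds.field("month") == "FEB") & (ds.field("day") == "07")
)
print(feb_07.to_pylist())  # the two records with num 3 and 4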

To successfully reuse pipelines:
  1. Download and import the pipeline into SnapLogic.
  2. Configure Snap accounts as applicable.
  3. Provide pipeline parameters as applicable.