Partition Parquet files by specific fields
This example demonstrates how to use the Partition By functionality in the Parquet Writer Snap to organize output files into subdirectories based on field values.
Sample input data for this example:
[
  { "month" : "MAR", "day" : "01", "msg" : "Hello, World", "num" : 1 },
  { "month" : "FEB", "day" : "07", "msg" : "Hello, World", "num" : 3 },
  { "month" : "MAR", "day" : "01", "msg" : "Hello, World", "num" : 2 },
  { "month" : "FEB", "day" : "07", "msg" : "Hello, World", "num" : 4 }
]
The pipeline execution generates separate Parquet files organized in subdirectories based on the partition fields:
hdfs://localhost:8080/tmp/FEB/07/sample.parquet (contains records where month=FEB and day=07)
hdfs://localhost:8080/tmp/MAR/01/sample.parquet (contains records where month=MAR and day=01)
This partitioning strategy improves query performance: a downstream reader that filters on the partition fields (for example, month=FEB) can read only the matching subdirectories instead of scanning every file.
To run this example:
- Download and import the pipeline into SnapLogic.
- Configure Snap accounts as applicable.
- Provide pipeline parameters as applicable.