The following example pipeline demonstrates how to execute a PySpark
script on a Windows Groundplex to read and display CSV data. The script takes a CSV file path as an
argument, reads the file using Spark DataFrame operations, and displays its contents.
The pipeline is configured to run locally on the Windows Groundplex.

- Download this Pipeline.
- Configure the PySpark Script Snap as follows:
- Click the Edit PySpark script button and provide the following script. The script is configured to run with spark-submit in local master mode, using all available cores.
This PySpark script reads and displays CSV data. It accepts a file path as a command-line argument and processes the file using Apache Spark's DataFrame operations.
from pyspark.sql import SparkSession
import sys
import os

spark = SparkSession.builder.appName("PySpark_ScriptArgs_File").getOrCreate()

# Check if a file argument is provided
if len(sys.argv) > 1:
    file_path = sys.argv[1]
    if os.path.exists(file_path):
        # Example: Read CSV as DataFrame
        df = spark.read.csv(file_path, header=True, inferSchema=True)
        print("Data from file:")
        df.show()
    else:
        print(f"File does not exist: {file_path}")

spark.stop()
This script is executed with the argument
C:/pyspark_scripts/input_data.csv. It reads and displays that
specific CSV file using Spark's distributed processing engine, even though the job
runs locally with all available CPU cores (--master local[*]).
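For reference, the following is a minimal sketch (not part of the pipeline) showing how the same local execution mode could alternatively be pinned inside the script itself instead of through the spark-submit arguments; the appName value is reused from the example above.

# Illustrative only: the example relies on spark-submit passing --master local[*],
# but the same local mode (all available cores on this machine) can also be set
# on the SparkSession builder itself.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")              # run locally, using all available cores
    .appName("PySpark_ScriptArgs_File")
    .getOrCreate()
)
print(spark.sparkContext.master)     # prints the effective master, e.g. local[*]
spark.stop()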
- Validate the Snap. On validation, the Snap displays the output of the script.
- Execute the Snap. On execution, the PySpark script creates a Spark
session, accepts the CSV file path as a command-line argument, reads the file into a
DataFrame with header and schema inference, displays the data, and stops the Spark
session.
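If you want to verify the read-and-display behavior outside SnapLogic first, a standalone sketch along these lines can help; the sample columns (id, name, amount) and the temporary file path are hypothetical and exist only to exercise header and schema inference.

# Hypothetical local check: write a small CSV, then read it back exactly the way
# the example script does (header=True, inferSchema=True) and display the result.
import csv
import os
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("csv_read_check").getOrCreate()

sample_path = os.path.join(tempfile.gettempdir(), "input_data_sample.csv")
with open(sample_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "amount"])   # hypothetical header row
    writer.writerow([1, "alpha", 10.5])
    writer.writerow([2, "beta", 20.0])

df = spark.read.csv(sample_path, header=True, inferSchema=True)
df.printSchema()   # confirms that schema inference picked up integer/string/double types
df.show()          # same call the example script uses to display the data
spark.stop()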
To successfully reuse pipelines:
- Download and import the pipeline into SnapLogic.
- Configure Snap accounts as applicable.
- Provide pipeline parameters as applicable.