Read CSV data with PySpark

The following example pipeline demonstrates how to execute a PySpark script on a Windows Groundplex to read and display CSV data. The script takes a CSV file path as a command-line argument, reads the file using Spark DataFrame operations, and displays its contents. The pipeline is configured to run locally on the Windows Groundplex.
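For illustration, assume the input file used later in this example, C:/pyspark_scripts/input_data.csv, contains a small table such as the one below. These contents are a hypothetical example, not part of the downloadable pipeline:

    id,name,city
    1,Alice,Boston
    2,Bob,Chicago
    3,Carol,Denver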



Download this Pipeline.
  1. Configure the PySpark script Snap as follows:


  2. Click the Edit PySpark script button and provide the following script. The script is configured to run with spark-submit in local master mode using all available cores.

    This PySpark script reads and displays CSV data. It accepts a file path as a command-line argument and processes the file using Apache Spark's DataFrame operations.

    from pyspark.sql import SparkSession
    import sys
    import os

    # Create or reuse a Spark session
    spark = SparkSession.builder.appName("PySpark_ScriptArgs_File").getOrCreate()

    # Check if a file argument is provided
    if len(sys.argv) > 1:
        file_path = sys.argv[1]
        if os.path.exists(file_path):
            # Read the CSV file into a DataFrame with header and schema inference
            df = spark.read.csv(file_path, header=True, inferSchema=True)
            print("Data from file:")
            df.show()
        else:
            print(f"File does not exist: {file_path}")

    # Stop the Spark session
    spark.stop()

    This script is executed with the argument C:/pyspark_scripts/input_data.csv. It reads and displays that specific CSV file using Spark's distributed processing engine, even though it runs locally with all available CPU cores (--master local[*]). A sketch of the equivalent spark-submit invocation is shown below.
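    For reference, the submission is conceptually equivalent to the following spark-submit command. The script file name read_csv.py is an assumption for illustration; the Snap manages the actual script file and submission internally.

        spark-submit --master local[*] C:/pyspark_scripts/read_csv.py C:/pyspark_scripts/input_data.csv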

  3. Validate the Snap. On validation, the Snap displays the following output.


  4. Execute the Snap. On execution, it runs the PySpark script, which creates a Spark session, accepts the CSV file path as a command-line argument, reads the file into a DataFrame with header and schema inference, displays the data, and stops the Spark session.
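    With the hypothetical input_data.csv shown earlier, the script's df.show() call prints an ASCII table of roughly this form (exact column widths depend on the data):

        Data from file:
        +--+-----+-------+
        |id| name|   city|
        +--+-----+-------+
        | 1|Alice| Boston|
        | 2|  Bob|Chicago|
        | 3|Carol| Denver|
        +--+-----+-------+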
To successfully reuse pipelines:
  1. Download and import the pipeline into SnapLogic.
  2. Configure Snap accounts as applicable.
  3. Provide pipeline parameters as applicable.
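For example, rather than hard-coding the CSV path in the script arguments, you could define a pipeline parameter (say, file_path) and, if the arguments field is expression-enabled, reference it as _file_path, the standard SnapLogic syntax for pipeline parameters. The parameter name here is an assumption for illustration.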