PySpark Snap Setup for Windows

Overview

This document explains how to set up PySpark on a Windows system. It covers prerequisites, installation steps, an example, and troubleshooting.

Prerequisites
  • Python and Java: Install both Python and Java on your system.
    • For Spark 3.x, use Java 11.
  • Python version for Spark 3.5.6: If you're using Spark 3.5.6, Python versions 3.10.x or 3.11.x are recommended. This guide uses Python v3.10.5.
  • Add Python to PATH: During Python installation, select the "Add Python to PATH" option.
  • Set JAVA_HOME without spaces: When setting your JAVA_HOME environment variable, ensure there are no spaces in the path.
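    • Correct: C:\jdk11 (the JDK installation folder, not its bin subfolder, because Spark's launch scripts look for %JAVA_HOME%\bin\java)
    • Incorrect: C:\Program Files\jdk11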
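The prerequisites above can be checked with a short script before you continue. The following is a minimal sketch, not part of the official setup procedure; the function names check_python_version and check_java_home are hypothetical and chosen only for illustration.

```python
import os
import sys


def check_python_version(version_info=sys.version_info):
    """Return True if the interpreter is 3.10.x or 3.11.x (recommended for Spark 3.5.6)."""
    return (version_info[0], version_info[1]) in {(3, 10), (3, 11)}


def check_java_home(java_home):
    """Return a list of problems found in a JAVA_HOME value (an empty list means OK)."""
    problems = []
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif " " in java_home:
        problems.append("JAVA_HOME contains a space: " + java_home)
    return problems


if __name__ == "__main__":
    # Report on the interpreter and environment that run this script.
    print("Python version recommended for Spark 3.5.6:", check_python_version())
    for problem in check_java_home(os.environ.get("JAVA_HOME")):
        print("Problem:", problem)
```

Run the script with the same Python interpreter that you plan to use with Spark. It only reads the environment and does not change any settings.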
Pre-installation steps
  1. Download Spark: Visit the Apache Spark archive and select the appropriate package (with or without Hadoop, depending on your system's configuration).

  2. Download Winutils: Go to the winutils GitHub repository. Choose the latest Hadoop version and download the winutils.exe file from that specific release (for example, the Hadoop 3.3.1 winutils.exe).

  3. Download an extraction tool: Install a file extraction tool such as WinRAR or 7-Zip. The downloaded Spark file is a .tgz archive and must be extracted before you can install it.
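If you prefer not to install a separate extraction tool, Python's standard tarfile module can also unpack a .tgz archive. The following is a minimal sketch of that alternative, not part of the official procedure; the function name extract_spark_archive is hypothetical, and the archive path is whatever file you downloaded in step 1.

```python
import tarfile
from pathlib import Path


def extract_spark_archive(archive_path, dest_dir):
    """Extract a .tgz archive into dest_dir and return its top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Collect the top-level folder name(s) so the caller knows what was created.
        top_level = sorted({member.name.split("/")[0] for member in archive.getmembers()})
        archive.extractall(dest)
    return top_level
```

Only extract archives that you downloaded from a source you trust. The returned list normally contains a single folder name, which is the extracted Spark folder that the installation steps ask you to move and rename.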

Install Spark
Now that you have downloaded and extracted the Spark file, follow these steps:
  1. Move and rename Spark folder:

    1. Move the extracted Spark folder to your C: drive.

    2. Rename the folder to spark (C:\spark).

  2. Add Spark to path (SPARK_HOME):
    1. Navigate to Windows > System Properties.
    2. Click Environment Variables.
    3. Under System variables, click New...
    4. In the Variable name field, enter SPARK_HOME.
    5. In the Variable value field, enter C:\spark.
    6. Click OK.
    7. Select the Path variable from System variables and click Edit.
    8. Click New and add %SPARK_HOME%\bin.
    9. Click OK.
  3. Create Hadoop folder and add Winutils:
    1. Create a new folder named hadoop in your C: drive (C:\hadoop).

    2. Inside the hadoop folder, create a bin folder.

    3. Move the winutils.exe file (which you downloaded earlier) into this bin folder (C:\hadoop\bin\winutils.exe).

  4. Add Hadoop to Path (HADOOP_HOME): Add this path to your system environment variables, similar to Spark:
    1. Navigate to Windows > System Properties > Environment Variables.

    2. Under "System variables," click New...
    3. In the "Variable name" field, enter HADOOP_HOME.
    4. In the "Variable value" field, enter C:\hadoop.

    5. Click OK.
    6. Select the "Path" variable from "System variables" and click Edit.
    7. Click New and add %HADOOP_HOME%\bin.
    8. Click OK.
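After you set the variables, open a new terminal so that the changes take effect. The following minimal sketch then checks that SPARK_HOME, HADOOP_HOME, the two bin entries on Path, and winutils.exe are all in place. The function name check_spark_env is hypothetical, and the script assumes the folder layout described in the steps above.

```python
import os
from pathlib import PureWindowsPath


def check_spark_env(env, path_exists=os.path.exists):
    """Return a list of problems with SPARK_HOME, HADOOP_HOME, and Path (empty means OK)."""
    problems = []
    # Normalize Path entries: trim whitespace and trailing backslashes, ignore case.
    path_entries = [
        entry.strip().rstrip("\\").lower()
        for entry in env.get("PATH", "").split(";")
        if entry.strip()
    ]
    for name in ("SPARK_HOME", "HADOOP_HOME"):
        home = env.get(name)
        if not home:
            problems.append(name + " is not set")
            continue
        bin_dir = str(PureWindowsPath(home) / "bin")
        if bin_dir.lower() not in path_entries:
            problems.append(bin_dir + " is not on Path")
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        winutils = str(PureWindowsPath(hadoop_home) / "bin" / "winutils.exe")
        if not path_exists(winutils):
            problems.append(winutils + " does not exist")
    return problems


if __name__ == "__main__":
    # Check the environment of the terminal that runs this script.
    for problem in check_spark_env(os.environ):
        print("Problem:", problem)
```

If the script prints nothing, the variables match the layout in this guide. The path_exists parameter exists only so that the check can be exercised without touching the file system.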
Run Commands
To interact with Spark, you can open either a Spark shell or a PySpark shell in your terminal and execute commands.
  • $ spark-shell: This command opens a Spark shell directly in your terminal, allowing you to run Spark scripts.

  • $ pyspark: This command works like spark-shell, but opens a PySpark shell for Python-based Spark interactions.

Examples: Basic commands
Here are some basic commands you can try in the PySpark shell:
# Simple data
data = [
    ("Alice", 25, "Engineer"),
    ("Bob", 30, "Data Scientist"),
    ("Charlie", 35, "Manager"),
    ("Diana", 28, "Analyst")
]
columns = ["name", "age", "role"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show the data
df.show()
# Show with specific number of rows
df.show(2)  # Show only 2 rows

With these steps, your PySpark installation is complete.

Post-installation steps

To run the PySpark Snap from the Control Plane on your Groundplex, download the installation files. Learn more about how to download the installation files in your Groundplex.

Important: Ensure that the JAVA_HOME path does not have any spaces.

Troubleshooting

This section describes an issue that you might encounter when running the PySpark Snap on a Windows Groundplex, and how to resolve it.

Error: Exception in thread "main" java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified
Resolution:
  1. Find the location of Python by running the command where python. The command displays the full path of the Python executable (python.exe).
  2. In the same location, copy python.exe to python3.exe so that the python3 command can be found.
Note: If you encounter the same issue on Linux, reach out to [email protected]

Example:
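The copy step in the resolution can also be scripted. The following is a minimal sketch under the assumption that the Python executable found on your PATH is the one the Snap should use; the function name copy_python_to_python3 is hypothetical.

```python
import shutil
from pathlib import Path


def copy_python_to_python3(python_path):
    """Copy a Python executable to a sibling file named python3 (same extension)."""
    source = Path(python_path)
    target = source.with_name("python3" + source.suffix)
    # Leave an existing python3 executable untouched.
    if not target.exists():
        shutil.copy2(source, target)
    return target


# Example usage (shutil.which plays the role of `where python`):
#   copy_python_to_python3(shutil.which("python"))
```

If Python is installed in a protected folder, you may need to run the script from a terminal with administrator rights.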