PySpark Snap Setup for Windows
Overview
This document provides comprehensive instructions for setting up PySpark on a Windows system, including prerequisites, installation steps, a usage example, and troubleshooting.
Prerequisites

- Python and Java: You need both Python and Java installed on your system. For Spark 3.x, use Java 11.
- Python version for Spark 3.5.6: If you are using Spark 3.5.6, Python 3.10.x or 3.11.x is recommended. This guide uses Python 3.10.5.
- Add Python to PATH: During Python installation, make sure to select the option to "Add Python to PATH."
- No spaces in JAVA_HOME: When setting your `JAVA_HOME` environment variable, ensure there are no spaces in the path (see the check sketched below).
  - Correct: `C:\jdk11\bin`
  - Incorrect: `C:\Program Files\jdk11\bin`
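As a quick, optional sanity check, the snippet below confirms the Python version and that `JAVA_HOME` contains no spaces; it assumes Python is already on your PATH:

```python
import os
import sys

# Python 3.10.x or 3.11.x is recommended for Spark 3.5.6 (see above).
print("Python:", sys.version.split()[0])

# JAVA_HOME must be set and must not contain spaces.
java_home = os.environ.get("JAVA_HOME", "")
if not java_home:
    print("JAVA_HOME is not set.")
elif " " in java_home:
    print(f"JAVA_HOME contains spaces: {java_home!r} -- reinstall the JDK under a path without spaces.")
else:
    print("JAVA_HOME looks good:", java_home)
```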
Installation

- Download Spark: Visit the Apache Spark archive and select the appropriate package (with or without Hadoop, depending on your system's configuration).
- Download Winutils: Go to the winutils GitHub repository. Choose the latest Hadoop version and download the `winutils.exe` file from that specific release (for example, the Hadoop 3.3.1 `winutils.exe`).
- Download WinRAR or 7-Zip: Download a file extraction tool like WinRAR or 7-Zip. This is necessary because the downloaded Spark file is in `.tgz` format and requires extraction (a Python-based alternative is sketched after this list).
- Move and rename the Spark folder:
  - Move the extracted Spark folder to your `C:` drive.
  - Rename the folder to `spark` (`C:\spark`).
- Add Spark to the path (SPARK_HOME):
  - Navigate to Windows > System Properties.
  - Click Environment Variables.
  - Under System variables, click New....
  - In the Variable name field, enter `SPARK_HOME`.
  - In the Variable value field, enter `C:\spark`.
  - Click OK.
  - Select the Path variable from System variables and click Edit.
  - Click New and add `%SPARK_HOME%\bin`.
  - Click OK.
- Create a Hadoop folder and add Winutils:
  - Create a new folder named `hadoop` in your `C:` drive (`C:\hadoop`).
  - Inside the `hadoop` folder, create a `bin` folder.
  - Move the `winutils.exe` file you downloaded earlier into this `bin` folder (`C:\hadoop\bin\winutils.exe`).
- Add Hadoop to the path (HADOOP_HOME): Add this path to your system environment variables, similar to Spark (a verification sketch follows this list):
  - Navigate to Windows > System Properties > Environment Variables.
  - Under System variables, click New....
  - In the Variable name field, enter `HADOOP_HOME`.
  - In the Variable value field, enter `C:\hadoop`.
  - Click OK.
  - Select the Path variable from System variables and click Edit.
  - Click New and add `%HADOOP_HOME%\bin`.
  - Click OK.
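If you would rather not install a separate extraction tool, Python's built-in `tarfile` module can also unpack the `.tgz` archive mentioned in the download step above. This is a minimal sketch; the archive name and download folder are illustrative and should match your actual download:

```python
import tarfile
from pathlib import Path

# Illustrative archive name and location; adjust to your actual download.
archive = Path.home() / "Downloads" / "spark-3.5.6-bin-hadoop3.tgz"

# Unpack next to the archive; this produces a spark-3.5.6-bin-hadoop3
# folder that you can then move to C:\ and rename to spark.
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(path=archive.parent)
```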
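Once the folders and variables are in place, the following optional sketch checks that everything matches the values used above (`C:\spark`, `C:\hadoop`). Note that variables set through System Properties only apply to terminals opened afterwards:

```python
import os
from pathlib import Path

# Expected locations from the steps above.
expected = {"SPARK_HOME": r"C:\spark", "HADOOP_HOME": r"C:\hadoop"}

for name, path in expected.items():
    value = os.environ.get(name)
    status = "OK" if value and Path(value) == Path(path) else "CHECK"
    print(f"{status}: {name} = {value!r} (expected {path})")

# winutils.exe must be at %HADOOP_HOME%\bin\winutils.exe.
winutils = Path(expected["HADOOP_HOME"]) / "bin" / "winutils.exe"
print("winutils.exe present:", winutils.is_file())
```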
Usage example

After installation, you can start Spark from a command prompt:

- `spark-shell`: This command opens a Spark shell directly in your terminal, allowing you to run Spark scripts.
- `pyspark`: This command functions similarly to `spark-shell`, opening a PySpark shell for Python-based Spark interactions.

For example, run the following in the `pyspark` shell, where the `spark` session object is already defined:
```python
# Simple data
data = [
("Alice", 25, "Engineer"),
("Bob", 30, "Data Scientist"),
("Charlie", 35, "Manager"),
("Diana", 28, "Analyst")
]
columns = ["name", "age", "role"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show the data
df.show()
# Show with specific number of rows
df.show(2)  # Show only 2 rows
```

With these steps, your PySpark installation is complete.
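Note that the `pyspark` shell predefines the `spark` session object. If you save the example as a standalone script and run it with `python` instead, you must create the session yourself. A minimal sketch, with an illustrative app name:

```python
from pyspark.sql import SparkSession

# In the pyspark shell this session already exists as `spark`;
# a standalone script has to build it explicitly.
spark = (
    SparkSession.builder
    .appName("PySparkSetupCheck")  # illustrative app name
    .master("local[*]")            # run locally on all cores
    .getOrCreate()
)

data = [("Alice", 25, "Engineer"), ("Bob", 30, "Data Scientist")]
df = spark.createDataFrame(data, ["name", "age", "role"])
df.show()

spark.stop()  # Release local resources when the script finishes.
```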
To run the PySpark Snap from the Control Plane on your Groundplex, first download the installation files. Learn more about how to download the installation files to your Groundplex.
Troubleshooting
This section covers common issues that can occur when running the PySpark Snap on a Windows Groundplex.
| Error | Resolution |
|---|---|
| `Exception in thread "main" java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified` | Note: If you encounter the same issue in Linux, reach out to [email protected] |
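This error typically means Spark is trying to launch a `python3` executable, which usually does not exist on Windows, where the interpreter is installed as `python`. One possible workaround, sketched below rather than taken from this guide, is to point Spark's standard `PYSPARK_PYTHON` setting at your interpreter before creating the session:

```python
import os
from pyspark.sql import SparkSession

# PYSPARK_PYTHON tells Spark which interpreter to launch for Python workers.
# "python" assumes the interpreter from the Prerequisites section is on PATH;
# an absolute path such as r"C:\Python310\python.exe" also works.
os.environ["PYSPARK_PYTHON"] = "python"

spark = SparkSession.builder.appName("PythonWorkerCheck").master("local[*]").getOrCreate()

# This computation runs in Python worker processes, so it fails with the
# "Cannot run program 'python3'" error when no python3 executable exists.
print(spark.sparkContext.parallelize(range(5)).map(lambda x: x * x).collect())

spark.stop()
```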