PySpark Snap Setup for Linux
Overview
This document provides instructions for setting up PySpark on a Linux system, including prerequisites, installation steps, usage examples, and post-installation configuration for running the PySpark Snap.
- Python and Java: You'll need both Python and Java installed on your system. For Spark 3.x, use Java 11.
- Download Spark: Visit the Apache Spark archive and select the appropriate package (with or without Hadoop, depending on your system's configuration).
- Extract and set up Spark.
$ tar -xvzf spark-3.5.6-bin-hadoop3.tgz
$ mv spark-3.5.6-bin-hadoop3 /home/gaian/software/spark
- Configure environment variables.
- Edit your .bashrc:
$ nano ~/.bashrc
- Add the following at the end:
export SPARK_HOME=/home/gaian/software/spark
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
- Reload:
$ source ~/.bashrc
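Optionally, you can confirm the new variables are visible from Python before moving on; this is a minimal sanity-check sketch (it only assumes the SPARK_HOME path configured above):
import os

# SPARK_HOME should point at the extracted Spark directory
spark_home = os.environ.get("SPARK_HOME")
print("SPARK_HOME =", spark_home)

# spark-submit should now be reachable under $SPARK_HOME/bin
print("spark-submit found:", bool(spark_home) and os.path.isfile(os.path.join(spark_home, "bin", "spark-submit")))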
$ spark-shell
This command opens a Scala-based Spark shell directly in your terminal, allowing you to run Spark code interactively.
$ pyspark
This command functions similarly to spark-shell, opening a PySpark shell for Python-based Spark interactions.
Here are some basic commands you can try in the PySpark shell:
# Simple data
data = [
("Alice", 25, "Engineer"),
("Bob", 30, "Data Scientist"),
("Charlie", 35, "Manager"),
("Diana", 28, "Analyst")
]
columns = ["name", "age", "role"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show the data
df.show()
# Show with specific number of rows
df.show(2)  # Show only 2 rows
With these steps, your PySpark installation is complete.
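Beyond df.show(), you can also try a couple of simple transformations in the same shell; this is a minimal sketch that reuses the df created above (the spark session is provided automatically by the PySpark shell):
# Select two columns and keep only rows where age is greater than 27
df.select("name", "age").filter(df.age > 27).show()

# Count how many people hold each role
df.groupBy("role").count().show()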
Post-installation steps
To run the PySpark Snap from the Control Plane on your Groundplex, follow the instructions below:
The Groundplex node runs as a specific OS user (you can find it with ps -ef | grep snaplogic), so make sure that user can read and execute your Spark install and your script:
$ sudo chmod +x /home/gaian/software/spark/bin/spark-submit
$ sudo chmod o+x /home/xyz
$ sudo chmod o+x /home/xyz/software
$ sudo chmod o+x /home/xyz/software/spark
$ sudo chmod o+x /home/xyz/software/spark/bin
$ sudo chmod +r /path/to/your/pyspark_script.py
Run the Scripts
Spark Home: This specifies the path to your Spark folder, where the bin folder is located.
Default Script: The default script appears when you drag and drop the PySpark Snap onto the Designer canvas and click Edit PySpark Script.
- Old script
<The old script content would be here if provided>
- Modified Script
Python
import sys
from operator import add

from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: wordcount <master> <file>", file=sys.stderr)
        sys.exit(-1)

    sc = SparkContext(sys.argv[1], "WordCount")
    lines = sc.textFile(sys.argv[2], 1)
    counts = (
        lines.flatMap(lambda x: x.split(" "))
        .map(lambda x: (x, 1))
        .reduceByKey(add)
    )
    output = counts.collect()
    for (word, count) in output:
        print(f"{word}: {count}")
Since the default script expects arguments, specifically an input text file for word counting, you need to specify the path of that file in the Script args field.
Terminal Command:
$ /home/xyz/software/spark/bin/spark-submit /home/xyz/spark_scripts/word_count.py local[*] /home/xyz/spark_scripts/sample.txt
Output
Hello: 4
my: 6
name: 6
is: 6
john: 3
doe: 3
bye: 3
i: 1
am: 1
working: 1
in: 2
xyz: 1
Solutions: 1
MTS-I: 1
role.: 1
- The PySpark Snap includes a default script, which you can view by clicking the Edit PySpark Script button. To run the command above, you must first create the file "word_count.py" containing that default code in your scripts directory.
- sample.txt is the input file; the program counts the occurrences of each word in it (see the sketch below if you need to create one).
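If you don't already have an input file, the sketch below creates one. The path matches the command above, and the file contents here are only illustrative (the counts in your output will reflect whatever text you write), so adjust both as needed:
# Hypothetical helper to create a small input file for the word count example
lines = [
    "Hello my name is john doe",
    "bye bye I am working in xyz",
]
with open("/home/xyz/spark_scripts/sample.txt", "w") as f:
    f.write("\n".join(lines) + "\n")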
Custom Script
To execute a custom script from your file system, specify the path to it in the Spark submit args field.
$ /home/xyz/software/spark/bin/spark-submit /home/xyz/spark_scripts/sample_script.py
The script used here (sample_script.py) is shown below:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("Ubuntu Virtual Env Test") \
    .getOrCreate()

print(f"Spark Version: {spark.version}")
print(f"Python Version: {spark.sparkContext.pythonVer}")
# Create sample data
data = [(1, "Hello"), (2, "PySpark"), (3, "Ubuntu")]
df = spark.createDataFrame(data, ["id", "message"])
print("\nSample Data:")
df.show()
# Stop session
spark.stop()
print("Test completed successfully!")
Terminal Command (save the above file as sample_script.py):
$ /home/xyz/software/spark/bin/spark-submit /home/xyz/spark_scripts/sample_script.py
Output
Spark Version: 3.5.6
Python Version: 3.12
Sample Data:
+---+-------+
| id|message|
+---+-------+
| 1| Hello|
| 2|PySpark|
| 3| Ubuntu|
+---+-------+
Start Spark Master and Worker (Optional)
Start Spark Master
$ $SPARK_HOME/sbin/start-master.sh
Note the master URL from the console output or the web UI (default: spark://<hostname>:7077).
Start Spark Worker
$ $SPARK_HOME/sbin/start-worker.sh spark://<hostname>:7077
If you need to restart from a clean state, use the following sequence:
# stop any running Spark processes (safe to run)
$ $SPARK_HOME/sbin/stop-all.sh
# To stop just the worker (use this instead of stop-all.sh when you only want to stop the worker)
$ $SPARK_HOME/sbin/stop-worker.sh
# (optional) clear stale work dir that can cause weird state
$ rm -rf $SPARK_HOME/work/* /tmp/spark-* 2>/dev/null || true
# start master
$ $SPARK_HOME/sbin/start-master.sh
# start worker and connect to master (use precise host below)
$ $SPARK_HOME/sbin/start-worker.sh spark://<hostname>:7077
Confirm the master started and get its URL
After starting, run:
# check processes
ps -ef | grep -E 'org.apache.spark.deploy.master.Master|start-master' | grep -v grep
# check ports again
ss -ltnp | grep ':7077' || netstat -plnt | grep 7077
# open master web UI
curl -sS http://<ip address of the machine where the spark is installed>:8080 | sed -n '1,5p'
Expected: the master process is present, port 7077 is listening, and the web UI returns HTML (or visit http://<ip address of the machine where Spark is installed>:8080 in your browser).
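Once the standalone master and worker are running, you can point a PySpark application at the cluster instead of running in local mode. This is a minimal sketch, assuming the master URL spark://<hostname>:7077 noted above (replace <hostname> with your Spark master host):
from pyspark.sql import SparkSession

# Connect to the standalone master instead of running in local[*] mode
spark = (
    SparkSession.builder
    .appName("Standalone Cluster Test")
    .master("spark://<hostname>:7077")  # replace <hostname> with your master's host
    .getOrCreate()
)

# A tiny built-in dataset, so no external input files are needed
df = spark.range(10)
print("Row count:", df.count())

spark.stop()
Run it with the same spark-submit binary as before; while it runs, the application should also appear under Running Applications in the master web UI on port 8080.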