PySpark Snap Setup for Linux

Overview

This document provides comprehensive instructions for setting up PySpark on a Linux system, including prerequisites, installation steps, examples, and troubleshooting.

Prerequisites
  • Python and Java: You'll need both Python 3 and Java installed on your system; a quick version check is sketched after this list.
    • For Spark 3.x, use Java 11
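
A quick way to confirm both prerequisites from one place (a minimal sketch; run it with python3):

import subprocess
import sys

# Print the Python version this environment will use
print("Python:", sys.version.split()[0])

# `java -version` writes its output to stderr
java = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(java.stderr.strip() or java.stdout.strip())
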
Pre-installation

Download Spark: Visit the Apache Spark archive and select the appropriate package (with or without Hadoop, depending on your system's configuration).

Install Spark
  1. Extract and set up Spark.
    $ tar -xvzf spark-3.5.6-bin-hadoop3.tgz
    $ mv spark-3.5.6-bin-hadoop3 /home/xyz/software/spark
  2. Configure Environment variables.
    1. Edit your .bashrc.
      $ nano ~/.bashrc
      
    2. Add the following at the end.
      export SPARK_HOME=/home/xyz/software/spark
      export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
  3. Reload.
    $ source ~/.bashrc
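
Once the shell is reloaded, you can confirm the variables are picked up with a quick Python check (a minimal sketch; run it with python3):

import os
import shutil

# SPARK_HOME should point at the extracted Spark directory
print("SPARK_HOME:", os.environ.get("SPARK_HOME"))

# spark-submit should now resolve via PATH
print("spark-submit:", shutil.which("spark-submit"))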
    
Run Commands
To interact with Spark, you can open either a Spark shell or a PySpark shell in your terminal and execute commands.
  • $ spark-shell: Opens a Scala-based Spark shell directly in your terminal, allowing you to run Spark commands interactively.
  • $ pyspark: Functions similarly to spark-shell, opening a PySpark shell for Python-based Spark interactions.
Examples: Basic commands

Here are some basic commands you can try in the PySpark shell:

# Simple data
data = [
    ("Alice", 25, "Engineer"),
    ("Bob", 30, "Data Scientist"),
    ("Charlie", 35, "Manager"),
    ("Diana", 28, "Analyst")
]
columns = ["name", "age", "role"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show the data
df.show()
# Show with specific number of rows
df.show(2)  # Show only 2 rows
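
A few more operations you can try on the same DataFrame (a minimal sketch; the column names match the example above):

# Filter rows on a condition
df.filter(df.age > 27).show()

# Project a subset of columns
df.select("name", "role").show()

# Aggregate: average age per role
df.groupBy("role").avg("age").show()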

With these steps, your PySpark installation is complete.

Post-installation steps

To run the PySpark Snap from the Control Plane on your Groundplex, follow the instructions below:

File permissions
The Groundplex process might run as a different user (check with ps -ef | grep snaplogic), so make sure that the user can read your Spark install and script:
$ sudo chmod +x /home/xyz/software/spark/bin/spark-submit
$ sudo chmod o+x /home/xyz
$ sudo chmod o+x /home/xyz/software
$ sudo chmod o+x /home/xyz/software/spark
$ sudo chmod o+x /home/xyz/software/spark/bin
$ sudo chmod +r /path/to/your/pyspark_script.py
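
To double-check that the Groundplex user can actually reach these files, you can run a quick check as that user (a minimal sketch; the paths below are the examples used above, so adjust them to your installation):

import os

# Example paths from above; adjust them to your installation
checks = [
    ("/home/xyz/software/spark/bin/spark-submit", os.X_OK, "executable"),
    ("/path/to/your/pyspark_script.py", os.R_OK, "readable"),
]

for path, mode, label in checks:
    status = "OK" if os.access(path, mode) else f"NOT {label}"
    print(f"{path}: {status}")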

Run the Scripts

Spark Home: This specifies the path to your Spark installation, the directory that contains the bin folder.

Default Script: The default script appears when you drag and drop the PySpark Snap onto the designer and click Edit PySpark Script.

Tip: The default script is written for Python 2. Since we are using Python 3, which is compatible with Spark, you need to modify the script as shown below.
  • Old script

    <The old script content would be here if provided>

  • Modified Script

    Python

    import sys
    from operator import add
    from pyspark import SparkContext

    if __name__ == "__main__":
        if len(sys.argv) < 3:
            print("Usage: wordcount <master> <file>", file=sys.stderr)
            sys.exit(-1)

        sc = SparkContext(sys.argv[1], "WordCount")
        lines = sc.textFile(sys.argv[2], 1)

        counts = (
            lines.flatMap(lambda x: x.split(" "))
                 .map(lambda x: (x, 1))
                 .reduceByKey(add)
        )

        output = counts.collect()
        for (word, count) in output:
            print(f"{word}: {count}")

    Since the default script expects arguments, specifically an input text file for word counting, you need to specify the path of that file in the Script args field.

Terminal Command:

$ /home/xyz/software/spark/bin/spark-submit /home/xyz/spark_scripts/word_count.py local[*] /home/xyz/spark_scripts/sample.txt

Output

Hello: 4
my: 6
name: 6
is: 6
john: 3
doe: 3
bye: 3
i: 1
am: 1
working: 1
in: 2
xyz: 1
Solutions: 1
MTS-I: 1
role.: 1
Tip:
  • The PySpark Snap includes the default script, which you can open by clicking the Edit PySpark Script button. To execute the terminal command above, you must manually create a file named "word_count.py" in your scripts directory containing that code.

  • sample.txt is the input file; the program reads it and counts the words it contains.

Custom Script

To execute a custom script from your file system, specify its path in the Spark submit args field.

Sample script (save it as sample_script.py and execute it from the terminal, or provide it as a custom script to the PySpark Snap):
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder \
    .appName("Ubuntu Virtual Env Test") \
    .getOrCreate()
print(f"Spark Version: {spark.version}")
print(f"Python Version: {spark.sparkContext.pythonVer}")
# Create sample data
data = [(1, "Hello"), (2, "PySpark"), (3, "Ubuntu")]
df = spark.createDataFrame(data, ["id", "message"])
print("\nSample Data:")
df.show()
# Stop session
spark.stop()
print("Test completed successfully!")

Terminal Command (after saving the above file as sample_script.py)

$ /home/xyz/software/spark/bin/spark-submit /home/xyz/spark_scripts/sample_script.py
Output
Spark Version: 3.5.6
Python Version: 3.12
Sample Data:
+---+-------+
| id|message|
+---+-------+
|  1|  Hello|
|  2|PySpark|
|  3| Ubuntu|
+---+-------+

Start Spark Master and Worker (Optional)

Start Spark Master

$ $SPARK_HOME/sbin/start-master.sh

Note the master URL from the console output or the web UI (default: spark://<hostname>:7077).

Start Spark Worker (connect it to master)
$ $SPARK_HOME/sbin/start-worker.sh spark://<hostname>:7077
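
Once the master and worker are up, a PySpark script can target the standalone cluster instead of local mode. A minimal sketch (the app name is arbitrary; replace <hostname> with the host from your master URL):

from pyspark.sql import SparkSession

# Point the session at the standalone master started above
spark = (
    SparkSession.builder
    .appName("StandaloneClusterTest")  # arbitrary name for this sketch
    .master("spark://<hostname>:7077")
    .getOrCreate()
)

print("Running against:", spark.sparkContext.master)
spark.stop()
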
To stop and restart Spark
Stop any old processes, then start the master and worker fresh:
# stop any running spark processes (safe to run)
$ $SPARK_HOME/sbin/stop-all.sh
# To stop only the worker (use this when you want to stop just the worker rather than everything)
$ $SPARK_HOME/sbin/stop-worker.sh 
# (optional) clear stale work dir that can cause weird state
$ rm -rf $SPARK_HOME/work/* /tmp/spark-* 2>/dev/null || true
# start master
$ $SPARK_HOME/sbin/start-master.sh
# start worker and connect to master (use the exact hostname from the master URL)
$ $SPARK_HOME/sbin/start-worker.sh spark://<hostname>:7077

Confirm the master started and get its URL

After starting, run:

# check processes
$ ps -ef | grep -E 'org.apache.spark.deploy.master.Master|start-master' | grep -v grep
# check that port 7077 is listening
$ ss -ltnp | grep ':7077' || netstat -plnt | grep 7077
# open master web UI
$ curl -sS http://<ip address of the Spark machine>:8080 | sed -n '1,5p'

Expected: The master process is present, port 7077 is listening, and the web UI returns HTML (or visit http://<ip address of the Spark machine>:8080 in your browser).

Example: