The following example pipeline demonstrates how to execute a PySpark
script on a Linux Groundplex to count the frequency of each word in a set of text
files. The script uses the PySpark framework to distribute the computation across a
Spark cluster: it reads the text files, splits each line into words, and counts the
occurrences of each word using map-reduce operations.
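
To see the same map-reduce flow in isolation, the minimal sketch below runs the flatMap, map, and reduceByKey steps against a small in-memory dataset instead of text files on a cluster. It is illustrative only and is not part of the downloadable pipeline; it assumes a local SparkContext, and the sample sentences are made up.

from operator import add
from pyspark import SparkContext

# Illustrative only: a local SparkContext and a tiny in-memory dataset
# stand in for the Spark cluster and the input text files used by the pipeline.
sc = SparkContext("local[2]", "wordcount-sketch")
lines = sc.parallelize(["spark counts words", "spark counts every word"])

counts = (lines.flatMap(lambda x: x.split(' '))  # split each line into words
               .map(lambda x: (x, 1))            # pair each word with a count of 1
               .reduceByKey(add))                 # sum the counts for each word

for word, count in counts.collect():
    print("%s: %i" % (word, count))

sc.stop()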

Download this Pipeline.
- Configure the PySpark script Snap as follows:
- Click the Edit PySpark script button and provide the following script. The script reads
text files, splits lines into words using flatMap, maps each word to a tuple, reduces by
key to count word frequencies, and collects the results to print word counts. The Spark
application runs on Spark 3.5.6 with Hadoop 3 binaries.
import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    # Expect the Spark master and the input file path as arguments.
    if len(sys.argv) < 3:
        print("Usage: wordcount <master> <file>", file=sys.stderr)
        sys.exit(-1)

    sc = SparkContext()
    # Read the input text file as an RDD of lines.
    lines = sc.textFile(sys.argv[2], 1)
    # Split lines into words, pair each word with 1, and sum the pairs per word.
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))
- Validate the Snap. On validation, the Snap displays the following output.
To successfully reuse pipelines:
- Download and import the pipeline into SnapLogic.
- Configure Snap accounts as applicable.
- Provide pipeline parameters as applicable.