Word count using PySpark script

The following example pipeline demonstrates how to run a PySpark script on a Linux Groundplex to count the frequency of each word in a set of text files. The script uses the PySpark framework to distribute the computation across a Spark cluster: it reads the text files, splits each line into words, and counts the occurrences of each word using map-reduce operations.
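Before looking at the full script, the following minimal sketch (not part of the pipeline) shows the same flatMap, map, and reduceByKey chain applied to a small in-memory dataset. It assumes PySpark is installed on the local machine and uses a local[*] master purely for illustration.

    from operator import add
    from pyspark import SparkContext

    # Illustrative sketch only; assumes a local PySpark installation.
    sc = SparkContext("local[*]", "WordCountSketch")

    lines = sc.parallelize(["to be or not to be", "to be is to do"])
    counts = (lines
              .flatMap(lambda x: x.split(' '))   # "to be ..." -> "to", "be", ...
              .map(lambda x: (x, 1))             # "to" -> ("to", 1)
              .reduceByKey(add))                 # ("to", 1), ("to", 1), ... -> ("to", 4)
    print(counts.collect())                      # e.g. [("to", 4), ("be", 3), ("or", 1), ...]
    sc.stop()

The pipeline below performs the same computation, but reads its input from text files and is executed through the PySpark Script Snap.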



Download this Pipeline.
  1. Configure the PySpark Script Snap as follows:


  2. Click the Edit PySpark script button and provide the following script. The script reads the input text file, splits each line into words using flatMap, maps each word to a (word, 1) tuple, reduces by key to sum the counts per word, and collects the results to the driver to print each word with its count. The Spark application runs on Spark 3.5.6 with Hadoop 3 binaries.
    import sys
    from operator import add
    from pyspark import SparkContext

    if __name__ == "__main__":
        if len(sys.argv) < 3:
            print("Usage: wordcount <master> <file>", file=sys.stderr)
            sys.exit(-1)
        sc = SparkContext()
        # Read the input text file (second argument) as an RDD of lines.
        lines = sc.textFile(sys.argv[2], 1)
        # Split each line into words, emit a (word, 1) pair per word,
        # and sum the counts for each word.
        counts = lines.flatMap(lambda x: x.split(' ')) \
                      .map(lambda x: (x, 1)) \
                      .reduceByKey(add)
        # Collect the results to the driver and print each word with its count.
        output = counts.collect()
        for (word, count) in output:
            print("%s: %i" % (word, count))
  3. Validate the Snap. On validation, the Snap displays the following output. A hypothetical illustration of this output appears after these steps.


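The exact output depends on the contents of the input file. As a purely hypothetical illustration, an input file containing the single line to be or not to be would make the script print output along the following lines (the order of words returned by collect() is not guaranteed):

    to: 2
    be: 2
    or: 1
    not: 1
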
To successfully reuse pipelines:
  1. Download and import the pipeline into SnapLogic.
  2. Configure Snap accounts as applicable.
  3. Provide pipeline parameters as applicable.