PySpark

Overview

You can use this Snap to execute a PySpark script. The Snap formats a spark-submit command, executes it in a command-line interface, and then monitors the execution status.
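At a high level, the command is assembled from the Snap settings described below (angle brackets indicate placeholders; the Snap manages the script file internally):

    <Spark home>/bin/<Spark Submit Command> <Spark submit args> <PySpark script> <Script args>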



Prerequisites

The Snap must be executed in a Groundplex on a Spark cluster node or an edge node.

Snap views

View Description Examples of upstream and downstream Snaps
Input If an upstream Snap is connected, this Snap executes once for each input document and produces a document in the output view or an error document in the error view. Each input document is used to evaluate expression properties in the Snap.
Output

If the script executes successfully with an exit code of 0, the Snap produces an output document containing the execution status. If the script writes to standard output, that output is also included in the output document. The Snap produces one output document for each execution of the PySpark script.

If the script fails (with an exit code other than 0), the Snap produces an error document in the error view.

Error

Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter when running the pipeline by choosing one of the following options from the When errors occur list under the Views tab. The available options are:

  • Stop Pipeline Execution: Stops the current pipeline execution when an error occurs.
  • Discard Error Data and Continue: Ignores the error, discards that record, and continues with the remaining records.
  • Route Error Data to Error View: Routes the error data to an error view without stopping the Snap execution.

Learn more about Error handling in Pipelines.

Snap settings

Legend:
  • Expression icon: Allows using pipeline parameters to set field values dynamically (if enabled). SnapLogic Expressions are not supported. If disabled, you can provide a static value.
  • SnapGPT: Generates SnapLogic Expressions based on natural language using SnapGPT. Learn more.
  • Suggestion icon: Populates a list of values dynamically based on your Snap configuration. You can select only one attribute at a time using the icon. Type into the field if it supports a comma-separated list of values.
  • Upload: Uploads files. Learn more.
Learn more about the icons in the Snap settings dialog.
Field / Field set Type Description
Label String

Required. Specify a unique name for the Snap. Modify this to be more appropriate, especially if there is more than one of the same Snap in the pipeline.

Default value: PySpark

Example: PySpark
Spark home String/Expression Specify the Spark home directory; the spark-submit command is located in its bin/ subdirectory. If this property is empty, the Snap tries to find a value for "SPARK_HOME" or "CDH_SPARK_HOME" in the environment variables or system properties.

Default value: None.

Example: /opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/spark

Spark Submit Command Dropdown list

Choose the Spark command to run your PySpark application on a cluster. The available options are:
  • spark-submit: This is the generic command-line tool to submit applications to Apache Spark.
  • spark2-submit: Specifically refers to the Apache Spark 2.x binary. Use this option when your pipeline relies on APIs or behaviors specific to Spark 2.
  • spark3-submit: Specifically refers to Apache Spark 3.x. Use this option when you want to take advantage of the new features introduced in Spark 3, such as adaptive query execution, new SQL functions, and improved ANSI compliance.
Note: Ensure the selected program is accessible under the bin folder of the Spark home directory. Learn more about the Spark submit command.

Default value: spark-submit

Example: spark2-submit
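To confirm which submit programs are available under the bin folder, you can list it from a shell on the Groundplex node (assuming the SPARK_HOME environment variable is set, as described in the Spark home field above):

    ls "$SPARK_HOME"/bin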
Spark submit args String/Expression Specify the arguments for the spark-submit command, if any.

Default value: None.

Example:
  • $sparkSubmitArgs
  • _sparkSubmitArgs
  • --master yarn --deploy-mode cluster (to submit the PySpark script to YARN)
  • --principal snaplogic/[email protected] --keytab /snaplogic.keytab.new (to submit the PySpark script to a Kerberos-enabled cluster)

Edit PySpark script Button Click this button to open an editor where you can edit the PySpark script and save your changes. A 'word-count' sample script is included with the Snap; to try it, enter the path to an input text file in the Script args property (see the sketch after the Script args field below). In the script editor, you can export or import a script, or generate a template as required. Learn more: RDD Programming Guide - Spark 4.0.0 Documentation
Script args String/Expression Specify the arguments for the PySpark script.

Default value: None.

Example: hdfs:///tmp/sample.txt hdfs:///tmp/output.dir/ (input file and output directory for the 'word-count' sample script)
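For reference, a minimal PySpark word-count script compatible with the Script args example above might look like the following. This is an illustrative sketch, not the exact sample script bundled with the Snap; it assumes the first argument is an input text file and the second is an output directory:

    import sys
    from pyspark import SparkContext

    if __name__ == "__main__":
        # argv[1] and argv[2] come from the Script args field:
        # an input text file and an output directory.
        sc = SparkContext(appName="word-count")
        lines = sc.textFile(sys.argv[1])
        counts = (lines.flatMap(lambda line: line.split())
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b))
        counts.saveAsTextFile(sys.argv[2])
        sc.stop()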

YARN RM (host:port) String/Expression Specify the hostname and port number of the YARN Resource Manager in 'host:port' format. This property is required to stop a PySpark job that is in progress.
Note: If YARN is not used to submit the PySpark script, stopping the Snap does not halt the job submitted to Spark.

Default value: None.

Example: rm01.hadoop.cluster:8032

Timeout (sec) Integer/Expression Specify the timeout limit in seconds. If the value is negative or empty, the Snap does not time out and waits until spark-submit returns a result.

Default value: -1

Example: 600 (10 minutes)

Snap execution Dropdown list
Choose one of the three modes in which the Snap executes. Available options are:
  • Validate & Execute: Performs limited execution of the Snap and generates a data preview during pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during pipeline runtime.
  • Execute only: Performs full execution of the Snap during pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Default value: Execute only

Example: Validate & Execute

Troubleshooting

The Snap produces an error document if a given PySpark script fails to execute. When troubleshooting, it can be helpful to run the script manually in a command-line interface on the Groundplex node where the Snap executes.
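For example, using the Spark home and Script args examples from this page, a manual test run might look like the following, where wordcount.py is a placeholder for a local copy of the script:

    /opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/spark/bin/spark-submit --master yarn --deploy-mode cluster wordcount.py hdfs:///tmp/sample.txt hdfs:///tmp/output.dir/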

Examples