Shuffle

To randomize the order of the rows in the incoming dataset

Overview

The Shuffle Snap is a Flow-type Snap that enables you to randomize the order of the rows in an incoming dataset. The Snap can be tuned for large datasets by configuring the maximum percentage of memory that can be used to buffer the dataset; if that limit is exceeded, the dataset is written to a temporary file in local storage. You can assign a random integer as the seed value; a seed value is a number that identifies a particular randomized order, and the Snap produces the same randomized order for the same seed value. This Snap, along with the Sample Snap, is helpful for randomizing a sample dataset for further analysis.
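
The seed behavior can be pictured with a short sketch (Python, purely illustrative; the dictionary row structure and the use of Python's random module are assumptions, not the Snap's internals): shuffling the same rows twice with the same seed yields the same order.

    # Minimal sketch of a seeded shuffle; not the Snap's implementation.
    import random

    rows = [{"id": i} for i in range(10)]

    def shuffle_rows(rows, seed=None):
        """Return a new list with the rows in a reproducibly random order."""
        shuffled = list(rows)                  # leave the incoming dataset untouched
        random.Random(seed).shuffle(shuffled)  # same seed -> same permutation
        return shuffled

    # The same seed value reproduces the same randomized order.
    assert shuffle_rows(rows, seed=12345) == shuffle_rows(rows, seed=12345)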



Prerequisites

None.

Limitations and known issues

None.

Snap views

Input: This Snap has exactly one input document containing row data.
Output: This Snap has exactly one output document with randomized row order.
Error:

Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle errors that the Snap might encounter while running the pipeline by choosing one of the following options from the When errors occur list under the Views tab:

  • Stop Pipeline Execution: Stops the current pipeline execution when an error occurs.
  • Discard Error Data and Continue: Ignores the error, discards that record, and continues with the remaining records.
  • Route Error Data to Error View: Routes the error data to an error view without stopping the Snap execution.

Learn more about Error handling in Pipelines.

Snap settings

Legend:
  • Expression icon: JavaScript syntax to access SnapLogic Expressions to set field values dynamically (if enabled). If disabled, you can provide a static value. Learn more.
  • SnapGPT: Generates SnapLogic Expressions based on natural language using SnapGPT. Learn more.
  • Suggestion icon: Populates a list of values dynamically based on your Account configuration.
  • Upload: Uploads files. Learn more.
Learn more about the icons in the Snap settings dialog.
Label (String)

Required. Specify a unique name for the Snap. Modify this to be more descriptive, especially if the pipeline contains more than one Snap of the same type.

Default value: Shuffle

Example: Shuffle input rows
Use Random Seed (Checkbox): When selected, the Snap uses a fixed random seed so that it produces the same randomized order across runs, ensuring reproducibility (for example, for model training and evaluation).

Default status: Deselected

Random Seed (Integer/Expression): Required. Specify the number used to initialize the random number generator so that the randomized order is reproducible (for example, in model training).

Default value: 12345

Example: 500

Maximum memory % (Integer/Expression)

Required. The maximum portion of the node's memory, as a percentage, that can be used to buffer the incoming dataset. If this percentage is exceeded, the dataset is written to a temporary local file and the shuffled output is generated from that file.

This setting helps the Snap handle large datasets without over-utilizing the node's memory. The minimum memory used by the Snap is 100 MB.

Default value: 10

Snap execution (Dropdown list)
Select one of the three modes in which the Snap executes. Available options are:
  • Validate & Execute: Performs limited execution of the Snap and generates a data preview during pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during pipeline runtime.
  • Execute only: Performs full execution of the Snap during pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Default value: Execute only

Example: Validate & Execute

Temporary files

During execution, data processing on Snaplex nodes occurs principally in memory, as streaming data, and is unencrypted. When processing larger datasets that exceed the available memory, the Snap writes unencrypted pipeline data to local storage to optimize performance. These temporary files are deleted when the pipeline execution completes. You can configure the location of the temporary data in the Global properties table of the Snaplex's node properties, which can also help you avoid pipeline errors caused by a lack of available space. Learn more about Temporary Folder in Configuration Options.
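
As a rough, purely illustrative sketch of this buffer-or-spill idea (Python; the byte-based budget, JSON-lines format, and helper name are assumptions, since the Snap itself works from a percentage of node memory), rows are kept in memory until the budget is exceeded, after which the remainder is written to a temporary local file:

    # Minimal sketch of buffering rows in memory and spilling to a temporary
    # file once an assumed size budget is exceeded; not the Snap's internals.
    import json
    import tempfile

    def buffer_or_spill(row_iter, max_buffer_bytes=100 * 1024 * 1024):  # ~100 MB floor, as above
        in_memory, spill_file, used = [], None, 0
        for row in row_iter:
            line = json.dumps(row)
            used += len(line)
            if spill_file is None and used <= max_buffer_bytes:
                in_memory.append(row)                    # still within the memory budget
            else:
                if spill_file is None:                   # budget exceeded: start spilling
                    spill_file = tempfile.NamedTemporaryFile(
                        mode="w+", suffix=".jsonl", delete=False)
                spill_file.write(line + "\n")            # unencrypted temporary data on local storage
        return in_memory, spill_file

In this sketch, the shuffled output would then be produced from the in-memory buffer and the temporary file together, and the file would be removed when the run completes, mirroring the cleanup described above.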