Profile

Overview

Use this Snap to compute statistics on the incoming data set and produce a statistical profile of each field. Each field can be either numerical or categorical. If a field has the wrong data type, you can use the Type Converter Snap to change it.


ML Analytics Profile Snap

Prerequisites

  • The input document cannot have a nested structure.
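
As a rough illustration of this prerequisite, the following Python sketch flags fields whose values are nested. It assumes that "nested" means a field value is itself an object or an array; the function name `nested_fields` is hypothetical and is not part of the Snap.

```python
def nested_fields(document: dict) -> list[str]:
    """Return the names of fields whose values are objects or arrays."""
    return [
        name
        for name, value in document.items()
        if isinstance(value, (dict, list))
    ]


# A flat document has no nested fields; the second document has one.
flat_doc = {"name": "Ana", "age": 31}
nested_doc = {"name": "Ana", "address": {"city": "Lima"}}

print(nested_fields(flat_doc))
print(nested_fields(nested_doc))
```

A document that returns a non-empty list here would need to be flattened before it reaches the Snap.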

Limitations and known issues

None.

Snap views

View Description Examples of upstream and downstream Snaps
Input This Snap supports a maximum of one document input view. It requires the data set as an input.
Output This Snap supports a maximum of two document output views. The first output view contains the statistical details of the data set. The statistics computed depend on the field type.
  • For categorical fields, the following is computed:
    • popular: The most popular value.
    • total: The total number of documents in the data set.
    • unique values: The number of unique values.
    • missing values: The number of whitespace-only values, null values, and missing values.
    • value distribution: The distribution of values, presented as value-frequency pairs. This is omitted if the Value distribution checkbox is not selected.
  • For numerical fields, the following is computed:
    • mean: Average value
    • min: Minimum value
    • max: Maximum value
    • sd: Standard deviation
    • popular: The bin that contains the most values (if binning is enabled) or the most popular value (if binning is disabled).
    • total: The total number of documents in the data set.
    • unique values: The number of unique values.
    • missing values: The number of missing values.
    • value distribution: The distribution of bins (if Number of bins is greater than 0) or values (if Number of bins is 0), presented as bin-frequency or value-frequency pairs. This is omitted if the Value distribution checkbox is not selected.
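
The statistics listed above can be sketched in Python as follows. This is a minimal illustration, not the Snap's implementation. It assumes that null values and empty or whitespace-only strings count as missing, that binning is disabled, and that the standard deviation is the population form; the source does not specify these details, and the function names are hypothetical.

```python
from collections import Counter
import statistics


def is_missing(value) -> bool:
    # Assumption: None and empty or whitespace-only strings count as missing.
    return value is None or (isinstance(value, str) and value.strip() == "")


def profile_categorical(values: list) -> dict:
    present = [v for v in values if not is_missing(v)]
    counts = Counter(present)
    return {
        "popular": counts.most_common(1)[0][0] if counts else None,
        "total": len(values),
        "unique values": len(counts),
        "missing values": len(values) - len(present),
        "value distribution": counts.most_common(),
    }


def profile_numerical(values: list) -> dict:
    # Binning is not applied here, so "popular" is the most frequent value.
    present = [v for v in values if not is_missing(v)]
    counts = Counter(present)
    return {
        "mean": statistics.fmean(present) if present else None,
        "min": min(present, default=None),
        "max": max(present, default=None),
        # Assumption: population standard deviation; the Snap may use the sample form.
        "sd": statistics.pstdev(present) if present else None,
        "popular": counts.most_common(1)[0][0] if counts else None,
        "total": len(values),
        "unique values": len(counts),
        "missing values": len(values) - len(present),
        "value distribution": counts.most_common(),
    }


colors = profile_categorical(["red", "blue", "red", " ", None, "green"])
scores = profile_numerical([2, 4, 4, 4, 5, 5, 7, 9, None])
print(colors)
print(scores)
```

In this sketch, total counts every document, including those where the field is missing, which matches the description of total as the number of documents in the data set.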

Second output view: When enabled, this view outputs an HTML file that presents the statistics from the first output view as graphs. If you select the Value distribution checkbox, the value distribution of each field is also included in the output.

Mapper
Error

Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter when running the pipeline by choosing one of the following options from the When errors occur list under the Views tab. The available options are:

  • Stop Pipeline Execution: Stops the current pipeline execution when an error occurs.
  • Discard Error Data and Continue: Ignores the error, discards that record, and continues with the remaining records.
  • Route Error Data to Error View: Routes the error data to an error view without stopping the Snap execution.

Learn more about Error handling in Pipelines.

Snap settings

Note:
  • Suggestion icon: Indicates a list that is dynamically populated based on the configuration.
  • Expression icon: Indicates whether the value is an expression (if enabled) or a static value (if disabled). Learn more about Using Expressions in SnapLogic.
  • Add icon (Plus Icon): Indicates that you can add fields in the field set.
  • Remove icon (Minus Icon): Indicates that you can remove fields from the field set.
Field / Field set Type Description
Label String

Required. Specify a unique name for the Snap. Use a more descriptive name, especially if the pipeline contains more than one Profile Snap.

Default value: Profile

Example: Customer data
Value distribution Checkbox

Select this checkbox to include the value distribution of the fields in the output.

Default status: Selected

Top values limit Integer/Expression Required. Specify the maximum number of value-frequency pairs to include in the value distribution.
Note:
  • This field applies to categorical fields. It also applies to numerical fields when binning is disabled.
  • For example, if the value is 2, the Snap lists the two most popular values in the data set along with the number of documents that contain each value.
  • Set to 0 to include all values.

Default value: 100

Example: 200
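
The effect of Top values limit can be approximated with the following Python sketch. The function name `top_values` is hypothetical, and the ordering of values with equal frequency is an assumption that the source does not address.

```python
from collections import Counter


def top_values(values: list, limit: int) -> list[tuple]:
    """Return up to `limit` value-frequency pairs, most frequent first."""
    counts = Counter(values)
    # A limit of 0 means "include all values"; most_common(None) returns every pair.
    return counts.most_common(limit if limit > 0 else None)


cities = ["Lima", "Oslo", "Lima", "Cairo", "Lima", "Oslo"]
print(top_values(cities, 2))
print(top_values(cities, 0))
```

With a limit of 2, only the two most popular values and their document counts are kept; with a limit of 0, every distinct value is listed.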

Number of bins Integer/Expression Required. Specify the number of bins. Binning splits the data space into N equally sized ranges, where N is the number of bins.
Note:
  • Applies only to numerical fields.
  • Set to 0 to disable binning.

Default value: 10

Example: 20
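
Equal-width binning can be sketched in Python as follows. This is an illustration under stated assumptions: the bins span the observed minimum to the observed maximum, and the maximum value is placed in the last bin. The source does not describe how the Snap sets bin edges, and the function name `bin_counts` is hypothetical.

```python
def bin_counts(values: list[float], num_bins: int) -> list[tuple[float, float, int]]:
    """Split [min, max] into num_bins equal ranges and count the values in each."""
    if num_bins < 1:
        # A value of 0 disables binning, so a caller would skip this step entirely.
        raise ValueError("num_bins must be at least 1")
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Assumption: the maximum value belongs to the last bin.
        idx = min(int((v - lo) / width), num_bins - 1) if width > 0 else 0
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(num_bins)]


bins = bin_counts([0, 1, 2, 3, 4, 5, 6, 7, 8, 10], 5)
for start, end, count in bins:
    print(f"[{start}, {end}): {count}")
```

Each tuple holds the lower edge, the upper edge, and the number of values in that bin, which corresponds to the bin-frequency pairs in the value distribution.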

Maximum memory % Integer/Expression Required. Specify the maximum percentage of the node's memory that the Snap can use to buffer the incoming data set.
Note:
  • If this limit is exceeded, the Snap writes the data set to a temporary local file. This helps the Snap handle large data sets without overusing the node's memory.
  • By default, the minimum memory used by the Snap is 100 MB.

Default value: 10

Example: 20
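
The buffer-then-spill behavior can be illustrated with Python's standard `tempfile.SpooledTemporaryFile`, which keeps data in memory until a size threshold is exceeded and then moves it to a temporary file on disk. This is only an analogy for the setting, not the Snap's implementation. Treating the 100 MB minimum as a lower bound on the buffer size is an assumption, and the names `buffer_limit` and `open_buffer` are hypothetical.

```python
import tempfile

# Assumption: the documented 100 MB minimum acts as a floor on the buffer size.
MIN_BUFFER_BYTES = 100 * 1024 * 1024


def buffer_limit(total_memory_bytes: int, max_memory_percent: int) -> int:
    """Convert a percentage of node memory into a byte limit for buffering."""
    return max(total_memory_bytes * max_memory_percent // 100, MIN_BUFFER_BYTES)


def open_buffer(limit_bytes: int):
    # Data stays in memory until more than limit_bytes has been written,
    # then it is rolled over to a temporary file on local disk.
    return tempfile.SpooledTemporaryFile(max_size=limit_bytes, mode="w+b")


limit = buffer_limit(8 * 1024**3, 10)  # 10% of an 8 GiB node
with open_buffer(limit) as buffer:
    buffer.write(b'{"name": "Ana", "age": 31}\n')
    buffer.seek(0)
    first_line = buffer.readline()
print(limit, first_line)
```

The total memory of the node is passed in as a parameter because the Python standard library has no portable call to read it.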

Snap execution Dropdown list
Select one of the three modes in which the Snap executes. Available options are:
  • Validate & Execute: Performs limited execution of the Snap and generates a data preview during pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during pipeline runtime.
  • Execute only: Performs full execution of the Snap during pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Default value: Validate & Execute

Example: Execute only

Temporary files

During execution, data processing on Snaplex nodes occurs primarily in memory as streaming data and is unencrypted. When processing larger data sets that exceed the available compute memory, the Snap writes unencrypted pipeline data to local storage to optimize performance. These temporary files are deleted when the pipeline execution completes. You can configure the location of the temporary data in the Global properties table of the Snaplex node properties, which can also help avoid pipeline errors caused by insufficient disk space. Learn more about Temporary Folder.

Examples