Principal Component Analysis (PCA)

Perform Principal Component Analysis (PCA) on an input document

This Snap performs Principal Component Analysis (PCA) on an input document and outputs a document containing fewer dimensions (or columns). PCA is a dimension-reduction technique that reduces a large set of variables to a smaller set that still contains most of the information in the original set. In simple terms, PCA finds common factors in a given dataset and ranks them in order of importance. The first dimension in the output document therefore accounts for as much of the variance in the data as possible, and each subsequent dimension accounts for as much of the remaining variance as possible. Because the output has fewer dimensions, downstream Snaps have less data to process, which can make them run faster.
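
The variance ranking described above can be illustrated with a minimal NumPy sketch. This is not the Snap's implementation; the function name `explained_variance_ratio` and the randomly generated sample data are assumptions made purely for illustration. The sketch computes the share of total variance captured by each principal component and shows that the shares come out in descending order.

```python
import numpy as np

def explained_variance_ratio(data):
    """Return the share of total variance captured by each principal component."""
    centered = data - data.mean(axis=0)          # PCA works on mean-centered columns
    cov = np.cov(centered, rowvar=False)         # covariance between the columns
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order; reverse it
    return eigenvalues / eigenvalues.sum()

# Hypothetical data: three columns, where the third is almost a copy of the first.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(size=200)
data = np.column_stack([x, y, x + 0.01 * rng.normal(size=200)])

ratios = explained_variance_ratio(data)
print(ratios)  # largest share first; the last share is close to zero
```

Because the third column carries almost no information beyond the first, the last component explains very little variance and could be dropped with minimal loss.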

PCA is widely used for tasks such as data compression, exploratory data analysis, and pattern recognition. For example, you can use PCA to identify patterns that help you isolate the species of flowers that are more closely related to each other than to the rest.

How does it work?

The PCA Snap performs two tasks:

  1. It analyzes data in the input document and creates a model that:
    1. Reduces the number of dimensions in the input document to the number of dimensions specified in the Snap.
    2. Retains the amount of variance specified in the Snap.
  2. It runs the model created in the previous step on the input data and emits a document containing the processed output. This output offers a simplified view of the data, which makes it easier to identify patterns in it.
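
The two tasks above can be sketched as a fit step and an apply step. This is a minimal NumPy sketch, not the Snap's actual algorithm. The function names `fit_pca` and `apply_pca`, the dictionary layout of the model, and the sample data are all hypothetical. The sketch also assumes one particular way of combining the two settings: it keeps the smallest number of components that reaches the requested variance and caps that number at the requested dimension. The Snap may resolve the two settings differently.

```python
import numpy as np

def fit_pca(data, max_dimensions=10, min_variance=0.95):
    """Task 1: build a model (column means plus a projection matrix)."""
    mean = data.mean(axis=0)
    cov = np.cov(data - mean, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending order
    order = np.argsort(eigenvalues)[::-1]             # reorder to descending
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Assumption: smallest component count that reaches min_variance,
    # capped at max_dimensions and at the number of input columns.
    needed = int(np.searchsorted(cumulative, min_variance)) + 1
    k = min(needed, max_dimensions, data.shape[1])
    return {"mean": mean, "components": eigenvectors[:, :k]}

def apply_pca(model, data):
    """Task 2: run the model on data to get the lower-dimensional output."""
    return (data - model["mean"]) @ model["components"]

# Hypothetical data: four columns driven by two underlying factors.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
noise = lambda: 0.05 * rng.normal(size=300)
data = np.column_stack([a, a + noise(), b, b + noise()])

model = fit_pca(data, max_dimensions=10, min_variance=0.95)
reduced = apply_pca(model, data)
capped = apply_pca(fit_pca(data, max_dimensions=1), data)
print(reduced.shape, capped.shape)
```

Keeping the model as a separate object is what makes it possible to fit once and apply the same transformation to other data later, for example `apply_pca(model, new_rows)`.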

Principal Component Analysis Snap Overview

  • Transform-type Snap
  • Works in Ultra Tasks only when the Snap has two input views and one output view.

Prerequisites

  • The input data must be in a tabular format.

Limitations

  • The PCA Snap does not work with data containing nested structures.

Known issues

None.

Snap views

View Description Examples of upstream and downstream Snaps
Input This Snap has at most two document input views.
  1. Required. A document containing data that has numeric fields.
  2. A document containing the model (or mathematical formula that performs a transformation on the input data) that you want the PCA Snap to use on the data coming in through the first input. If you do not provide the model, the PCA Snap builds a model that is best suited for the input data provided through the first input.
  1. First input view:
  2. Second input view:
Output This Snap has at most two document output views:
  1. A document containing transformed data with fewer (lower) dimensions.
  2. A document containing the model that the PCA Snap created and used on the input data. If you supply the Snap with the model (created using a PCA Snap earlier) that you want to use, the Snap does not output the model.
Error

Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter when running the pipeline by choosing one of the following options from the When errors occur list under the Views tab. The available options are:

  • Stop Pipeline Execution: Stops the current pipeline execution when an error occurs.
  • Discard Error Data and Continue: Ignores the error, discards that record, and continues with the remaining records.
  • Route Error Data to Error View: Routes the error data to an error view without stopping the Snap execution.

Learn more about Error handling in Pipelines.

Snap settings

Legend:
  • Expression icon: JavaScript syntax to access SnapLogic Expressions to set field values dynamically (if enabled). If disabled, you can provide a static value.
  • SnapGPT: Generates SnapLogic Expressions based on natural language using SnapGPT.
  • Suggestion icon: Populates a list of values dynamically based on your Account configuration.
  • Upload: Uploads files.
Learn more about the icons in the Snap settings dialog.
Field / field set Type Description
Label String

Required. Specify a unique name for the Snap. Modify the name to be more descriptive, especially if the pipeline contains more than one Snap of the same type.

Default value: Principal Component Analysis (PCA)

Example: PCA
Dimension String/Expression

Required. The maximum number of dimensions or columns that you want in the output.

Minimum value: 0

Maximum value: Undefined

Default value: 10

Variance String/Expression

Required. The minimum fraction of the input data's variance that you want to retain in the output documents, expressed as a value between 0 and 1.

Minimum value: 0

Maximum value: 1

Default value: 0.95

Pass through Checkbox

Select this checkbox to include all the categorical input fields in the output.

Default status: Selected
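
The effect of the Dimension and Pass through settings can be sketched on a few records. This is a hedged illustration, not the Snap's implementation. The function `reduce_documents`, the output field names `pc0` and `pc1`, and the sample records are hypothetical. The sketch treats any non-numeric field as categorical and, for brevity, ignores the Variance setting.

```python
import numpy as np

def reduce_documents(docs, dimension=10, pass_through=True):
    """Reduce numeric fields with PCA; optionally copy the other fields through."""
    numeric_keys = [k for k, v in docs[0].items()
                    if isinstance(v, (int, float)) and not isinstance(v, bool)]
    other_keys = [k for k in docs[0] if k not in numeric_keys]
    matrix = np.array([[d[k] for k in numeric_keys] for d in docs], dtype=float)
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(dimension, vt.shape[0])
    scores = centered @ vt[:k].T
    output = []
    for doc, row in zip(docs, scores):
        out = {f"pc{i}": float(v) for i, v in enumerate(row)}  # hypothetical field names
        if pass_through:
            out.update({key: doc[key] for key in other_keys})  # keep categorical fields
        output.append(out)
    return output

# Hypothetical flower measurements with one categorical field.
docs = [
    {"sepal_length": 5.1, "petal_length": 1.4, "petal_width": 0.2, "species": "setosa"},
    {"sepal_length": 7.0, "petal_length": 4.7, "petal_width": 1.4, "species": "versicolor"},
    {"sepal_length": 6.3, "petal_length": 6.0, "petal_width": 2.5, "species": "virginica"},
    {"sepal_length": 4.9, "petal_length": 1.5, "petal_width": 0.1, "species": "setosa"},
]

with_labels = reduce_documents(docs, dimension=2, pass_through=True)
without_labels = reduce_documents(docs, dimension=2, pass_through=False)
print(sorted(with_labels[0]), sorted(without_labels[0]))
```

With pass through enabled, each output record keeps its `species` value alongside the reduced numeric fields, which is useful when a downstream step needs the label.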