Redshift - SCD2

Overview

This Snap provides the functionality of SCD (slowly changing dimension) Type 2 on a target Redshift table. The Snap executes one SQL lookup request per set of input documents to avoid making a request for every input record. Its output is typically a stream of documents for the Redshift - Bulk Upsert Snap, which updates or inserts rows into the target table. Therefore, this Snap must be connected to the Redshift - Bulk Upsert Snap to accomplish the complete SCD2 functionality.

ETL Transformations and Data Flow

This Snap enables the following ETL operations/flows:

  • Take the incoming document from the upstream snap and perform a lookup operation in the database, producing one or two documents in the output view.
  • If a record exists in the database (with the values provided in the input document), generate two output documents; otherwise generates only one.
  • Feed the output documents to the Redshift Bulk Upsert Snap, which will insert them into the destination table to preserve history.

Prerequisites

  • Redshift database installed.
  • Able to connect to the database from the desired plex nodes.
  • A valid Redshift Account with the required permissions.

Known Issues

If you use the PostgreSQL driver (org.postgresql.Driver) with the Redshift Snap Pack, it could result in errors if the data type provided to the Snap does not match the data type in the Redshift table schema. Either use the Redshift driver (com.amazon.redshift.jdbc42.Driver) or use the correct data type in the input document to resolve these errors.

Snap views

Type Description Examples of upstream and downstream Snaps
Input

This Snap has exactly one input view and expects documents in the view.

Output

This Snap allows zero or one output views and produces documents in the view.

Learn more about Error handling.

Examples

  1. Implement Slowly Changing Dimension Type 2: Implement Slowly Changing Dimension Type 2 with Redshift

Snap settings

Note: Learn about the common controls in the Snap settings dialog.
Field/Field set Description

Label

String

Required.

Required. Specify a unique name for the Snap. Modify this to be more appropriate, especially if more than one of the same Snaps is in the pipeline.

Default value: Redshift - SCD2

Example: Redshift - SCD2

Schema name

String/Expression/ Suggestion

The name of the database schema that contains the table whose data you want to update. Selecting a schema filters the Table name list to only show tables created in the selected schema. If you do not specify a schema in the Schema name field, the Table name field lists out all tables with the name you specify in all schemas in the database.

Default value: N/A

Warning: The values can be passed using the pipeline parameters but not the upstream parameter fields.

Table name

String/Expression/ Suggestion

Required.

The name of the table that contains the data you want to update.

Default value: N/A

Warning: The values can be passed using the pipeline parameters but not the upstream parameter fields.

Natural key

String/Expression/ Suggestion

Required.

Names of fields that identify a unique row in the target table. The identity key cannot be used as the Natural key, since a current row and its historical rows cannot have the same natural key value.

Default value: N/A

Example: id (Each record has to have a unique value.)

Warning: The values can be passed using the pipeline parameters but not the upstream parameter fields.

Cause-historization fields

String/Expression/ Suggestion

Required.

Names of fields where any change in values causes the historization of an existing row and the insertion of a new 'current' row.

Default value: N/A

Example: gold bullion rate

SCD Fields

String/Expression/ Suggestion

Required.

Enter the field names you want to use as dimension fields–or select them from the suggestion list. The columns in the table specified in the Table name field above are used to populate the suggestion list.

Example:

Meaning Field Value
Current row current_row 1
Historical row current_row 0
Start date start_date $start_date
End date end_date $end_date

Default value:

Meaning Field Value
Current row [None] 1
Historical row [None] 0
Start date [None] Date.now()
End date [None] Date.now()

Ignore unchanged rows

Checkbox

Specifies whether the Snap must ignore writing unchanged rows from the source table to the target table. If you enable this option, the Snap generates a corresponding document in the target only if the Cause-historization column in the source row is changed. Else, the Snap does not generate any corresponding document in the target.

Default value: Not selected

Snap execution

Dropdown list

Select an option to specify how the Snap must be executed. Available options are:

  • Validate & Execute: Performs limited execution of the Snap (up to 50 records) during Pipeline validation; performs full execution of the Snap (unlimited records) during Pipeline execution.
  • Execute only: Performs full execution of the Snap during Pipeline execution; does not execute the Snap during Pipeline validation.
  • Disabled: Disables the Snap and, by extension, its downstream Snaps.

Default value: Validate & Execute

Example: Execute only

Troubleshooting

Error Reason Resolution
type "e" does not exist This issue occurs due to incompatibilities with the recent upgrade in the Postgres JDBC drivers. Download the latest 4.1 Amazon Redshift driver here and use this driver in your Redshift Account configuration and retry running the Pipeline.