Implement Slowly Changing Dimension Type 2

Overview

This example demonstrates how to use the Redshift SCD2 Snap to track historical changes in dimension tables using SCD Type 2 methodology. The Snap enables historization of changing data while preserving both current and historical records.

Prerequisites

Amazon Redshift cluster with appropriate access
Valid Redshift account configured in SnapLogic
Understanding of Slowly Changing Dimension concepts
Target table with appropriate SCD2 columns (start_date, end_date, active flag)

SCD Type 2 Concepts

Slowly Changing Dimension Type 2 (SCD2) is a technique for tracking historical changes in dimension data. When a record's data changes:

The existing record is marked as historical (active = false) and end-dated
A new record is inserted with the updated data (active = true) and a new start date
Both records share the same natural key but have different surrogate keys

This allows queries to access either current data or historical snapshots based on date ranges.

Pipeline Overview

The pipeline consists of two Snaps working together:

Redshift SCD2 Snap: Compares incoming data against the target table, identifying new and changed records
Redshift Bulk Upsert Snap: Performs the actual database operations to historize old records and insert new ones

Snap Configuration

Redshift SCD2 Snap - Settings:

Schema name: Target schema (e.g., public)
Table name: Target dimension table
Natural key: Columns that uniquely identify a business entity (e.g., station_id, zipcode)
Cause-historization fields: Columns whose changes trigger historization (e.g., premium, regular)
SCD Fields: Configure dimension tracking columns:
- Current row field: active (value: true for current, false for historical)
- Start date field: start_date (value: Date.now())
- End date field: end_date (value: Date.now() when historized, null for current)
Ignore unchanged rows: Optional - skip generating output for records with no changes

Redshift Bulk Upsert Snap - Settings:

Table name: Same as SCD2 Snap target table
Key columns: Must include at least one natural key field from SCD2 configuration

How It Works

When the pipeline executes:

The SCD2 Snap receives incoming data and performs a lookup in the target table using the natural key
For each input record:
- If no match found: Generates one output document for insertion as a new current record
- If match found and cause-historization fields changed: Generates two output documents:
  1. One to update the existing record (mark as historical, set end_date)
  2. One to insert the new record (mark as current, set start_date)
- If match found and no changes: Either skips or generates pass-through based on configuration
The Bulk Upsert Snap executes the updates and inserts generated by the SCD2 Snap

Example Data Flow

Initial Load:

Input: { "station_id": "U01", "zipcode": "94402", "premium": 4.13, "regular": 3.83 }
Output: One record marked as current (active=true, start_date=now, end_date=null)

After Price Update:

Input: { "station_id": "U01", "zipcode": "94402", "premium": 4.25, "regular": 3.98 }
Output: Two records:
  1. Update existing: active=false, end_date=now
  2. Insert new: active=true, start_date=now, end_date=null, premium=4.25, regular=3.98

Use Cases

This pattern is useful for:

Tracking price changes over time
Maintaining customer address history
Recording product attribute changes
Preserving organizational hierarchy changes
Any scenario requiring historical data snapshots

Best Practices

Choose natural keys that truly identify unique business entities
Select cause-historization fields carefully - only those whose changes matter
Ensure the target table has appropriate indexes on natural key columns
Use the Ignore unchanged rows option to reduce unnecessary processing
Test with small datasets to verify historization logic before full deployment
Consider data volume and retention policies when implementing SCD2