ETL pipeline: Extract from MySQL, load to Redshift, and delete records

This example demonstrates a complete ETL workflow: extracting records from a MySQL database, loading them into Redshift using Redshift - Bulk Load, deleting specific records based on conditions, and retrieving the remaining records with Redshift - Select. This pattern is common in data migration and cleansing scenarios.

The pipeline extracts data from MySQL, transforms it using a Mapper, bulk loads it into Redshift via S3 staging, deletes unwanted records, and verifies the results with a Select operation.

Before you begin, ensure that you have the following:

  • A valid MySQL account
  • A configured, valid Redshift account
  • An Amazon S3 account for staging files (required by the Redshift - Bulk Load Snap)
  • A source MySQL table with records to migrate
  • A target Redshift table (can be created automatically by the Bulk Load Snap)
  • Appropriate permissions for read operations on MySQL and write/delete operations on Redshift
  1. Configure the MySQL Select Snap to extract source data.

    Set the following properties:

    • Schema name: Specify the MySQL database schema.
    • Table name: Specify the source table to extract data from.
    • Where clause: Optionally specify a condition to limit the records extracted.
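
    The query issued by this step is equivalent to a plain SELECT with an optional WHERE clause. The sketch below illustrates that shape; the schema, table, and filter values are placeholders, not values from this pipeline.

    ```python
    # Hypothetical equivalent of the query the MySQL Select step issues once
    # the schema, table, and optional WHERE clause are configured.
    def build_select(schema: str, table: str, where: str = "") -> str:
        sql = f"SELECT * FROM {schema}.{table}"
        if where:
            sql += f" WHERE {where}"
        return sql

    # Placeholder schema/table/filter names for illustration only.
    print(build_select("sales", "customers", "created_date >= '2022-01-01'"))
    # SELECT * FROM sales.customers WHERE created_date >= '2022-01-01'
    ```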
  2. Configure the Mapper Snap to transform data.

    Map fields from the MySQL source format to the Redshift target format. This step handles any data type conversions or field name transformations required.
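    The kind of transformation the Mapper performs can be sketched in plain Python. The field names and formats below (cust_id, signup_ts, and so on) are assumptions for illustration, not fields from the actual pipeline.

    ```python
    # Illustration of a Mapper-style transformation: rename MySQL fields and
    # convert types to match the Redshift target schema. All field names and
    # formats here are hypothetical.
    from datetime import datetime

    def map_record(mysql_row: dict) -> dict:
        """Transform one MySQL record into the Redshift target shape."""
        return {
            "customer_id": int(mysql_row["cust_id"]),      # rename + cast to int
            "status": mysql_row["status"].lower(),          # normalize case
            "created_date": datetime.strptime(              # timestamp string -> date
                mysql_row["signup_ts"], "%Y-%m-%d %H:%M:%S"
            ).date().isoformat(),
        }

    source = {"cust_id": "42", "status": "ACTIVE", "signup_ts": "2022-06-01 09:30:00"}
    print(map_record(source))
    # {'customer_id': 42, 'status': 'active', 'created_date': '2022-06-01'}
    ```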

  3. Configure the Redshift - Bulk Load Snap.

    Set the following properties:

    • Schema name: Specify the Redshift schema (for example, public).
    • Table name: Specify the target Redshift table.
    • Create table if not present: Select this option to automatically create the table if it doesn't exist.
    • S3 folder: Specify the S3 staging location for temporary files.
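
    Under the hood, bulk loading follows the standard Redshift pattern: stage files in S3, then issue a single COPY command. The Snap automates this; the sketch below only illustrates the pattern, and the bucket, table, and IAM role values are placeholders.

    ```python
    # Sketch of the S3-staging pattern behind a Redshift bulk load: data files
    # are written to the configured S3 folder, then one COPY command loads them.
    # All names below are placeholders.
    def build_copy_statement(schema: str, table: str, s3_folder: str, iam_role: str) -> str:
        return (
            f"COPY {schema}.{table} "
            f"FROM 's3://{s3_folder}' "
            f"IAM_ROLE '{iam_role}' "
            "FORMAT AS CSV"
        )

    sql = build_copy_statement(
        "public", "customers", "my-bucket/staging/",
        "arn:aws:iam::123456789012:role/redshift-copy",
    )
    print(sql)
    ```

    A single COPY over staged files is why bulk loading outperforms row-by-row inserts: Redshift reads the staged files in parallel across its slices.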
  4. Configure the Redshift - Delete Snap to remove unwanted records.

    Set the following properties:

    • Schema name: Specify the same Redshift schema used in the Bulk Load.
    • Table name: Specify the same table name.
    • Delete condition: Specify the condition that identifies the records to delete. For example, status = 'inactive' removes inactive records, and created_date < '2023-01-01' removes records created before 2023.
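
    The effect of a delete condition such as status = 'inactive' can be pictured as partitioning the table: matching rows are removed, and the rest survive for the final Select. The sketch below simulates this in memory with made-up data.

    ```python
    # Illustration (not the Snap itself) of how a delete condition partitions
    # the table. The rows and field names are invented for this example.
    rows = [
        {"customer_id": 1, "status": "active",   "created_date": "2023-05-01"},
        {"customer_id": 2, "status": "inactive", "created_date": "2022-11-15"},
        {"customer_id": 3, "status": "active",   "created_date": "2021-03-20"},
    ]

    def matches_delete_condition(row: dict) -> bool:
        """Stand-in for the configured condition: status = 'inactive'."""
        return row["status"] == "inactive"

    remaining = [r for r in rows if not matches_delete_condition(r)]
    deleted_count = len(rows) - len(remaining)
    print(deleted_count, [r["customer_id"] for r in remaining])  # 1 [1, 3]
    ```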
  5. Configure the Redshift - Select Snap to verify results.

    Set the following properties:

    • Schema name: Specify the same Redshift schema.
    • Table name: Specify the same table name.
    • Optionally add filters to view specific subsets of the remaining data.
  6. Execute the pipeline.

    The pipeline executes the following workflow:

    1. Extracts records from MySQL
    2. Transforms data using the Mapper
    3. Bulk loads data into Redshift via S3 staging
    4. Deletes specific records based on the condition
    5. Retrieves and outputs the remaining records
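
    The five steps above can be traced end to end with a miniature simulation, using in-memory lists in place of MySQL, S3, and Redshift. All data and field names are invented.

    ```python
    # Miniature of the five-step workflow; each stage stands in for a Snap.

    # 1. Extract (stand-in for MySQL Select)
    mysql_rows = [
        {"cust_id": "1", "status": "ACTIVE"},
        {"cust_id": "2", "status": "INACTIVE"},
    ]

    # 2. Transform (stand-in for the Mapper)
    transformed = [
        {"customer_id": int(r["cust_id"]), "status": r["status"].lower()}
        for r in mysql_rows
    ]

    # 3. Load (stand-in for Bulk Load via S3 staging)
    redshift_table = list(transformed)

    # 4. Delete records matching the condition status = 'inactive'
    redshift_table = [r for r in redshift_table if r["status"] != "inactive"]

    # 5. Select and output the remaining records
    print(redshift_table)  # [{'customer_id': 1, 'status': 'active'}]
    ```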

When the pipeline completes successfully, it has performed the full ETL workflow, including the data-cleansing step. The output includes:

  • Records loaded into Redshift from MySQL
  • Status from the Delete operation showing the number of records removed
  • Final dataset from the Select operation showing the remaining records

Note: This pattern demonstrates how to combine multiple Redshift Snaps in a single pipeline for complex ETL workflows. The Bulk Load operation is significantly faster than individual inserts for large datasets, and the Delete operation allows for data cleansing as part of the same workflow.