Redshift - S3 Upsert

Overview

The Redshift - S3 Upsert Snap performs upsert operations using Amazon S3 as a staging area. This Snap is optimized for large-scale data upsert operations.
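
At a high level, the Snap stages the incoming documents as files in S3 and then merges them into the target table inside Redshift. The SQL below is a minimal, hypothetical sketch of that staging-table upsert pattern; the table, column, bucket, and IAM role names are placeholders, and the Snap's actual internal statements may differ.

    -- Minimal sketch of the staging-table upsert pattern (placeholder names).
    CREATE TEMP TABLE people_stage (LIKE public.people);

    -- Load the staged file(s) from S3 into the temporary table.
    COPY people_stage
    FROM 's3://my-bucket/staging/people.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    CSV;

    BEGIN;
    -- Delete target rows whose key matches a staged row ...
    DELETE FROM public.people
    USING people_stage
    WHERE public.people.id = people_stage.id;

    -- ... then insert every staged row.
    INSERT INTO public.people
    SELECT * FROM people_stage;
    COMMIT;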

Prerequisites

  • A valid Redshift Account configured with the S3 properties and the required permissions.
  • IAM Roles for Amazon EC2

    The 'IAM_CREDENTIAL_FOR_S3' feature is used to access S3 files from an EC2 Groundplex without providing the Access-key ID and Secret key in the AWS S3 account settings of the Snap. Instead, the IAM credential stored in the EC2 metadata is used to gain access rights to the S3 buckets. To enable this feature, set the following Global property (Key-Value parameter) and restart the JCC:

    jcc.jvm_options = -DIAM_CREDENTIAL_FOR_S3=TRUE

    This feature is supported in the EC2-type Groundplex only. Learn more.

Limitations

  • The S3 Bucket, S3 Access-key ID, and S3 Secret key properties are required for this Snap.
  • The S3 Folder property may be used for the staging file. If the S3 Folder property is left blank, the staging file will be stored in the bucket.
  • If you enable an error view and expect all failed records to be routed to it, increase the value in the Maximum error count field.

Snap views

Type Description

Input

This Snap has one input view for the data and a second, optional input view for the target table schema.

Output

This Snap has at most one output view.

Error

Learn more about Error handling.

Examples

Examples for this Snap are coming soon.

Account & Access

This Snap uses account references created on the Accounts page of SnapLogic Manager to handle access to this endpoint. The S3 Bucket, S3 Access-key ID, and S3 Secret key properties are required for the Redshift - S3 Upsert Snap. The S3 Folder property may be used for the staging file. If the S3 Folder property is left blank, the staging file is stored in the bucket. See Redshift Account for information on setting up this type of account.

Redshift IAM Account Setup

If the EC2 plex (where your pipeline is running with an IAM role), Redshift cluster, and S3 bucket are in the same AWS account, then you must use the SnapLogic Redshift Account (regular IAM Account).

If the EC2 plex (where your pipeline is running with an IAM role) is in one account and the Redshift cluster and S3 bucket are in a different AWS account, you must use the SnapLogic Redshift Cross-Account IAM Role Account to run your pipelines successfully.

This applies only to the Redshift - Bulk Load, Redshift - Unload, and Redshift - S3 Upsert Snaps.

Snap settings

Note: Learn about the common controls in the Snap settings dialog.
Field/Field set Description

Label

String

Required. Specify a unique name for the Snap. Modify this to be more descriptive, especially if there is more than one of the same Snap in the pipeline.

Default value: Redshift - S3 Upsert

Example: Redshift - S3 Upsert

Schema name

String/Expression/Suggestion

Required.

The database schema name. Selecting a schema filters the Table name list to show only those tables within the selected schema.

Warning: The values can be passed using the pipeline parameters but not the upstream parameter.

Default value: [None]

Example: schema123

Table name

String/Expression/Suggestion

Required.

Specify the table on which to execute the upsert operation. The value can be provided in the format <schema>.<table_name> or <table_name>. The suggestion list retrieves the tables available under the schema (if specified).

Warning: The values can be passed using the pipeline parameters but not the upstream parameter.

Default value: [None]

Example:

  • people
  • "public"."people"

Key columns

String/Expression/Suggestion

Required.

Columns to use to check for existing entries in the target table.

Default value: [None]

Example: id

S3 file list

String/Expression

Required.

List of S3 files to be loaded into the target table as file names or as expressions.

Default value: [None]

Example: s3:///testing/testtablefors3.csv

IAM Role

Checkbox

Select this property if the bulk load/unload needs to be performed using an IAM role. If selected, ensure the properties (AWS account ID, role name and region name) are provided in the account.

Default value: Not selected

Server-side encryption

Checkbox

Defines the S3 encryption type to use when temporarily uploading the documents to S3 before the insert into Redshift.

Default value: Not selected

KMS Encryption type

Dropdown list

Specifies the type of KMS S3 encryption to be used on the data. The available encryption options are:

  • None - Files do not get encrypted using KMS encryption
  • Server-Side KMS Encryption - If selected, the output files on Amazon S3 are encrypted using server-side encryption with an Amazon S3-generated KMS key.
Note: If both the KMS and Client-side encryption types are selected, the Snap gives precedence to the server-side KMS encryption and displays an error prompting you to select only one of the options.

Default value: None

KMS key

String/Expression

Conditional. This property applies only when the KMS Encryption type is set to Server-Side KMS Encryption. This is the KMS key to use for the S3 encryption. For more information about the KMS key, refer to AWS KMS Overview and Using Server Side Encryption.

Default value: [None]

Truncate data

Checkbox

Truncate existing data before performing data load.

Note: With the Bulk Update Snap, instead of doing truncate and then update, a Bulk Insert would be faster.

Default value: Not selected

Update statistics

Checkbox

Update table statistics after data load by performing an analyze operation on the table.

Default value: Not selected

Accept invalid characters

Checkbox

Accept invalid characters in the input. Invalid UTF-8 characters are replaced with a question mark when loading.

Default value: Selected

Maximum error count

Integer

Required.

The maximum number of rows that can fail before the bulk load operation is stopped. By default, the load stops on the first error.

Default value: 100

Example: 10 (if you want the pipeline execution to continue as long as the number of failed records is less than 10)
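
Assuming this field maps to the MAXERROR option of the Redshift COPY command (an assumption; the mapping is not documented here), the relevant part of a generated statement could look like this hypothetical fragment:

    -- Hypothetical COPY fragment; table, file, and role names are placeholders.
    COPY people_stage
    FROM 's3://my-bucket/staging/people.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    CSV
    MAXERROR 10;  -- tolerate some failed rows before the whole load is aborted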

Truncate columns

Checkbox

Truncate column values which are larger than the maximum column length in the table.

Default value: Selected

Load empty strings

Checkbox

If selected, empty string values in the input documents are loaded as empty strings to the string-type fields. Otherwise, empty string values in the input documents are loaded as null. Null values are loaded as null regardless.

Default value: Not selected

Compression format

Dropdown list

The format in which the provided S3 files are compressed. The available options are:

  • Uncompressed
  • GZIP
  • BZIP2
  • LZOP

Default value: Uncompressed

Example: GZIP

File type

Dropdown list

The type of the input files. The available options are:

  • CSV
  • JSON
  • AVRO
  • Undefined

Default value: CSV

Example: JSON

Ignore header

Integer

Required.

Treats the specified number of rows as file headers and does not load them.

Default value: 0

Example: 1

Delimiter

String/Expression

The single ASCII character that is used to separate fields in the input file, such as a pipe character ( | ), a comma ( , ), or a tab ( \t ). Non-printing ASCII characters are supported, and ASCII characters can also be represented in octal using the format '\ddd', where 'd' is an octal digit (0-7). The default delimiter is a pipe character ( | ), unless the CSV file type is used, in which case the default delimiter is a comma ( , ). DELIMITER cannot be used with FIXEDWIDTH.

Default value: pipe character ( | )

Example: ,
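
For instance, a non-printing delimiter expressed in octal notation might appear in the underlying COPY options as in this hypothetical fragment (file, table, and role names are placeholders):

    -- Hypothetical COPY fragment using an octal-notation delimiter;
    -- '\001' is the non-printing SOH character.
    COPY people_stage
    FROM 's3://my-bucket/staging/people.dat'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    DELIMITER '\001';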

Additional options

String/Expression

Additional options to be passed to the COPY command.

Note: Refer to AWS Amazon - COPY documentation for available options.

Default value: [None]

Example: ACCEPTANYDATE
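
Options entered here are presumably appended to the generated COPY statement, roughly as in this hypothetical fragment (table, file, and role names are placeholders):

    -- Hypothetical COPY fragment with an additional option appended.
    COPY people_stage
    FROM 's3://my-bucket/staging/people.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    CSV
    ACCEPTANYDATE;  -- accept any date format without raising a load error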

Vacuum type

Dropdown list

Reclaims space and sorts rows in the specified table after the upsert operation. The available options are FULL, SORT ONLY, DELETE ONLY, and REINDEX. Refer to the AWS Amazon - VACUUM documentation for more information.

Default value: [None]

Example: FULL

Vacuum threshold (%)

Integer

Specifies the threshold above which VACUUM skips the sort phase. If this property is left empty, Redshift sets it to 95% by default.

Default value: [None]

Snap execution

Dropdown list

Select one of the three modes in which the Snap executes. Available options are:

  • Validate & Execute: Performs limited execution of the Snap, and generates a data preview during Pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during Pipeline runtime.
  • Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Default value: Execute only

Example: Validate & Execute

Redshift's Vacuum Command

In Redshift, when rows are deleted or updated in a table, they are only logically deleted (flagged for deletion), not physically removed from disk. The deleted rows continue to consume disk space, and their blocks are still scanned when a query scans the table. This increases table storage and degrades performance because of otherwise avoidable disk I/O during scans. A vacuum recovers the space from deleted rows and restores the sort order.
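
As an illustration, the statements below show what a vacuum with an explicit threshold, followed by a statistics refresh, looks like when run directly against Redshift; the table name is a placeholder, and the exact statement the Snap issues depends on the Vacuum type and Vacuum threshold (%) settings.

    -- Reclaim space and re-sort rows, skipping the sort phase for a table
    -- that is already at least 99 percent sorted.
    VACUUM FULL public.people TO 99 PERCENT;

    -- Refresh table statistics (the operation behind the Update statistics option).
    ANALYZE public.people;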

Groundplex System Clock and Multiple Snap Instances with the same 'S3 file list' property

The system clock of the Groundplex should be accurate to within a second. The Snap executes the Redshift COPY command to have Redshift load CSV data from S3 files into a temporary table created by the Snap. If Redshift fails to load any record, it stores the error information for each failed CSV record in a system error table in the same Redshift database. Because all errors from all executions go to the same system error table, the Snap executes a SELECT statement to find the errors related to a specific COPY statement execution, using a WHERE clause that includes the CSV filenames, the start time, and the end time. If the system clock of the Groundplex is not accurate to within a second, the Snap might fail to find error records in the error table.

If multiple instances of the Redshift - S3 Upsert Snap have the same S3 file list property value and execute at almost the same time, the Snap can fail to report the correct error documents in the error view. Make sure that Redshift - S3 Upsert Snap instances with the same S3 file list execute one at a time.
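
The Snap's exact query is internal, but a sketch of the general approach against Redshift's system load-error table (STL_LOAD_ERRORS) might look like the following; the file name and timestamps are placeholders.

    -- Hypothetical lookup of load errors for one COPY execution,
    -- filtered by the staged file name and a narrow time window.
    SELECT filename, line_number, colname, err_code, err_reason
    FROM stl_load_errors
    WHERE TRIM(filename) = 's3://my-bucket/staging/people.csv'
      AND starttime BETWEEN '2024-01-01 10:15:00' AND '2024-01-01 10:15:05'
    ORDER BY starttime;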

Troubleshooting

Error: type "e" does not exist

Reason: This issue occurs due to incompatibilities with the recent upgrade in the PostgreSQL JDBC drivers.

Resolution: Download the latest Amazon Redshift JDBC 4.1 driver, use this driver in your Redshift Account configuration, and retry running the pipeline.