Redshift - Bulk Upsert

Overview

This Snap executes a Redshift bulk upsert: it updates records that already exist in the target table and inserts those that do not. Incoming documents are first written to a staging file on S3. A temporary table is then created on Redshift with the contents of the staging file. Finally, an update operation updates existing records in the target table, and an insert operation inserts new records, as applicable.

The COPY command is used to load the staging S3 file to the temporary table.
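
The staging flow described above can be sketched as the sequence of SQL statements it implies. This Python sketch only builds the statement strings and is not the SQL the Snap actually issues. The function name `build_upsert_statements`, the staging table name, the S3 path, the IAM role ARN, and the `CSV GZIP` COPY options are all hypothetical choices made for illustration.

```python
def build_upsert_statements(target, key_columns, data_columns,
                            s3_path, iam_role, stage="upsert_stage"):
    """Return SQL strings for a staged upsert, in execution order.

    Illustrative only: identifiers are interpolated without quoting, and
    every name is supplied by the caller (nothing is taken from the Snap).
    """
    if not key_columns:
        raise ValueError("at least one key column is required")

    # Conditions that match staged rows (alias s) to target rows.
    match_target = " AND ".join(f"{target}.{k} = s.{k}" for k in key_columns)
    match_alias = " AND ".join(f"t.{k} = s.{k}" for k in key_columns)

    statements = [
        # 1. Temporary table with the same columns as the target.
        f"CREATE TEMP TABLE {stage} (LIKE {target});",
        # 2. Load the staged S3 data into the temporary table.
        f"COPY {stage} FROM '{s3_path}' IAM_ROLE '{iam_role}' CSV GZIP;",
    ]
    if data_columns:
        # 3. Update rows whose keys already exist in the target.
        assignments = ", ".join(f"{c} = s.{c}" for c in data_columns)
        statements.append(
            f"UPDATE {target} SET {assignments} FROM {stage} s WHERE {match_target};"
        )
    # 4. Insert staged rows that have no matching key in the target.
    statements.append(
        f"INSERT INTO {target} SELECT s.* FROM {stage} s "
        f"LEFT JOIN {target} t ON {match_alias} WHERE t.{key_columns[0]} IS NULL;"
    )
    return statements


for sql in build_upsert_statements(
        "employees", ["id"], ["name", "salary"],
        "s3://example-bucket/stage/",                  # hypothetical path
        "arn:aws:iam::123456789012:role/ExampleRole"):  # hypothetical role
    print(sql)
```

Running the update before the insert means newly inserted rows are not updated again in the same run.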

Recommended JDBC JAR Version

Use RedshiftJDBC42-1.2.10.1009.jar as the JDBC JAR in the Redshift Account (JDBC jars property) when using this Snap.

This is a Write-type Snap.
Works in Ultra Tasks

Prerequisites

A valid Redshift Account with S3 properties with the required permissions.
IAM Roles for Amazon EC2
The 'IAM_CREDENTIAL_FOR_S3' feature lets an EC2 Groundplex access S3 files without an Access-key ID and Secret key being specified in the AWS S3 account in the Snap. Instead, the IAM credential stored in the EC2 metadata is used to gain access rights to the S3 buckets. To enable this feature, set the following Global property (Key-Value parameter) and restart the JCC:
```
jcc.jvm_options = -DIAM_CREDENTIAL_FOR_S3=TRUE
```
This feature is supported in the EC2-type Groundplex only.

Limitations

If you enable the error view and expect all failed records to be routed to it, you must increase the Maximum error count property accordingly.
If the number of failed records exceeds the Maximum error count, the pipeline execution fails with an exception and the failed records are not routed to the error view.
If the values of all columns in an input document are null, the document is routed to the error view before it is written to S3, and this error does not count toward the Maximum error count.
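
The rules above can be modeled in a few lines of Python. This is a simplified sketch, not the Snap's implementation: the names `route_documents`, `is_failed`, and `MaxErrorCountExceeded` are hypothetical, and "exceeds" is read as strictly greater than, which is an assumption about the exact boundary.

```python
class MaxErrorCountExceeded(Exception):
    """Hypothetical exception standing in for the pipeline failure."""


def route_documents(documents, is_failed, max_error_count):
    """Split documents into (loaded, error_view) per the limitations above.

    `is_failed` is a caller-supplied predicate that stands in for whatever
    makes a record fail during the load.
    """
    loaded, error_view = [], []
    failed_count = 0
    for doc in documents:
        # All-null documents are routed to the error view before staging
        # and do not count toward max_error_count.
        if all(value is None for value in doc.values()):
            error_view.append(doc)
            continue
        if is_failed(doc):
            failed_count += 1
            if failed_count > max_error_count:
                # Execution fails; failed records are not routed anywhere.
                raise MaxErrorCountExceeded(
                    f"{failed_count} failed records exceed the limit of {max_error_count}"
                )
            error_view.append(doc)
        else:
            loaded.append(doc)
    return loaded, error_view


docs = [
    {"id": 1, "name": "ok"},
    {"id": None, "name": None},   # all null: routed, not counted
    {"id": 2, "name": "bad"},     # counted as one failure
]
loaded, errors = route_documents(docs, lambda d: d["name"] == "bad", max_error_count=1)
print(len(loaded), len(errors))  # prints: 1 2
```

With `max_error_count=0`, the same input raises `MaxErrorCountExceeded` instead of returning, which mirrors why the property must be raised when you expect failures to reach the error view.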

Snap views


Type	Description	Examples of upstream and downstream Snaps
Input	This Snap has one input view for the data and a second optional input view for the target table schema.	Mapper, JSON Generator, File Reader
Output	This Snap has at most one output view.
Learn more about Error handling.

Examples

Update or Insert Records Using Bulk Upsert: Update or insert records using the bulk upsert operation.

Snap settings

Note: Learn about the common controls in the Snap settings dialog.


Field/Field set	Description
Label `String`	Required. Specify a unique name for the Snap. Modify this to be more appropriate, especially if more than one of the same Snap is in the pipeline. Default value: Redshift - Bulk Upsert Example: Redshift - Bulk Upsert
Schema name `String/Expression/ Suggestion`	The database schema name. Selecting a schema filters the Table name list to show only those tables within the selected schema. Warning: The values can be passed using the pipeline parameters but not the upstream parameter. Default value: [None] Example: public
Table name `String/Expression/ Suggestion`	Required. Table on which to execute the bulk load operation. Warning: The values can be passed using the pipeline parameters but not the upstream parameter. Default value: [None] Example: employees
Key columns `String/Expression/ Suggestion`	Required. Columns to use to check for existing entries in the target table. Default value: None
Validate input key value `Checkbox`	If selected, all duplicates and null key-column values in the input data are written to the error view and the bulk upsert operation stops. If not selected, duplicates in the input data are inserted into the target table unless one or more of the same duplicate rows already exist in the target table, and null key-column values may cause unexpected results. Any two input documents with the same values for all key columns are considered 'duplicates'. Duplicates are detected after all data is copied to S3 and then to a temporary table in Redshift, and before the data is updated or inserted into the target table. Note that Redshift allows duplicate rows to be inserted regardless of primary columns or key columns. Default value: Not selected
Truncate data `Checkbox`	Truncate existing data before performing the data load. Instead of truncating and then upserting with this Snap, a bulk insert would be faster. Default value: Not selected
Update statistics `Checkbox`	Update table statistics after data load by performing an Analyze operation on the table. Default value: Not selected
Accept invalid characters `Checkbox`	Accept invalid characters in the input. Invalid UTF-8 characters are replaced with a question mark when loading. Default value: Selected
Maximum error count `Integer`	Required. The maximum number of rows that can fail before the bulk load operation is stopped. Default value: 100 Example: 10 (if you want the pipeline execution to continue as long as the number of failed records is less than 10)
Truncate columns `Checkbox`	Truncate column values which are larger than the maximum column length in the table. Default value: Selected
Disable data compression `Checkbox`	Disable compression of data being written to S3. Disabling compression will reduce CPU usage on the Snaplex machine, at the cost of increasing the size of data uploaded to S3. Default value: Not selected
Load empty strings `Checkbox`	If selected, empty string values in the input documents are loaded as empty strings to the string-type fields. Otherwise, empty string values in the input documents are loaded as null. Null values are loaded as null regardless. Default value: Not selected
Additional options `String/Expression`	Additional options to be passed to the COPY command. Check http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html for available options. The COPY command is used to load the staging S3 file to the temporary table. Default value: [None] Example: Date format can be specified as DATEFORMAT 'MM-DD-YYYY'
Parallelism `Integer`	Defines how many files are created in S3 per execution. If set to 1, only one file is created in S3 and used for the COPY command. If set to n with n > 1, n files are created as part of a manifest COPY command, allowing a concurrent copy as part of the Redshift load. The Snap itself does not stream concurrently to S3; it uses a round-robin mechanism on the incoming documents to populate the n files. The order of the records is not preserved during the load. Default value: [None]
Instance type `Dropdown list`	Appears when the Parallelism value is greater than 1. Select the type of instance from the following options: Default: Processes the default instance. High-performance S3 upload optimized: Processes the AWS high-performance EC2 instance such as R6a. Note: The High-performance S3 upload optimized option improves the Snap's performance when using an AWS EC2 R6a instance. Note: When you select the High-performance S3 upload optimized option for the Instance type, the Snap might increase the number of threads depending on the Parallelism property. In these cases, we recommend that you do not execute too many pipelines concurrently. Default value: Default Example: High-performance S3 upload optimized
IAM role `Checkbox`	Enables you to perform the bulk load using an IAM role. If this option is selected, ensure that the AWS account ID, role name and region name are provided in the account. Default value: Not selected
Server-side encryption `Checkbox`	Defines the S3 encryption type to use when temporarily uploading the documents to S3 before the insert into Redshift. Default value: Not selected
KMS Encryption type `Dropdown list`	Specifies the type of KMS S3 encryption to be used on the data. The available encryption options are: None: files are not encrypted using KMS encryption; Server-Side KMS Encryption: the output files on Amazon S3 are encrypted using this encryption with an Amazon S3-generated KMS key. Note: If both the KMS and Client-side encryption types are selected, the Snap gives precedence to the SSE and displays an error prompting the user to select only one of the options. Default value: None
KMS key `String/Expression`	Conditional. This property applies only when the encryption type is set to Server-Side Encryption with KMS. This is the KMS key to use for the S3 encryption. For more information about the KMS key, refer to AWS KMS Overview and Using Server Side Encryption. Default value: [None]
Vacuum type `Dropdown list`	Reclaims space and sorts rows in a specified table after the upsert operation. The available options to activate are FULL, SORT ONLY, DELETE ONLY and REINDEX. Refer to the AWS document on "Vacuuming Tables" for more information. Note: Auto-commit needs to be enabled for Vacuum. Default value: NONE
Vacuum threshold (%) `Integer`	Specifies the threshold above which VACUUM skips the sort phase. If this property is left empty, Redshift sets it to 95% by default. Default value: [None]
Encryption type `Dropdown list`	Defines the S3 encryption type to use when temporarily uploading the documents to S3 before the insert into Redshift. Select one of the following three options: None; Server-Side Encryption, which encrypts output files on Amazon S3 using the default server-side encryption; Server-Side Encryption with KMS, which encrypts output files on Amazon S3 using server-side encryption with a KMS key. Default value: None
Snap execution `Dropdown list`	Select one of the three modes in which the Snap executes. Available options are: Validate & Execute: Performs limited execution of the Snap and generates a data preview during Pipeline validation, then performs full execution of the Snap (unlimited records) during Pipeline runtime. Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data. Disabled: Disables the Snap and all Snaps that are downstream from it. Default value: Execute only Example: Validate & Execute
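
The check described for the Validate input key value setting can be illustrated in Python. This is a minimal sketch of the stated definition (two documents are duplicates when all key columns match), not the Snap's code, which performs the check on the temporary table in Redshift. The function name `find_invalid_key_rows` is hypothetical.

```python
from collections import Counter


def find_invalid_key_rows(documents, key_columns):
    """Return (duplicates, null_keys) for the given key columns.

    A document has a null key if any key column is missing or None.
    Among the rest, documents sharing the same values for all key
    columns are reported as duplicates.
    """
    def key_of(doc):
        return tuple(doc.get(column) for column in key_columns)

    null_keys = [d for d in documents if None in key_of(d)]
    valid = [d for d in documents if None not in key_of(d)]

    # Count how often each complete key occurs among valid documents.
    counts = Counter(key_of(d) for d in valid)
    duplicates = [d for d in valid if counts[key_of(d)] > 1]
    return duplicates, null_keys


docs = [
    {"id": 1, "name": "a"},
    {"id": 1, "name": "b"},     # same key as the first row
    {"id": 2, "name": "c"},
    {"id": None, "name": "d"},  # null key
]
duplicates, null_keys = find_invalid_key_rows(docs, ["id"])
print(len(duplicates), len(null_keys))  # prints: 2 1
```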
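
The round-robin mechanism mentioned for the Parallelism setting can be pictured with a short Python sketch. It distributes documents across n in-memory lists rather than S3 files, and the function name `distribute_round_robin` is hypothetical; how the Snap actually buffers and uploads files is not covered here.

```python
def distribute_round_robin(documents, parallelism):
    """Assign documents to `parallelism` buckets in round-robin order."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    files = [[] for _ in range(parallelism)]
    for index, doc in enumerate(documents):
        # Document i goes to bucket i mod n.
        files[index % parallelism].append(doc)
    return files


print(distribute_round_robin(list(range(7)), 3))
# prints: [[0, 3, 6], [1, 4], [2, 5]]
```

Because consecutive documents land in different files that are then copied concurrently, the original record order cannot be relied on after the load.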

Troubleshooting

None.