Snowflake - Bulk Load

Overview

You can use the Snowflake - Bulk Load Snap to load data from an input view or from files stored on external object stores, such as Amazon S3, Google Cloud Storage, and Azure Blob Storage, into the Snowflake data warehouse.


Bulk Load Snap Configuration

Prerequisites

You must have minimum permissions on the database to execute Snowflake Snaps. To verify whether you already have them, retrieve your current set of permissions with the following commands:

SHOW GRANTS ON DATABASE <database_name>
SHOW GRANTS ON SCHEMA <schema_name>
SHOW GRANTS TO USER <user_name>
  • An external volume must be created beforehand, either in a Snowflake worksheet or with the Snowflake - Execute Snap, as sketched below. Learn more about creating external volumes.
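A hypothetical example of creating an S3-backed external volume in a worksheet (the volume name, bucket, and role ARN below are placeholders):

-- Hypothetical names for illustration; replace the volume name, bucket, and IAM role ARN.
CREATE EXTERNAL VOLUME my_iceberg_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-location'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://my-bucket/iceberg/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-snowflake-role'
    )
  );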
Security Prerequisites
You must have the following permissions in your Snowflake account to execute this Snap:
  • Usage (DB and Schema): Privilege to use the database, role, and schema.
  • Create table: Privilege to create a temporary table within this schema.
The following commands enable minimum privileges in the Snowflake console:
grant usage on database <database_name> to role <role_name>;
grant usage on schema <database_name>.<schema_name> to role <role_name>;
 
grant "CREATE TABLE" on database <database_name> to role <role_name>;
grant "CREATE TABLE" on schema <database_name>.<schema_name>;

Learn more about Snowflake privileges: Access Control Privileges.

Internal SQL Commands
This Snap internally uses the following Snowflake commands (see the sketch after this list):
  • COPY INTO - Loads data from staged files into an existing table.
  • PUT - Stages files internally in a table or user stage.
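A minimal sketch of this sequence when the Data source is Input view and the Staging location is Internal, assuming a hypothetical target table employees_table and a local temporary CSV file:

-- Hypothetical file path, stage folder, and table name.
PUT file:///tmp/employees.csv @~/bulkload_tmp;

COPY INTO employees_table
  FROM @~/bulkload_tmp
  FILE_FORMAT = (TYPE = CSV);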
Note: The following requirements are mandatory when using an external staging location:
  • Amazon S3 Bucket: The Snowflake account must include the S3 Access Key ID, S3 Secret Key, S3 Bucket Name, and S3 Folder Path, and the S3 bucket must be located in the same region as your Snowflake cluster.
  • Microsoft Azure Blob Storage: A valid and active Snowflake Azure database account is required.
  • Google Cloud Storage: Ensure the storage bucket has the correct public access settings and access control permissions configured on the Google Cloud Platform.

Limitations and Known Issues

  • The special character ~ is not supported in the temporary directory name on Windows. It is reserved for the user's home directory.
  • Snowflake supports cross-account access through Storage Integration. This method allows you to use Cross Account IAM in external staging without passing credentials explicitly. Access is granted via the named stage or integration object. Learn more about configuring Cross Account IAM role support.
  • The Snowflake Bulk Load Snap expects the column order in the input data to match the target table's structure. Mismatched column order will cause data validation to fail.
  • If the Snowflake Bulk Load Snap fails due to insufficient memory on the JCC node while using an Input View as the data source and Internal Stage as the staging location, consider switching to an external staging location such as Amazon S3, Azure Blob, or Google Cloud Storage.
  • When a bulk load operation fails due to invalid input that lacks default columns, the error view may not display the incorrect columns properly. This is a known Snowflake issue tracked under JIRA SNOW-662311 and JIRA SNOW-640676.
  • This Snap does not support creating Iceberg tables with an external catalog in Snowflake. Currently, Snowflake allows only read-only access to such tables and does not support write capabilities. Learn more about Iceberg catalog options.
  • Snowflake does not support cross-cloud or cross-region Iceberg tables when using Snowflake as the Iceberg catalog. If the Snap displays an error such as External volume <volume_name> must have a STORAGE_LOCATION defined in the local region, ensure that the external volume uses a storage location in the same region as your Snowflake account. You can inspect the volume's configured locations as shown below.
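For example, you can describe an external volume from a worksheet to review its storage locations:

DESC EXTERNAL VOLUME <volume_name>;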

Behavior change

Starting from 4.35 GA, if your Snaplex is behind a proxy and the Snowflake Bulk Load Snap uses the default Snowflake driver (3.14.0 JDBC driver), then the Snap might encounter a failure. There are two ways to prevent this failure:
  1. Add the following key-value pair in the Global properties section of the Node Properties tab: Key: jcc.jvm_options, Value: -Dhttp.useProxy=true
  2. Add the following key-value pairs in the URL properties of the Snap under Advanced properties.
Note: We recommend you use the second approach to prevent the Snap’s failure.

Known Issues

Known Issue: The create table operation fails if the table contains a geospatial data type column.

Workaround: Create the table manually ahead of time, or use the Snowflake - Execute Snap to create the table.

Because of performance issues, all Snowflake Snaps now ignore the Cancel queued queries when pipeline is stopped or if it fails option for Manage Queued Queries, even when selected. Snaps behave as though the default option, Continue to execute queued queries when the Pipeline is stopped or if it fails, were selected.

Snap views

View Description Examples of upstream and downstream Snaps
Input Documents containing the data to be uploaded to the target location.
Note: Second Input View

This Snap has one document input view by default.

You can add a second input view to supply the table metadata as a document so that, when the target table is absent, the table can be created in the database with a schema similar to the source table. This schema usually comes from the second output of a database Select Snap. If the schema comes from a different database, the data types might not be handled properly.

Learn more about adding metadata for the table in the second input view from the example Providing Metadata For Table Using The Second Input View.
Output If an output view is available, the output document displays the number of input records and the status of the bulk upload.
Error

Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter when running the pipeline by choosing one of the following options from the When errors occur list under the Views tab. The available options are:

  • Stop Pipeline Execution: Stops the current pipeline execution when an error occurs.
  • Discard Error Data and Continue: Ignores the error, discards that record, and continues with the remaining records.
  • Route Error Data to Error View: Routes the error data to an error view without stopping the Snap execution.

Learn more about Error handling in Pipelines.

Snap settings

Legend:
  • Expression icon: Allows using JavaScript syntax to access SnapLogic Expressions to set field values dynamically (if enabled). If disabled, you can provide a static value. Learn more.
  • SnapGPT: Generates SnapLogic Expressions based on natural language using SnapGPT. Learn more.
  • Suggestion icon: Populates a list of values dynamically based on your Snap configuration. You can select only one attribute at a time using the icon. Type into the field if it supports a comma-separated list of values.
  • Upload: Uploads files. Learn more.
Learn more about the icons in the Snap settings dialog.
Field / Field set Type Description
Label String Required. Specify a unique name for the Snap. Modify the default label, especially when the pipeline contains more than one Snap of the same type.
Schema Name String/Expression/Suggestion Required. Specify the database schema name. If you do not define it, the Table Name suggestion retrieves the table names of all schemas. The field is suggestible and retrieves the available database schemas when you request suggestions.
Note: The values can be passed using the Pipeline parameters but not the upstream parameter.

Default value: N/A

Example: Schema_demo

Table Name String/Expression/Suggestion Required. Specify the name of the table on which to execute the bulk load operation.
Note: The values can be passed using the Pipeline parameters but not the upstream parameter.

Default value: N/A

Example: employees_table

Create table if not present Checkbox Select this checkbox to automatically create the target table if it does not exist.
Note:
  • The data type for the columns in the new table depends on the data type of the input from the upstream Snap. If a second input view exists, the data types for the columns are read from that input view.
  • Due to implementation details, a newly created table is not visible to subsequent database Snaps during runtime validation. If you want to use the newly loaded data immediately, you must use a child pipeline invoked through a Pipeline Execute Snap.
  • Create table if not present works only when the Data source is set to Input view. When you select Create table if not present and the Data source is set to Staged files, the Snowflake Bulk Load Snap throws the configuration exception "Failure: Invalid snap configuration" because the Snap does not support table creation when you load existing staged files. Create the table manually and retry.
  • If you provide a second input view and select Create table if not present when the target table does not exist, the target table is created only from the metadata of the second input view, not from the input document.
Note: Do not use this option in production: the created table has no indexes or integrity constraints on any column, and the default VARCHAR() column size exceeds 30,000 bytes. See the illustration after this field description.
Note: The create table operation fails if the table contains a geospatial data type column. Workaround: Create the table manually ahead of time, or use the Snowflake - Execute Snap to create the table.

Default value: N/A

Example: employees_table
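For illustration, a table auto-created from document input resembles the following sketch; the column names are hypothetical, and every string input field becomes a wide VARCHAR column with no constraints:

-- Hypothetical sketch only; actual columns and sizes depend on the input data.
CREATE TABLE employees_table (
  "name"  VARCHAR(32768),
  "email" VARCHAR(32768)
);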

Iceberg table Checkbox Appears when you select Create table if not present. Select this checkbox to create an Iceberg table with the Snowflake catalog. Learn more about how to create an Iceberg table with Snowflake as the Iceberg catalog.

Default status: deselected

External volume String/Expression/Suggestion Required. Appears when you select the Iceberg table checkbox. Specify the external volume for the Iceberg table. Learn more about how to configure an external volume for Iceberg tables.

Default value: N/A

Example: employees

Base location String/Expression Required. Appears when you select the Iceberg table checkbox. Specify the Base location for the Iceberg table.
Note: The base location is the relative path from the external volume.

Default value: N/A

Example: employees

Data source Dropdown list Specify the source from which the data loads. The available options are Input view and Staged files.
Note: When you select Input view, leave the Table Columns field set empty; when you select Staged files, provide the column names in Table Columns to which the records are to be added.

Default value: Input view

Example: Staged files

Preserve case sensitivity Checkbox
Select this checkbox to preserve the case sensitivity of the column names.
  • If you do not select Preserve case sensitivity, the input documents are loaded into the target table when the key names in the input documents match the target table column names, ignoring case.
  • If you include a second input view, selecting Preserve case sensitivity has no effect on the column names of the target table, because the Snap uses the metadata from the second input view.

Default status: deselected

Load empty strings Checkbox
Select this checkbox to load empty string values in the input documents as empty strings into string-type fields. Otherwise, empty string values in the input documents are loaded as null. Null values are loaded as null regardless.
  • When you select this checkbox, null values are written to the temporary files in the staging location as the placeholder string ^_ to optimize file size before loading into Snowflake. Consequently, if any column in the source data contains the literal string ^_, it is loaded as NULL into the target table when this checkbox is selected.
  • When you deselect this checkbox, empty strings are loaded as null values.

Default status: deselected

Truncate data Checkbox Select this checkbox to truncate the existing data in the target table before performing the data load. Instead of truncating and then updating with the Bulk Update Snap, a bulk insert is faster.

Default status: deselected

Staging location Dropdown list/Expression Select the type of staging location that is to be used for data loading:
  • External: A location that is not managed by Snowflake. The location should be an AWS S3 bucket, Microsoft Azure Blob Storage container, or Google Cloud Storage bucket. The corresponding credentials are mandatory when validating the account.
  • Internal: Location that is managed by Snowflake.
Note: The Snap creates temporary files in JCC when the Staging location is internal and the Data source is input view. These temporary files are removed automatically once the Pipeline completes execution.

Vector support:

  • This Snap supports loading data in vector data type only when the Staging location is Internal.
  • The input vector data must be an array.
  • If you use a SELECT query, you must parse the vector.

If you want to delete the temporary files from the S3 Bucket, we recommend you assign the delete object permission policy to the S3 user to delete the files. Learn how to assign delete object permission to an S3 user in AWS S3. If you do not want to delete the temporary files, you can add an error view to the Snap and run the pipeline.

Default value: Internal

Example: External

Flush chunk size (in bytes) String/Expression Appears when you select Input view for Data source and Internal for Staging location.

When using internal staging, data from the input view is written to a temporary chunk file on the local disk. When the size of a chunk file exceeds the specified value, the current chunk file is copied to the Snowflake stage and then deleted. A new chunk file simultaneously starts to store the subsequent chunk of input data. The default size is 100,000,000 bytes (100 MB), which is used if this value is left blank.

Default value: N/A

Example: 400000000

Target String/Expression Specify an internal or external location to load the data. If you select External for Staging location, a staging area is created in Azure, GCS, or S3 as configured. Otherwise, a staging area is created in Snowflake's internal location. This field accepts the following input:
  • Named Stage: The name of a user-defined named stage. Use this only when Staging location is set to Internal. Format: @<Schema>.<StageName>[/path]
  • Internal Path: A staging location represented by a path. Format: @~/[path]
  • S3 URL: The external S3 URL that specifies an S3 storage bucket. Format: s3://[path]
  • Microsoft Azure Storage Blob URL: The external URL required to connect to Microsoft Azure Storage.
  • Folder Name: Anything else (including no input) is treated as a folder name under the internal home path (@~) when using internal staging, or under the S3 bucket and folder specified in the Snowflake account.
    Note: When you use expression values, provide the value as a pipeline parameter; for performance reasons, it cannot come from the upstream Snap.

Default value: N/A

Example: s3://test_bucket
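For example, a named internal stage that you could reference in this field as @demo.my_int_stage can be created with the following command (the schema and stage names are placeholders):

CREATE STAGE demo.my_int_stage;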

Storage Integration String/Expression Specify the pre-defined storage integration that is used to authenticate the external stages.
Note: When you use expression values, provide the value as a pipeline parameter; for performance reasons, it cannot come from the upstream Snap.

Default value: N/A

Example: External
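A hypothetical example of defining a storage integration for an S3 external stage (the integration name, role ARN, and bucket are placeholders):

-- Hypothetical names for illustration only.
CREATE STORAGE INTEGRATION my_s3_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my-snowflake-role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://test_bucket/');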

Staged file list Use this field set to define the staged file(s) to be loaded to the target table.
Staged file String Appears when you select Staged files for Data source. Specify the staged file to be loaded to the target table.

Default value: N/A

Example: File_name

File name pattern String Appears when you select Staged files for Data source. Specify a regular expression pattern string, enclosed in single quotes, that matches the file names and/or paths to load.

Default value: N/A

Example: .length
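For illustration, a pattern supplied here surfaces as the PATTERN option of the generated COPY INTO statement; the table, stage, and pattern below are hypothetical:

-- Hypothetical pattern matching gzip-compressed CSV files in the stage.
COPY INTO employees_table
  FROM @mystage
  PATTERN = '.*employees.*[.]csv[.]gz';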

File format object String Specify an existing file format object to use for loading data into the table. The specified file format object determines the format type such as CSV, JSON, XML, AVRO, or other format options for data files.

Default value: N/A

Example: jsonPath()

File format type String Specify a predefined file format object to use for loading data into the table. The available file formats include CSV, JSON, XML, and AVRO.
Note: This Snap supports only CSV and NONE as file format types when the Data source is Input view.

Default value: N/A

Example: CSV

File format option String Specify the file format options. Separate multiple options with blank spaces or commas.
Note:

You can use various file format options, including a binary format, which passes through in the same way as other file formats. Learn more: File Format Type Options. Before loading binary data into Snowflake, you must specify the binary encoding format so that the Snap can decode string-type values to binary types before loading them. Do this by specifying the binary file format option BINARY_FORMAT=XXX (where XXX is HEX, BASE64, or UTF8). The formats you use to upload and download a file must match: for instance, if you load a file in HEX binary format, specify HEX for the download as well.

Note: When using external staging locations:
  • When loading numeric data from staged files, you must provide file format options appropriate to your data.
  • Do not use the following file format options if you selected Input view in the Data source field:
    • FIELD_DELIMITER
    • RECORD_DELIMITER
    • FIELD_OPTIONALLY_ENCLOSED_BY

Default value: N/A

Example: BINARY_FORMAT=UTF8
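A hypothetical sketch of how these options surface in the generated COPY INTO statement (the table, stage, and option values are placeholders):

COPY INTO employees_table
  FROM @mystage
  FILE_FORMAT = (TYPE = CSV BINARY_FORMAT = UTF8 SKIP_HEADER = 1);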

Table Columns Use this field set to specify the columns to use in the COPY INTO command. This applies only when the Data source is Staged files.
Columns String Appears when you select Staged files for Data source. Specify the table columns into which the staged data is loaded.

Default value: N/A

Example: emp_id

Select Query String/Expression Appears when the Data source is Staged files.

Specify the SELECT query to transform the data before loading it into the Snowflake database.

The SELECT statement transform option enables querying the staged data files by either reordering the columns or loading a subset of table data from a staged file. For example, select $1:location, $1:dimensions.sq_ft, $1:sale_date, $1:price from @mystage/sales.json.gz t

This query loads the file sales.json.gz from the internal stage mystage (which stores the data files internally), where location, dimensions.sq_ft, sale_date, and price are the objects.

(OR)

select substr(t.$2,4), t.$1, t.$5, t.$4 from @mystage t

This query reorders the column data from the internal stage mystage before loading it into a table. The SUBSTR / SUBSTRING function removes the first few characters of a string before inserting it.

Note: We recommend that you do not use a temporary stage while loading your data.

Default value: N/A

Example: select substr(t.$2,4), t.$1, t.$5, t.$4 from @mystage t
Encryption type Dropdown list Specify the type of encryption to be used on the data. The available encryption options are:

  • None: Files do not get encrypted.
  • Server-Side Encryption: The output files on Amazon S3 are encrypted with server-side encryption.
  • Server-Side KMS Encryption: The output files on Amazon S3 are encrypted with an Amazon S3-generated KMS key.

The KMS Encryption option is available only for S3 accounts (not for Azure and GCS accounts) with Snowflake.

Note: If Staging location is set to Internal and Data source is Input view, the Server-Side Encryption and Server-Side KMS Encryption options are not supported for Snowflake Snaps. This is because Snowflake encrypts data loaded into its internal staging area and does not allow the user to specify the encryption type in the PUT API. Learn more: Snowflake PUT Command Documentation.

Default value: None

Example: Server-Side Encryption
KMS key String/Expression Specify the KMS key that you want to use for S3 encryption. Learn more about the KMS key: AWS KMS Overview and Using Server-Side Encryption.

Default value: N/A

Example: <Encrypted>

Buffer size (MB) String/Expression Specify the amount of data, in MB, to load into the S3 bucket at a time. This property is required when bulk loading to Snowflake using AWS S3 as the external staging area.

Minimum value: 5 MB

Maximum value: 5000 MB

Note: S3 allows a maximum of 10,000 parts per upload, so configure this property accordingly to optimize the bulk load. For example, a 10 MB buffer caps a single load at roughly 100 GB (10 MB × 10,000 parts). Refer to Upload Part for more information on uploading to S3.

Default value: 10MB

Example: 20MB
Manage Queued Queries Dropdown list Select this property to determine whether the Snap should continue or cancel the execution of queued Snowflake Execute SQL queries when you stop the pipeline.

If you select Cancel queued queries when the pipeline is stopped or if it fails, the read queries under execution are canceled, whereas the write queries under execution are not. Snowflake internally determines which queries are safe to cancel and cancels only those.

Default value: Continue to execute queued queries when the Pipeline is stopped or if it fails

Example: Cancel queued queries when the Pipeline is stopped or if it fails

On Error Dropdown list Select an action to perform when errors are encountered in a file (see the example after this list). The available actions are:
  • ABORT_STATEMENT: Aborts the COPY statement if any error is encountered. The error will be thrown from the Snap or routed to the error view.
  • CONTINUE: Continues loading the file. The error will be shown as part of the output document.
  • SKIP_FILE: Skips file if any errors are encountered in the file.
  • SKIP_FILE_*error_limit*: Skips file when the number of errors in the file exceeds the specified limit.
  • SKIP_FILE_*error_percent_limit*%: Skips file when the percentage of errors in the file exceeds the specified threshold.
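For illustration, selecting SKIP_FILE_*error_limit* with an Error Limit of 10 corresponds to the following clause in the generated COPY statement (the table and stage names are hypothetical):

COPY INTO employees_table
  FROM @mystage
  ON_ERROR = SKIP_FILE_10;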
Error Limit Integer Appears when you select SKIP_FILE_*error_limit* for On Error. Specify the error limit to skip file loading when the number of errors exceeds the limit.

Default value: 0

Example: 3

Error Percentage Limit Integer Appears when you select SKIP_FILE_*error_percent_limit*% for On Error. Specify the percentage of errors to skip file loading.

Default value: 0

Example: 1

Size Limit Integer Specify the maximum size (in bytes) of data to be loaded. A null value indicates no size limit.
Note: At least one file is loaded regardless of the value specified for SIZE_LIMIT, unless there are no files to load.

Default value: 0

Example: 5

Purge Checkbox Appears when the Staging location is External. Specify whether to purge the data files from the location automatically after the data is successfully loaded.

Default status: deselected

Return Failed Only Checkbox Specify whether to return only files that have failed to load while loading.

Default status: deselected

Force Checkbox Select this checkbox to load all files, regardless of whether they have been loaded previously and have not changed since they were loaded.

Default status: deselected

Truncate Columns Checkbox Select this checkbox to truncate column values that are larger than the maximum column length in the table.

Default status: deselected

Strings considered as null String/Expression Specify the strings to be considered as null when loading into Snowflake. You can configure multiple case-sensitive comma-separated strings from the input data as null placeholders. This ensures effective handling of null values without impacting existing data and optimizes the storage space in the internal stage.

Default value: N/A

Example: ^_AB, _nullIf
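Conceptually, the strings listed here act like Snowflake's NULL_IF file format option; a hypothetical rendering in the generated COPY statement (the table and stage names are placeholders):

COPY INTO employees_table
  FROM @mystage
  FILE_FORMAT = (TYPE = CSV NULL_IF = ('^_AB', '_nullIf'));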

Validation Mode Dropdown list Select the validation mode for visually verifying the data before loading it. The available options are:
  • NONE
  • RETURN_n_ROWS
  • RETURN_ERRORS
  • RETURN_ALL_ERRORS

Default value: N/A

Example: RETURN_n_ROWS

Validation Errors Type Dropdown list Appears when Validation Mode is not NONE. Select one of the following methods for displaying the validation errors:
  • Aggregate errors per row: Provides a summary view of the errors. You can expand rows to reveal a detailed view of the errors.
  • Full error: Provides the complete error message.
  • Do not show errors: Suppresses the display of validation errors.

Default value: Full error

Example: Do not show errors

Rows to Return Integer Appears when you select RETURN_n_ROWS, RETURN_ERRORS, or RETURN_ALL_ERRORS for Validation Mode. Specify the number of rows to validate. The specified rows are not loaded into the target table; instead, the data is validated and results are returned based on the selected validation option.

Default value: 0

Example: 5
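For illustration, Validation Mode RETURN_n_ROWS with Rows to Return set to 5 corresponds to the following statement, which validates five rows without loading them (the table and stage names are hypothetical):

COPY INTO employees_table
  FROM @mystage
  VALIDATION_MODE = RETURN_5_ROWS;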

Snap execution Dropdown list
Choose one of the three modes in which the Snap executes. Available options are:
  • Validate & Execute: Performs limited execution of the Snap and generates a data preview during pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during pipeline runtime.
  • Execute only: Performs full execution of the Snap during pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Default value: Execute only

Example: Validate & Execute

Instead of building multiple Snaps with interdependent DML queries, we recommend that you use the Stored Procedure or Multi Execute Snap.

In a scenario where a downstream Snap depends on the data processed by an upstream database Bulk Load Snap, use the Script Snap to add a delay so that the data is available.

For example, when performing create, insert, and delete operations sequentially in a pipeline, the Script Snap helps create a delay between the insert and delete operations. Otherwise, the delete operation may be triggered before the records are inserted into the table.

Troubleshooting

Error: Data can only be read from Google Cloud Storage (GCS) with the supplied account credentials (not written to it).

Reason: Snowflake Google Storage database accounts do not support external staging when the Data source is Input view.

Resolution: Use internal staging if the data source is the input view, or change the data source to staged files for Google Cloud Storage external staging.