BigQuery Bulk Load - Cloud Storage
This Snap performs a bulk load operation into a Google BigQuery database.
Overview
This Snap performs a bulk load operation into a Google BigQuery database. Depending on the Snap's configuration, it does so either by using data from incoming documents or by using existing files in the Google Cloud Storage bucket. The Snap supports three file formats: CSV, JSON, and AVRO.
When using incoming documents:
When data from incoming documents is loaded, the data is first uploaded to a temporary file on Google Cloud Storage, and from that temporary file the data is loaded into the destination table. You can choose to either retain or delete this temporary file after the Snap terminates.
When using existing files from Google Cloud Storage:
When existing files from Google Cloud Storage are used, the data is loaded directly from the specified files into the destination table; no temporary files are created for this operation. However, you can choose to either retain or delete these source files after the Snap terminates.
ETL Transformations and Data Flow:
The BigQuery Bulk Load (Cloud Storage) Snap performs a bulk load of the input records into the specified database. If the data is loaded from incoming documents, it is first written to a temporary file in Cloud Storage and loaded from there into the destination table. You can configure the Snap to either retain or delete this temporary file after execution.
- Snap Type: Write
- Works in Ultra Tasks
Consider the following points when using this Snap:
- Ensure that the schema of the incoming documents or existing files matches the schema of the destination table when using the CSV format. Refer to the Troubleshooting section for a workaround.
- Ensure that the destination table exists when using existing files.
- Ensure that the selected file format for the temporary file supports the data types in the incoming documents when using incoming documents. For example, the CSV format does not support arrays/lists, and the AVRO format does not support the DATETIME data type.
- The JSON format works for any data type and any complex schema.
Snaps in the Google BigQuery Snap Pack:
- Write datetime values to the database tables, always in UTC format.
- Convert any non-UTC values in the incoming data to UTC before writing them.
- Treat datetime values that do not specify a time zone as UTC.
Therefore, ensure that you include the time zone in all datetime values that you load into Google BigQuery tables using this Snap.
For example: "2020-08-29T18:38:07.370 America/Los_Angeles", "2020-09-11T10:05:14.000-07:00", "2020-09-11T17:05:14.000Z"
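The normalization the Snap applies can be reproduced outside the pipeline. Below is a minimal Python sketch (the `to_utc_iso` helper is hypothetical, not part of the Snap Pack) that converts an ISO-8601 datetime string to UTC, assuming a named time zone when no offset is present:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_utc_iso(value: str, tz_name: str = "UTC") -> str:
    """Parse an ISO-8601 datetime string; if it carries no offset,
    assume tz_name; then normalize to UTC with a trailing 'Z'."""
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        tz = timezone.utc if tz_name == "UTC" else ZoneInfo(tz_name)
        dt = dt.replace(tzinfo=tz)
    iso = dt.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    return iso.replace("+00:00", "Z")

print(to_utc_iso("2020-09-11T10:05:14.000-07:00"))  # 2020-09-11T17:05:14.000Z
```

Values already carrying an offset (like the `-07:00` example above) are simply shifted to UTC; naive values are stamped with the assumed zone first.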
Supported Accounts
This Snap works with the following account types. For more information, see Configuring Google BigQuery Accounts.
Prerequisites
Write access to the Google BigQuery account and read and write access to the Google Cloud Storage account are required.
Limitations and Known Issues
- When using incoming documents with arrays/lists, CSV cannot be selected as the file format for the temporary file; doing so causes an error. Select JSON or AVRO instead.
- When using incoming documents with date-time data types, AVRO cannot be selected as the file format for the temporary file; doing so causes an error. Select CSV or JSON instead.
- When uploading incoming documents via a CSV file, or uploading an existing CSV file from Google Cloud Storage, make sure the CSV file contains all the destination table columns in the same order as the table. Otherwise, use the workarounds in the Troubleshooting section to handle a CSV file that does not contain all the table columns in the same order.
- When uploading from existing files on Google Cloud Storage, selecting the Create table if not present checkbox throws an exception. To avoid this, ensure the destination table exists.
Known Issue
Copying data by creating a table with the same name in Google BigQuery immediately after deleting it might not insert the rows as expected. This behavior is due to the way tables are cached and the internal table ID is propagated through the system.
Workaround
We recommend that you avoid rewriting tables in Google BigQuery and suggest the following workarounds. You can use them individually or together, to suit your requirement.
- Truncate the existing table instead of deleting it.
- Add a randomly generated prefix to the table name each time you create the table (a new, unique table name every time).
Snap views
| Type | Description | Examples of upstream and downstream Snaps |
|---|---|---|
| Input | This Snap has at most one document input view. Any Snap that provides a document output view, such as Structure or JSON Generator, can be upstream. Pipeline parameters can also be passed, but only for values such as bucket ID, project ID, and table ID. When existing files from the Google Cloud Storage bucket are used, the input data is taken from the source files. | |
| Output | This Snap has exactly one document output view. The output is in document format and displays the statistics for the completed bulk load operation. | |
| Error | Learn more about Error handling. | |
Snap settings
| Field/Field set | Description |
|---|---|
| Label (String) | Required. Specify a unique name for the Snap. Modify this to be more appropriate, especially if more than one of the same Snap is in the pipeline. Default value: N/A. Example: BigQuery Bulk Load Cloud Storage |
| Project ID (String/Expression) | Required. Specify the project ID of the project containing the table. Default value: None. Example: project1234 |
| Dataset ID (String/Expression) | Specify the dataset ID of the dataset containing the table. Default value: None. Example: dataset1234 |
| Table ID (String/Expression) | Required. Specify the name of the target table into which the records are loaded. This is a suggestible field that lists all the tables in the dataset. Default value: None. Example: table1234 |
| Create table if not present (Checkbox) | Select this checkbox to create the table automatically if it does not already exist. The Snap infers the table schema from the incoming documents and creates the table with the inferred schema. Do not select this checkbox if you choose Upload existing files from Google Cloud Storage for Upload type; otherwise, an exception is thrown and the Snap does not execute. Default: Not selected |
| Bucket name (String/Expression) | Specify the name of the Google Cloud Storage bucket to use for the operation. This is a suggestible field that lists all the buckets in the account. Default value: None. Example: project1234 |
| Upload type (Dropdown list) | Choose the data source for the Snap. The available options are: Upload incoming documents and Upload existing files from Google Cloud Storage. Based on the option you choose, configure only the corresponding section below (Properties for uploading incoming documents or Properties for uploading existing files from Google Cloud Storage). Default: Upload existing files from Google Cloud Storage. Example: Upload incoming documents |
| Properties for uploading incoming documents | Configure this section if you selected Upload incoming documents in the Upload type field. |
| File format (Dropdown list) | Choose the preferred format for the temporary file: CSV, JSON, or AVRO. Select the format based on the data types it supports: if the incoming documents contain arrays or lists, selecting CSV throws an exception and the Snap does not execute; select JSON or AVRO instead. Similarly, do not use AVRO for date-time data types. If the incoming documents do not contain all the table columns in the same order as the destination table, do not use CSV. Default: CSV. Example: JSON |
| Temp file name (String/Expression) | Specify the name of the temporary file to create in the Google Cloud Storage bucket. If no name is provided, a system-generated file name is used. When using a temporary file to load data, use a unique temporary file for each BigQuery Bulk Load Snap in the pipeline; otherwise, the Snaps may fail randomly. Default: [None] |
| Preserve temp file (Checkbox) | Select this checkbox to retain the temporary file created for the load operation after the Snap's execution; deselect it to have the file deleted. Default: Selected |
| Batching (Checkbox) | Select this checkbox to process the input records in batches. |
| Batch Size (Integer/Expression) | Specify the number of records to batch per request. For example, if the input has 10,000 records and the batch size is set to 100, the Snap makes 100 requests in total. |
| Batch Timeout, milliseconds (Integer/Expression) | Specify the time in milliseconds after which a non-empty batch is processed even if it contains fewer records than the batch size. |
| Properties for uploading existing files from Google Cloud Storage | Configure this section if you selected Upload existing files from Google Cloud Storage in the Upload type field. |
| File paths | Select as many files as needed. When the pipeline executes, the output lists one record per file. Based on the files added, the Snap groups them into categories (CSV with header and delimiter, CSV with delimiter but without header, JSON, and AVRO). This distinction is maintained in the output preview as well, where records are shown by file type. |
| File format (Dropdown list) | Choose the format of the existing files: CSV, JSON, or AVRO. Default: CSV |
| File path (String/Expression) | Specify the file's location in the Google Cloud Storage bucket. Default: [None]. Example: gs://gcs_existingbucket/existing_file.csv |
| CSV file contains headers (Checkbox) | Select this checkbox to indicate that the CSV file contains headers, which enables the Snap to differentiate between the headers and the records. Default: Not selected |
| CSV delimiter (String/Expression) | Specify the delimiter for the CSV file. This is needed only for CSV file types. All custom delimiters supported by BigQuery are supported as well. Default: , (comma). Example: \| (pipe) |
| Delete files upon exit (Checkbox) | Select this checkbox to delete the source files from which the data was loaded after the Snap's execution (the counterpart of Preserve temp file). Default: Not selected |
| Snap execution (Dropdown list) | Choose one of the three modes in which the Snap executes. The available options are: Validate & Execute, Execute only, and Disabled. Default value: Execute only |
Implicit retries in BigQuery Snaps
The BigQuery Snaps handle all retriable BigQuery errors (BigQuery exception, IO exception, and Runtime exception) internally.
- 429 (Too Many Requests):
- Retry attempts: Maximum of 5 retries.
- Delay Between Retries: Backoff strategy with jitter (random variation) is applied to prevent synchronized retries and reduce load.
- 401 (Unauthorized):
- Retry attempts: Maximum of 3 retries.
- Delay Between Retries: Backoff strategy is applied.
- Additional Actions: Reloads the BigQuery account on the retry event.
- IOException and 500, 502, 503, 504 (Server Errors):
- Retry attempts: Maximum of 3 retries.
- Delay Between Retries: Backoff strategy is applied.
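The retry behavior described above, exponential backoff with jitter, can be sketched as follows. This helper is illustrative only, not the Snap Pack's implementation; the status codes and retry counts are taken from the list above:

```python
import random
import time

# Status codes the list above marks as retriable
RETRYABLE = {429, 500, 502, 503, 504}

def call_with_backoff(request, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry a callable returning an HTTP status code, using
    exponential backoff with full jitter between attempts."""
    for attempt in range(max_retries + 1):
        status = request()
        if status not in RETRYABLE:
            return status
        if attempt == max_retries:
            break
        # Exponential backoff capped at max_delay, with full jitter
        # (a random delay in [0, cap]) to avoid synchronized retries.
        delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError(f"Giving up after {max_retries} retries (status {status})")
```

Full jitter spreads simultaneous clients across the whole backoff window, which is why it reduces load spikes compared with a fixed exponential delay.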
Writing numeric values into Google BigQuery tables
Google BigQuery tables support columns with the NUMERIC data type to allow storing big decimal numbers (up to 38 digits with nine decimal places). However, Snaps in the Google BigQuery Snap Pack that load data into tables cannot create numeric columns. When the Create table if not present checkbox is selected, the Snaps create the required table schema but map big decimals to a FLOAT64 column. Therefore, to store data in numeric columns using these Snaps, we recommend the following actions:
- Create the required schema, beforehand, with numeric columns in Google BigQuery.
- Pass the number as a string.
The Google API converts this string into a number with full precision and saves it in the numeric column.
Example:
| Value Passed Through Snap | Value Stored in BigQuery | Remarks |
|---|---|---|
| "12345678901234567890123456789.123456789" | 12345678901234567890123456789.123456789 | As per this issue logged in the Google Issue Tracker, values sent as strings are never converted to floating-point form, so this works as expected. |
| 12345678901234567890123456789.123456789 | 123456789012345678000000000000 | Big decimal values sent as non-string values lose precision. |
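The precision loss in the second row can be demonstrated with Python's `decimal` module; this sketch is illustrative and independent of the Snaps:

```python
from decimal import Decimal

big = "12345678901234567890123456789.123456789"

# Passed along as a string, the value keeps its full precision.
exact = Decimal(big)

# Round-tripped through a 64-bit float, the value is reduced to
# roughly 15-17 significant digits and the fraction is lost.
rounded = Decimal(float(big))

print(exact == rounded)  # False: the float round-trip lost precision
```

This is the same effect as sending a non-string big decimal to the API: the intermediate float representation cannot hold 38 digits.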
Troubleshooting
Mismatch in the order of columns/number of columns between incoming document/existing Google Cloud Storage files in CSV format:
- Incoming documents: Use the JSON or AVRO format for the temp file in the BigQuery Bulk Load (Cloud Storage) Snap.
- Existing files from Google Cloud Storage: Read the file using a File Reader Snap followed by a CSV Parser Snap, then use the BigQuery Bulk Load (Cloud Storage) Snap with Upload type set to Upload incoming documents and File format set to JSON or AVRO.
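As an alternative to the Snap-based workarounds, a CSV file can also be pre-processed outside the pipeline so that its columns match the destination table. In the sketch below, the `TABLE_COLUMNS` schema and the `reorder_csv` helper are hypothetical; it reorders a headered CSV and writes empty values for columns missing from the source:

```python
import csv
import io

# Destination table's column order (hypothetical example schema).
TABLE_COLUMNS = ["id", "name", "created_at"]

def reorder_csv(src_text: str) -> str:
    """Rewrite a headered CSV so its columns match TABLE_COLUMNS,
    emitting empty strings for columns absent from the source."""
    reader = csv.DictReader(io.StringIO(src_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=TABLE_COLUMNS)
    writer.writeheader()
    for row in reader:
        writer.writerow({col: row.get(col, "") for col in TABLE_COLUMNS})
    return out.getvalue()

print(reorder_csv("name,id\nalice,1\n"))
```

The rewritten file then satisfies the CSV constraint described above (all destination columns, in table order) and can be uploaded as an existing file.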