Deduplicate

Overview

You can use this Snap to remove duplicate records from input documents. When you use multiple matching criteria to deduplicate your data, the Snap evaluates each criterion separately and then aggregates the results. The Snap treats fields that contain empty strings or only whitespace as missing data and ignores them.


Deduplicate Snap Overview

Prerequisites

None.

Limitations and known issues

None.

Snap views

View Description Examples of upstream and downstream Snaps
Input This Snap supports one document input view. It accepts documents whose data may contain duplicate records.
Output This Snap supports up to two document output views.
  • First output view: A document containing the deduplicated records.
  • Second output view: A document containing the duplicate records.
Error

Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter when running the pipeline by choosing one of the following options from the When errors occur list under the Views tab. The available options are:

  • Stop Pipeline Execution: Stops the current pipeline execution when an error occurs.
  • Discard Error Data and Continue: Ignores the error, discards that record, and continues with the remaining records.
  • Route Error Data to Error View: Routes the error data to an error view without stopping the Snap execution.

Learn more about Error handling in Pipelines.

Snap settings

Legend:
  • Expression icon (): Enables SnapLogic expressions (JavaScript syntax) to set field values dynamically (if enabled). If disabled, you can provide a static value. Learn more.
  • SnapGPT (): Generates SnapLogic expressions from natural language using SnapGPT. Learn more.
  • Suggestion icon (): Dynamically populates a list of values based on your Account configuration.
  • Upload icon (): Uploads files. Learn more.
Learn more about the icons in the Snap settings dialog.
Field / field set Type Description
Label String

Required. Specify a unique name for the Snap. Modify the default name, especially when the pipeline contains more than one Snap of the same type.

Default value: Deduplicate

Example: Deduplicate address lines
Threshold Decimal

Required. Specify the minimum confidence level required for documents to be considered duplicates based on the matching criteria.

Minimum value: 0

Maximum value: 1

Default value: 0.8

Example: 0.95
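
The documentation above does not spell out how the Snap combines per-criterion results into a single confidence. As a rough illustration only, the sketch below uses a naive-Bayes combination, a common approach in record-linkage engines; the function names are hypothetical and the Snap's actual aggregation may differ.

```python
def combine_probabilities(probs):
    """Naive-Bayes combination of independent per-field match probabilities.

    This is an illustrative formula, not the Snap's documented internals.
    """
    p_match, p_nonmatch = 1.0, 1.0
    for p in probs:
        p_match *= p
        p_nonmatch *= (1.0 - p)
    return p_match / (p_match + p_nonmatch)

THRESHOLD = 0.8  # the Snap's default Threshold value

# Two fields agree strongly, one only weakly:
confidence = combine_probabilities([0.95, 0.9, 0.4])
is_duplicate = confidence >= THRESHOLD
```

Under this kind of scheme, strong agreement on a few fields can push the combined confidence above the Threshold even when one field matches poorly.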

Confidence Checkbox

Select this checkbox to include the confidence level of each match in the output.

Default status: Deselected

Group ID Checkbox

Select this checkbox to include the group ID for each record in the output.

Default status: Deselected

Matching Criteria

Use this field set to define the criteria for matching input documents.

Field JSONPath

The field in the input dataset that you want to use for matching and identifying duplicates.

Default value: None.

Example: $name

Cleaner String

Select the cleaner that you want to use on the selected fields.

Important:

A cleaner simplifies comparison by removing variations in the data that are unlikely to indicate genuine differences. For example, a cleaner might strip everything except digits from a ZIP code, or normalize and lowercase text.

Depending on the nature of the data in the identified input fields, you can select the kind of cleaner you want to use from the options available:
  • None
  • Text
  • Number
  • Date Time

Default value: None.

Example: Text
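
To make the cleaner behavior concrete, here is a hedged sketch of what Text and Number cleaners typically do in record-linkage tools. The function names are illustrative; the Snap's exact normalization rules are not documented here and may differ.

```python
import re

def text_cleaner(value: str) -> str:
    """Illustrative Text cleaner: lowercase, trim, collapse whitespace."""
    return re.sub(r"\s+", " ", value.strip().lower())

def number_cleaner(value: str) -> str:
    """Illustrative Number cleaner: keep digits only,
    e.g. strip everything except digits from a ZIP code."""
    return re.sub(r"\D", "", value)

text_cleaner("  Main   STREET ")   # -> "main street"
number_cleaner("ZIP 94402-1234")   # -> "944021234"
```

Cleaning both sides of a comparison this way prevents formatting differences from masking genuine matches.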

Comparator Dropdown list
Important:

A comparator compares two values and produces a similarity score: a number ranging from 0 (completely different) to 1 (exactly equal).

Choose the comparator that you want to use on the selected fields from the dropdown list:
  • Levenshtein: Calculates the least number of edit operations (additions, deletions, and substitutions) required to change one string into another.
  • Weighted Levenshtein: Calculates the least number of edit operations (additions, deletions, and substitutions) required to change one string into another. Each type of symbol has a different weight: number has the highest weight, while punctuation has the lowest weight. This makes "Main Street 12" very different from "Main Street 14", while "Main Street 12" is quite similar to "MainStreet12".
  • Longest Common Substring: Identifies the longest string that is a substring of both strings.
  • Q-Grams: Breaks a string into a set of consecutive symbols; for example, 'abc' is broken into a set containing 'ab' and 'bc'. Then, the ratio of the overlapping part is calculated.
  • Exact: Classifies a pair as either an exact match or no match at all. An exact match is assigned a score equal to the value in High; otherwise, the score equals the value in Low.
  • Soundex: Compares strings by converting them into Soundex codes. These codes begin with the first letter of the name, followed by a three-digit code that represents the first three remaining consonants. The letters A, E, I, O, U, Y, H, and W are not coded. Thus, the names 'Mathew' and 'Matthew' would generate the same Soundex code: M-300. This enables you to quickly identify strings that refer to the same person or place, but have variations in their spelling.
  • Metaphone: Similar to Soundex, but improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding.
  • Numeric: Calculates the ratio of the smaller number to the greater number.
  • Date Time: Computes the difference between two date-time values and produces a similarity score ranging from 0.0 (completely different) to 1.0 (exactly equal). This comparator requires data in epoch format. If the date-time data in your dataset is not in epoch format, select Date Time in the Cleaner field to convert it to epoch format.

Default value: Levenshtein

Example: Metaphone
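
The following sketches show textbook formulations of two of the comparators described above; the Snap's implementations may differ in detail, and the function names are illustrative.

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """1 - edit_distance / max_length, so 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Ratio of overlapping q-grams; e.g. 'abc' -> {'ab', 'bc'}."""
    grams_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    grams_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

For example, `levenshtein_similarity("kitten", "sitting")` needs three edits over a maximum length of seven characters, giving roughly 0.57, while `qgram_similarity("abc", "abd")` shares one of two bigrams on each side, giving 0.5.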

Low Decimal A decimal value representing the probability that the input documents match when the specified fields are completely different.
Important: If this value is left empty, a value of 0.3 is applied automatically.

Default value: None.

Example: 0.1

High Decimal A decimal value representing the probability that the input documents match when the specified fields match exactly.
Important: If this value is left empty, a value of 0.95 is applied automatically.

Default value: None.

Example: 0.8
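
One simple way to see how Low and High bound a field's contribution is linear interpolation: a raw comparator score of 0 maps to Low and a score of 1 maps to High. This is an assumption for illustration; the Snap may use a different mapping internally.

```python
def field_probability(similarity: float,
                      low: float = 0.3,
                      high: float = 0.95) -> float:
    """Map a comparator's raw similarity (0..1) into the [Low, High] band.

    Illustrative linear interpolation; defaults mirror the documented
    fallback values (Low 0.3, High 0.95).
    """
    return low + similarity * (high - low)

field_probability(0.0)   # -> 0.3  (completely different fields)
field_probability(1.0)   # -> 0.95 (exact match)
```

Setting Low above 0 keeps a single mismatched field from vetoing a record pair, while capping High below 1 keeps a single matching field from guaranteeing a duplicate.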

Minimum memory (MB) Integer/Expression Specify the minimum amount of memory that must be available for the Snap to process documents. If the available memory is less than the specified value, the Snap stops execution and displays an exception to prevent the system from running out of memory.
  • This feature is disabled if this value is 0.
  • A lint message for the available memory and free disk space is displayed in the Pipeline Execution Statistics.

Default value: 200

Example: 1000

Minimum free disk space (MB) Integer/Expression Specify the minimum free disk space required for the Snap to execute. If the free disk space is less than the specified value, the Snap stops execution and displays an exception to prevent the system from running out of disk space.
  • This feature is disabled if this value is 0.
  • A lint message for the available memory and free disk space is displayed in the Pipeline Execution Statistics.

Default value: 200

Example: 1000

Snap execution Dropdown list
Select one of the three modes in which the Snap executes. Available options are:
  • Validate & Execute: Performs limited execution of the Snap and generates a data preview during pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during pipeline runtime.
  • Execute only: Performs full execution of the Snap during pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Default value: Validate & Execute

Example: Execute only

Temporary files

During execution, data processing on Snaplex nodes occurs principally in memory, as unencrypted streams. When processing larger datasets that exceed the available memory, the Snap writes unencrypted pipeline data to local storage to optimize performance. These temporary files are deleted when the pipeline execution completes. You can configure the location of this temporary data in the Global properties table of the Snaplex node properties, which can also help avoid pipeline errors caused by a lack of space. Learn more about Temporary Folder in Configuration Options.