Deduplicate

Overview

You can use this Snap to remove duplicate records from input documents. When you use multiple matching criteria to deduplicate your data, the Snap evaluates each criterion separately and then aggregates the results. The Snap treats fields that contain empty strings or only whitespace as missing data and ignores them.


Deduplicate Snap Overview

Prerequisites

None.

Limitations and known issues

None.

Snap views

View Description Examples of upstream and downstream Snaps
Input This Snap supports one document input view. It accepts documents whose data may contain duplicate records.
Output This Snap supports up to two document output views.
  • First output view: A document containing the deduplicated records.
  • Second output view: A document containing the duplicate records.
Error

Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter when running the pipeline by choosing one of the following options from the When errors occur list under the Views tab. The available options are:

  • Stop Pipeline Execution: Stops the current pipeline execution when an error occurs.
  • Discard Error Data and Continue: Ignores the error, discards that record, and continues with the remaining records.
  • Route Error Data to Error View: Routes the error data to an error view without stopping the Snap execution.

Learn more about Error handling in Pipelines.

Snap settings

Legend:
  • Expression icon (): Enables SnapLogic expressions (JavaScript syntax) to set field values dynamically (if enabled). If disabled, you can provide a static value. Learn more.
  • SnapGPT (): Generates SnapLogic expressions from natural language using SnapGPT. Learn more.
  • Suggestion icon (): Dynamically populates a list of values based on your Account configuration.
  • Upload icon (): Uploads files. Learn more.
Learn more about the icons in the Snap settings dialog.
Field / field set Type Description
Label String

Required. Specify a unique name for the Snap. Modify the default name, especially when the pipeline contains more than one Snap of the same type.

Default value: Deduplicate

Example: Deduplicate address lines
Threshold Decimal

Required. Specify the minimum confidence level required for documents to be considered duplicates based on the matching criteria.

Minimum value: 0

Maximum value: 1

Default value: 0.8

Example: 0.95
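
The documentation above does not spell out how the Snap combines per-criterion results into a single confidence. As a rough illustration only, the sketch below uses a naive-Bayes combination, a common approach in record-linkage engines; the function names are hypothetical and the Snap's actual aggregation may differ.

```python
def combine_probabilities(probs):
    """Naive-Bayes combination of independent per-field match probabilities.

    This is an illustrative formula, not the Snap's documented internals.
    """
    p_match, p_nonmatch = 1.0, 1.0
    for p in probs:
        p_match *= p
        p_nonmatch *= (1.0 - p)
    return p_match / (p_match + p_nonmatch)

THRESHOLD = 0.8  # the Snap's default Threshold value

# Two fields agree strongly, one only weakly:
confidence = combine_probabilities([0.95, 0.9, 0.4])
is_duplicate = confidence >= THRESHOLD
```

Under this kind of scheme, strong agreement on a few fields can push the combined confidence above the Threshold even when one field matches poorly.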

Confidence Checkbox

Select this checkbox to include the confidence level of each match in the output.

Default status: Deselected

Group ID Checkbox

Select this checkbox to include the group ID for each record in the output.

Default status: Deselected

Matching Criteria

Use this field set to define the criteria for matching input documents.

Field JSONPath

The field in the input dataset that you want to use for matching and identifying duplicates.

Default value: None.

Example: $name

Cleaner String

Select the cleaner that you want to use on the selected fields.

Important:

A cleaner simplifies comparison by removing variations in the data that are unlikely to indicate genuine differences. For example, a cleaner might strip everything except digits from a ZIP code, or normalize and lowercase text.

Depending on the nature of the data in the identified input fields, you can select the kind of cleaner you want to use from the options available:
  • None
  • Text
  • Number
  • Date Time

Default value: None.

Example: Text
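
To make the cleaner behavior concrete, here is a hedged sketch of what Text and Number cleaners typically do in record-linkage tools. The function names are illustrative; the Snap's exact normalization rules are not documented here and may differ.

```python
import re

def text_cleaner(value: str) -> str:
    """Illustrative Text cleaner: lowercase, trim, collapse whitespace."""
    return re.sub(r"\s+", " ", value.strip().lower())

def number_cleaner(value: str) -> str:
    """Illustrative Number cleaner: keep digits only,
    e.g. strip everything except digits from a ZIP code."""
    return re.sub(r"\D", "", value)

text_cleaner("  Main   STREET ")   # -> "main street"
number_cleaner("ZIP 94402-1234")   # -> "944021234"
```

Cleaning both sides of a comparison this way prevents formatting differences from masking genuine matches.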

Comparator Dropdown list
Important:

A comparator compares two values and produces a similarity score: a number ranging from 0 (completely different) to 1 (exactly equal).

Choose the comparator that you want to use on the selected fields from the dropdown list:
  • Levenshtein: Calculates the least number of edit operations (additions, deletions, and substitutions) required to change one string into another.
  • Weighted Levenshtein: Calculates the least number of edit operations (additions, deletions, and substitutions) required to change one string into another. Each type of symbol has a different weight: number has the highest weight, while punctuation has the lowest weight. This makes "Main Street 12" very different from "Main Street 14", while "Main Street 12" is quite similar to "MainStreet12".
  • Longest Common Substring: Identifies the longest string that is a substring of both strings.
  • Q-Grams: Breaks a string into a set of consecutive symbols; for example, 'abc' is broken into a set containing 'ab' and 'bc'. Then, the ratio of the overlapping part is calculated.
  • Exact: Classifies a pair as either an exact match or no match at all. An exact match is assigned a score equal to the value in High; otherwise, the score equals the value in Low.
  • Soundex: Compares strings by converting them into Soundex codes. These codes begin with the first letter of the name, followed by a three-digit code that represents the first three remaining consonants. The letters A, E, I, O, U, Y, H, and W are not coded. Thus, the names 'Mathew' and 'Matthew' would generate the same Soundex code: M-300. This enables you to quickly identify strings that refer to the same person or place, but have variations in their spelling.
  • Metaphone: Similar to Soundex, but improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding.
  • Numeric: Calculates the ratio of the smaller number to the greater number.
  • Date Time: Computes the difference between two date-time values and produces a similarity score ranging from 0.0 (completely different) to 1.0 (exactly equal). This comparator requires data in epoch format. If the date-time data in your dataset is not in epoch format, select Date Time in the Cleaner field to convert it to epoch format.

Default value: Levenshtein

Example: Metaphone
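
The following sketches show textbook formulations of two of the comparators described above; the Snap's implementations may differ in detail, and the function names are illustrative.

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """1 - edit_distance / max_length, so 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Ratio of overlapping q-grams; e.g. 'abc' -> {'ab', 'bc'}."""
    grams_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    grams_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

For example, `levenshtein_similarity("kitten", "sitting")` needs three edits over a maximum length of seven characters, giving roughly 0.57, while `qgram_similarity("abc", "abd")` shares one of two bigrams on each side, giving 0.5.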

Low Decimal A decimal value representing the probability that the input documents match when the specified fields are completely different.
Important: If this value is left empty, a value of 0.3 is applied automatically.

Default value: None.

Example: 0.1

High Decimal A decimal value representing the probability that the input documents match when the specified fields match exactly.
Important: If this value is left empty, a value of 0.95 is applied automatically.

Default value: None.

Example: 0.8
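
One simple way to see how Low and High bound a field's contribution is linear interpolation: a raw comparator score of 0 maps to Low and a score of 1 maps to High. This is an assumption for illustration; the Snap may use a different mapping internally.

```python
def field_probability(similarity: float,
                      low: float = 0.3,
                      high: float = 0.95) -> float:
    """Map a comparator's raw similarity (0..1) into the [Low, High] band.

    Illustrative linear interpolation; defaults mirror the documented
    fallback values (Low 0.3, High 0.95).
    """
    return low + similarity * (high - low)

field_probability(0.0)   # -> 0.3  (completely different fields)
field_probability(1.0)   # -> 0.95 (exact match)
```

Setting Low above 0 keeps a single mismatched field from vetoing a record pair, while capping High below 1 keeps a single matching field from guaranteeing a duplicate.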

Minimum memory (MB) Integer/Expression Specify the minimum amount of memory that must be available for the Snap to process documents. If the available memory is less than the specified value, the Snap stops execution and displays an exception to prevent the system from running out of memory.
  • This feature is disabled if this value is 0.
  • A lint message for the available memory and free disk space is displayed in the Pipeline Execution Statistics.

Default value: 200

Example: 1000

Minimum free disk space (MB) Integer/Expression Specify the minimum free disk space required for the Snap to execute. If the free disk space is less than the specified value, the Snap stops execution and displays an exception to prevent the system from running out of disk space.
  • This feature is disabled if this value is 0.
  • A lint message for the available memory and free disk space is displayed in the Pipeline Execution Statistics.

Default value: 200

Example: 1000

Snap execution Dropdown list
Select one of the three modes in which the Snap executes. Available options are:
  • Validate & Execute: Performs limited execution of the Snap and generates a data preview during pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during pipeline runtime.
  • Execute only: Performs full execution of the Snap during pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Default value: Validate & Execute

Example: Execute only

Temporary files

During execution, data processing on Snaplex nodes occurs principally in memory, as unencrypted streams. When processing larger datasets that exceed the available memory, the Snap writes unencrypted pipeline data to local storage to optimize performance. These temporary files are deleted when the pipeline execution completes. You can configure the location of this temporary data in the Global properties table of the Snaplex node properties, which can also help avoid pipeline errors caused by a lack of space. Learn more about Temporary Folder in Configuration Options.