Match

Identify matching records across datasets that do not have a common key field.

Overview

This Snap performs record linkage: it automatically identifies documents from different data sources (input views) that may represent the same entity, even when the datasets do not have a common key field.

Note: This Snap uses Duke, which is a library for performing record linkage and deduplication, implemented on top of Apache Lucene.


Prerequisites

None.

Limitations and known issues

None.

Snap views

View Description Examples of upstream and downstream Snaps
Input This Snap has exactly two document inputs.
  1. The first dataset that must be matched with the second dataset.
  2. The second dataset that must be matched with the first dataset.
Output This Snap has at most three document output views.
  1. First Output: The matched documents and, optionally, the confidence level associated with the matching.
  2. Second Output: Optional. Unmatched documents from the first dataset.
  3. Third Output: Optional. Unmatched documents from the second dataset.
Error

Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter when running the pipeline by choosing one of the following options from the When errors occur list under the Views tab. The available options are:

  • Stop Pipeline Execution Stops the current pipeline execution when an error occurs.
  • Discard Error Data and Continue Ignores the error, discards that record, and continues with the remaining records.
  • Route Error Data to Error View Routes the error data to an error view without stopping the Snap execution.

Learn more about Error handling in Pipelines.

Snap settings

Legend:
  • Expression icon: JavaScript syntax to access SnapLogic Expressions to set field values dynamically (if enabled). If disabled, you can provide a static value. Learn more.
  • SnapGPT: Generates SnapLogic Expressions based on natural language using SnapGPT. Learn more.
  • Suggestion icon: Populates a list of values dynamically based on your Account configuration.
  • Upload: Uploads files. Learn more.
Learn more about the icons in the Snap settings dialog.
Field / field set Type Description
Label String

Required. Specify a unique name for the Snap. Change it to something more descriptive, especially if the pipeline contains more than one Match Snap.

Default value: Match

Example: Match string values

Threshold String/Expression

Required. The minimum confidence required for documents to be considered matched.

Minimum value: 0

Maximum value: 1

Default value: 0.8

Confidence Checkbox

Required. Select this check box to include the confidence level of each match in the output.

Default status: Deselected

Match all Checkbox

Required. Select this check box to match one record from the first input with multiple records in the second input. Otherwise, the Snap matches the first record of the second input with the first record of the first input.

Default status: Deselected

Matching Criteria

Enables you to specify the settings that you want to use to perform the matching between the two input datasets.

Left field String/Suggestion

The field in the first dataset that you want to use for matching. This property is a JSONPath.

Default value: N/A

Example: $name

Right field String/Suggestion

The field in the second dataset that you want to use for matching. This property is a JSONPath.

Default value: N/A

Example: $country

Cleaner String/Expression/Suggestion
Select the cleaner that you want to use on the selected fields. Depending on the nature of the data in the identified input fields, you can select the kind of cleaner you want to use from the options available:
  • None
  • Text
  • Number
  • Date Time

Default value: None

Example: Text

Comparator String/Suggestion
Important: A comparator compares two values and produces a similarity score that ranges from 0 (completely different) to 1 (exactly equal).

Choose the comparator that you want to use on the selected fields, from the drop-down list:
  • Levenshtein: Calculates the least number of edit operations (insertions, deletions, and substitutions) required to change one string into another.
  • Weighted Levenshtein: Calculates the least number of edit operations (insertions, deletions, and substitutions) required to change one string into another, with a different weight for each type of symbol: digits have the highest weight, while punctuation has the lowest. This makes "Main Street 12" very different from "Main Street 14", while "Main Street 12" is quite similar to "MainStreet12".
  • Longest Common Substring: Identifies the longest string that is a substring of both strings.
  • Q-Grams: Breaks a string into a set of consecutive symbols; for example, 'abc' is broken into a set containing 'ab' and 'bc'. Then, the ratio of the overlapping part is calculated.
  • Exact: Identifies and classifies a match as either an exact match or not a match at all. An exact match assigns a score that equals the value in High. Else, it assigns a score that equals the value in Low.
  • Soundex: Compares strings by converting them into Soundex codes. These codes begin with the first letter of the name, followed by a three-digit code that represents the first three remaining consonants. The letters A, E, I, O, U, Y, H, and W are not coded. Thus, the names 'Mathew' and 'Matthew' would generate the same Soundex code: M-300. This enables you to quickly identify strings that refer to the same person or place, but have variations in their spelling.
  • Metaphone: Similar to Soundex, but improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding.
  • Numeric: Calculates the ratio of the smaller number to the greater.
  • Date Time: Computes the difference between two date-time values and produces a similarity measure ranging from 0.0 (completely different) to 1.0 (exactly equal). This comparator requires data in epoch format. If the date-time data in your dataset is not in epoch format, you must select Date Time in the Cleaner field to convert it into epoch format.

Default value: Levenshtein

Example: Metaphone
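
To make the comparator descriptions above concrete, the following Python sketch shows how a few of them could compute a similarity between 0 and 1. It is an illustration only, not the Snap's or Duke's actual implementation. All function names are hypothetical, and two details are assumptions: the Levenshtein score is normalized by the length of the longer string, and the Q-Grams overlap is computed as a Jaccard ratio with q = 2.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Least number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1 (normalizing by the longer string is an assumption)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def qgrams(text: str, q: int = 2) -> set:
    """Set of consecutive q-character pieces, e.g. 'abc' -> {'ab', 'bc'}."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Overlap ratio of the two q-gram sets (Jaccard ratio is an assumption)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def numeric_similarity(x: float, y: float) -> float:
    """Ratio of the smaller number to the greater."""
    if x == y:
        return 1.0
    if x == 0 or y == 0 or (x < 0) != (y < 0):
        return 0.0
    small, large = sorted((abs(x), abs(y)))
    return small / large


# Standard Soundex digit groups; vowels, Y, H, and W have no code.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"), "L": "4",
    **dict.fromkeys("MN", "5"), "R": "6",
}


def soundex(name: str) -> str:
    """First letter plus three digits, padded with zeros (written without a hyphen)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    last = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != last:
            code += digit
        if c not in "HW":  # H and W do not separate repeated codes
            last = digit
    return (code + "000")[:4]


print(levenshtein_similarity("Main Street 12", "Main Street 14"))
print(qgram_similarity("abc", "abd"))
print(numeric_similarity(50, 100))
print(soundex("Mathew"), soundex("Matthew"))
```

Under these assumptions, plain Levenshtein rates "Main Street 12" and "Main Street 14" as highly similar because only one character differs, which is the situation the Weighted Levenshtein comparator is meant to handle differently.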

Low String/Expression

Enter a decimal value representing the probability that the records match if the specified fields are completely different.

Default value: N/A

Example: 0.1

High String/Expression

Enter a decimal value representing the probability that the records match if the specified fields are an exact match.

Default value: N/A

Example: 0.8

Note: If this value is left empty, a value of 0.95 is applied automatically.
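
As a rough illustration of how Low, High, and Threshold could interact, the Python sketch below maps each field's comparator similarity to a probability between Low and High, combines the per-field probabilities with a naive Bayes rule, and compares the result with the threshold. The mapping formula, the combination rule, and all function names are assumptions made for this example; they are not documented behavior of the Snap. The only values taken from this page are the 0.95 fallback for an empty High field and the default threshold of 0.8.

```python
from typing import Optional, Sequence


def field_probability(similarity: float, low: float, high: Optional[float] = None) -> float:
    """Map a comparator similarity (0..1) to a match probability for one field.

    Assumed mapping: similarities below 0.5 yield Low; higher similarities
    scale between 0.5 and High. An empty High falls back to 0.95.
    """
    if high is None:
        high = 0.95
    if similarity >= 0.5:
        return (high - 0.5) * similarity ** 2 + 0.5
    return low


def combine(p1: float, p2: float) -> float:
    """Naive Bayes combination of two independent probabilities (an assumption)."""
    denominator = p1 * p2 + (1 - p1) * (1 - p2)
    if denominator == 0:
        return 0.5  # contradictory certainties (0 and 1); treat as undecided
    return (p1 * p2) / denominator


def match_confidence(field_probabilities: Sequence[float]) -> float:
    """Fold all per-field probabilities into one confidence, starting from 0.5."""
    confidence = 0.5
    for probability in field_probabilities:
        confidence = combine(confidence, probability)
    return confidence


def is_match(field_probabilities: Sequence[float], threshold: float = 0.8) -> bool:
    """A pair counts as matched when its confidence reaches the threshold."""
    return match_confidence(field_probabilities) >= threshold


# Two matching criteria, each with Low = 0.1 and High = 0.8.
both_exact = [field_probability(1.0, 0.1, 0.8), field_probability(1.0, 0.1, 0.8)]
one_differs = [field_probability(1.0, 0.1, 0.8), field_probability(0.2, 0.1, 0.8)]

print(match_confidence(both_exact), is_match(both_exact))
print(match_confidence(one_differs), is_match(one_differs))
```

In this sketch, a single field that is completely different pulls the overall confidence well below the threshold even when the other field matches exactly, which is why the Low value of each matching criterion deserves as much attention as its High value.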
Snap execution Dropdown list
Select one of the three modes in which the Snap executes. Available options are:
  • Validate & Execute: Performs limited execution of the Snap and generates a data preview during pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during pipeline runtime.
  • Execute only: Performs full execution of the Snap during pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Default value: Validate & Execute

Example: Execute only

Temporary files

During execution, data processing on Snaplex nodes occurs principally in-memory as streaming and is unencrypted. When processing larger datasets that exceed the available compute memory, the Snap writes unencrypted pipeline data to local storage to optimize performance. These temporary files are deleted when the pipeline execution completes. You can configure the location of the temporary data in the Global properties table of the Snaplex node properties, which can also help avoid pipeline errors caused by insufficient disk space. Learn more about Temporary Folder in Configuration Options.