Identify the common words in a dataset

This example pipeline demonstrates how to calculate the frequency of common words appearing in a dataset of tokenized sentences.

  1. Configure the JSON Generator Snap to pass your input data.
    The input data contains an array of tokens created using the Yelp dataset. Learn more about how to create similar arrays.
    Note: In this example, we use the JSON Generator Snap. However, you can replace the JSON Generator Snap with any Snap of your choice, such as the Chunker, Constant, File Reader, or S3 File Reader Snaps.

    JSON Generator Snap - Edit JSON

  2. Configure the Mapper Snap with the expression $text to retain only the text from the input document.
    On validation, the Snap displays the mapped data, which the Bag of Words Snap uses downstream.

    Mapper Snap Configuration


    Mapper Snap Output

  3. Configure the File Reader Snap to read the contents of the yelp_common_words.json file, which contains the top 100 common words from the Yelp dataset.
    The input data contains an array of tokens created using the Yelp dataset. To learn how to create similar arrays, see the example provided in the Tokenizer documentation. On validation, the Snap displays the contents of the yelp_common_words.json file.

    File Reader Snap Configuration


    File Reader Snap Output

  4. Configure the JSON Parser Snap to parse JSON data from the binary data input.
    On validation, the Snap outputs a document for use in the Bag of Words Snap.
  5. Configure the Bag of Words Snap with two input views:
    • Connect the first input view to the Mapper Snap to process the array of tokenized words in each sentence.
    • Connect the second input view to the JSON Parser Snap to use the array that lists the frequency of the 100 most common words for the same dataset.
    On validation, the Snap displays a detailed document showing how often each common word appears in each tokenized sentence.

    Bag of Words Snap Configuration


    Bag of Words Snap Output

    Note: After the data is generated, you can use Snaps such as the Filter and Aggregate Snaps for advanced processing. Further, you can use GenAI Builder to integrate machine learning models.
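
The steps above can be approximated outside SnapLogic with a short Python sketch. This is an illustration of the general bag-of-words technique, not the Snap's actual implementation. The sample documents, the field names other than text, and the shape of the common-words file (assumed here to be a plain JSON array of words) are all hypothetical; the real yelp_common_words.json may be structured differently.

```python
import json
from collections import Counter


def retain_text(document):
    """Mimic the Mapper step: keep only the "text" field of a document."""
    return {"text": document["text"]}


def bag_of_words(tokens, common_words):
    """Count how often each common word appears in one tokenized sentence."""
    counts = Counter(tokens)
    # Counter returns 0 for words that never appear in the sentence.
    return {word: counts[word] for word in common_words}


# Hypothetical tokenized sentences (stand-in for the JSON Generator input).
input_documents = [
    {"id": 1, "text": ["the", "food", "was", "good", "and", "the", "service", "was", "great"]},
    {"id": 2, "text": ["good", "place", "good", "food"]},
]

# Hypothetical common-words file content (assumed format: JSON array of words).
common_words_json = '["the", "food", "good", "service"]'
common_words = json.loads(common_words_json)  # stand-in for the JSON Parser step

# One output document per input sentence, as in the Bag of Words step.
results = [
    bag_of_words(retain_text(doc)["text"], common_words)
    for doc in input_documents
]

for row in results:
    print(row)
```

Each output row has one entry per common word, so every sentence is described by a vector of the same length regardless of how many tokens it contains.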
To successfully reuse pipelines:
  1. Download and import the pipeline into SnapLogic.
  2. Configure Snap accounts as applicable.
  3. Provide pipeline parameters as applicable.