Use Case: Sentiment analysis using SnapLogic data science

Overview

This use case demonstrates how SnapLogic Machine Learning (ML) Snaps can be used to perform sentiment analysis, identifying and categorizing opinions in text as positive, negative, or neutral.

This process helps you identify a writer's attitude toward a product, service, or topic.

Sentiment analysis allows you to:

Gauge customer satisfaction: Analyze customer interactions, such as support calls, tickets, and survey responses, to understand how satisfied customers are with your product or service.
Detect fraudulent reviews: Identify potentially fake reviews by comparing the sentiment of the review text with its star rating. Higher star ratings generally align with more positive sentiment, so a review whose text sentiment doesn't match its star rating can be flagged for closer inspection.
Gain insights from social media: Analyze data from social platforms like Twitter, Facebook, or Instagram. For instance, you can identify supporters and critics of a celebrity based on the sentiment of comments on their posts.

Use case structure

This use case is structured into key tasks that guide the process from data preparation to model hosting. Each section includes a functional overview and technical details to explain the pipeline components and their roles.

Dataset used

We use a subset of the Yelp dataset that contains user reviews. For simplicity, we include only 5-star (positive) and 1-star (negative) reviews to train a binary sentiment analysis model, so the model in this use case predicts only positive or negative sentiment.

Building the sentiment analysis model

To build this model, we perform the following high-level tasks:

Data Preparation: Process the Yelp data to retain only 1-star and 5-star reviews, balance the dataset using stratified sampling, and generate relevant statistics on word usage.
Cross Validation: Use multiple algorithms in k-fold cross validation to determine the best-performing model.
Model Building: Train the sentiment analysis model using the optimal algorithm.
Model Hosting: Deploy the model as an API accessible through an Ultra Task.
API Testing: Test the API to ensure it provides accurate sentiment predictions.

Data preparation pipeline

This pipeline processes Yelp reviews to retain only the 1-star and 5-star data, balances the dataset, and performs tokenization and word frequency analysis. The key Snaps used are as follows:

File Reader: Reads the Yelp dataset from SLFS (SnapLogic File System).
Filter: Retains only 1-star and 5-star reviews for the sentiment model.
Stratified Sampling: Balances the ratio of 1-star to 5-star reviews.
Mapper: Maps the review rating to sentiment (1-star as negative, 5-star as positive).
Tokenizer: Breaks reviews into words.
Common Words: Identifies the 200 most common words for analysis.
Bag of Words: Creates a vector of word frequencies.
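
The steps above can be sketched outside SnapLogic in plain Python. This is a minimal illustration, not the Snaps' actual implementation: the sample records, the field names (stars, text), the regular-expression tokenizer, and the choice to balance classes by downsampling to the smaller class are all assumptions made for the example.

```python
import random
import re
from collections import Counter

# Hypothetical sample records; the real pipeline reads Yelp reviews from SLFS.
reviews = [
    {"stars": 5, "text": "Great food and friendly staff"},
    {"stars": 1, "text": "Terrible service and cold food"},
    {"stars": 3, "text": "It was okay"},
    {"stars": 5, "text": "Great place, great coffee"},
    {"stars": 5, "text": "Loved the food"},
    {"stars": 1, "text": "Cold coffee and rude staff"},
]


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def prepare(records, top_n=200, seed=0):
    # Filter: keep only 1-star and 5-star reviews.
    kept = [r for r in records if r["stars"] in (1, 5)]

    # Balance: downsample every class to the size of the smallest class
    # (an assumption about how the classes are balanced).
    by_class = {}
    for r in kept:
        by_class.setdefault(r["stars"], []).append(r)
    n = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = [r for group in by_class.values() for r in rng.sample(group, n)]

    # Mapper + Tokenizer: map the rating to a sentiment label and split into words.
    labeled = [
        ("positive" if r["stars"] == 5 else "negative", tokenize(r["text"]))
        for r in balanced
    ]

    # Common Words: the top_n most frequent words form the vocabulary.
    counts = Counter(tok for _, toks in labeled for tok in toks)
    vocab = [word for word, _ in counts.most_common(top_n)]

    # Bag of Words: one frequency vector per review, ordered like vocab.
    rows = []
    for label, toks in labeled:
        freq = Counter(toks)
        rows.append({"sentiment": label, "vector": [freq[w] for w in vocab]})
    return vocab, rows


vocab, rows = prepare(reviews)
```

Each entry in rows pairs a sentiment label with a word-frequency vector whose positions correspond to the words in vocab.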

Cross validation pipeline

Two pipelines perform k-fold cross validation with multiple algorithms to determine the best fit:

Cross Validator (Child pipeline): Runs k-fold cross validation on individual algorithms.
Pipeline Execute (Parent pipeline): Automates cross validation by executing the child pipeline for various algorithms, then uses the Aggregate Snap to identify the most accurate one.
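
To make the parent/child split concrete, here is a small Python sketch of the same idea: one function runs k-fold cross validation for a single algorithm (the child), and another runs it for several algorithms and picks the most accurate (the parent plus aggregation). The two classifiers, the toy data, and all function names are hypothetical stand-ins chosen to keep the example self-contained; they are not the algorithms the Snaps offer.

```python
from statistics import mean


def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n rows."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def cross_validate(fit, X, y, k=5):
    """Child step: mean accuracy of one algorithm over k folds."""
    scores = []
    for train, test in k_fold_splits(len(X), k):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return mean(scores)


def fit_majority(X, y):
    """Toy baseline: always predict the most frequent training label."""
    label = max(set(y), key=y.count)
    return lambda x: label


def fit_nearest_centroid(X, y):
    """Toy classifier: predict the label whose mean vector is closest."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def predict(x):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )

    return predict


def best_algorithm(algorithms, X, y, k=5):
    """Parent step: cross-validate every algorithm and keep the most accurate."""
    results = {name: cross_validate(fit, X, y, k) for name, fit in algorithms.items()}
    return max(results, key=results.get), results


# Toy bag-of-words vectors: [count of a positive word, count of a negative word].
X = [[2, 0], [0, 2], [3, 0], [0, 3], [2, 1], [1, 2], [3, 1], [1, 3], [4, 0], [0, 4]]
y = ["positive", "negative"] * 5

winner, results = best_algorithm(
    {"majority": fit_majority, "nearest_centroid": fit_nearest_centroid}, X, y, k=5
)
```

The folds here are contiguous, so the rows should be shuffled or interleaved by class beforehand, as the toy data is.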

Model building pipeline

This pipeline trains a logistic regression model, the algorithm chosen based on the cross-validation results, and saves it to SLFS. The key Snaps used here are:

Trainer (Classification): Trains the logistic regression model.
JSON Formatter and File Writer: Saves the model in JSON format.
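
As a rough illustration of what training and saving involve, the sketch below fits a logistic regression with plain stochastic gradient descent and serializes the result as JSON. The learning rate, epoch count, toy data, and the JSON layout (algorithm, weights, bias) are assumptions for this example; the format the Trainer Snap actually writes may differ.

```python
import json
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights and bias with stochastic gradient descent on log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = p - target  # gradient of log loss with respect to z
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return {"algorithm": "logistic_regression", "weights": weights, "bias": bias}


def predict_proba(model, x):
    """Probability that x belongs to the positive class (label 1)."""
    z = model["bias"] + sum(w * v for w, v in zip(model["weights"], x))
    return sigmoid(z)


def serialize_model(model):
    """Format the model as JSON text (hypothetical layout)."""
    return json.dumps(model)


def deserialize_model(text):
    return json.loads(text)


# Toy bag-of-words vectors; 1 = positive, 0 = negative.
X = [[2, 0], [0, 2], [3, 1], [1, 3]]
y = [1, 0, 1, 0]

model = train_logistic_regression(X, y)
model_json = serialize_model(model)
```

Writing model_json to a file gives an artifact that a separate prediction step can load later.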

Model hosting pipeline

This pipeline is deployed as an Ultra Task to provide a REST API for external applications to access the sentiment analysis model. Key components are:

Predictor (Classification): Hosts the trained model and performs sentiment predictions.
Filter: Receives new requests for analysis.
Tokenizer and Bag of Words: Prepares text data for analysis.
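
The request flow through these components might look like the following Python sketch. Everything specific here is assumed for illustration: the request field name (text), the rule that requests without text are dropped, the three-word vocabulary, the model weights, and the shape of the response.

```python
import math
import re
from collections import Counter

# Hypothetical trained model: a tiny vocabulary with logistic regression weights.
MODEL = {
    "vocab": ["great", "terrible", "food"],
    "weights": [1.5, -2.0, 0.1],
    "bias": 0.0,
}


def handle_request(request, model=MODEL):
    """Turn one incoming request into a sentiment prediction."""
    # Filter: skip requests that carry no text (assumed filtering rule).
    text = request.get("text")
    if not text:
        return None

    # Tokenizer and Bag of Words: build the same feature vector used in training.
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    vector = [counts[word] for word in model["vocab"]]

    # Predictor: apply the logistic regression model.
    z = model["bias"] + sum(w * v for w, v in zip(model["weights"], vector))
    probability = 1.0 / (1.0 + math.exp(-z))
    label = "positive" if probability >= 0.5 else "negative"
    return {"text": text, "prediction": label, "probability": probability}
```

The important design point is that prediction reuses the same tokenization and vocabulary as training; otherwise the vector positions would not line up with the model's weights.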

API testing pipeline

This pipeline tests the Ultra Task API by sending a sample sentiment analysis request and displaying the response. The key Snaps used are as follows:

JSON Generator: Generates a sample request with text and token.
REST Post: Sends the request to the Ultra Task API.
Mapper: Extracts the sentiment prediction from the response.
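
An external client could exercise the API in the same way. The sketch below uses only Python's standard library; the URL, the token value, the request field names (text, token), and the response field (prediction) are placeholders assumed for this example, so substitute the values from your own Ultra Task.

```python
import json
import urllib.request

# Hypothetical placeholders; replace with your Ultra Task URL and token.
ULTRA_TASK_URL = "https://example.com/ultra/sentiment"
API_TOKEN = "replace-with-your-token"


def build_request(text, url=ULTRA_TASK_URL, token=API_TOKEN):
    """JSON Generator step: build a POST request carrying the text and token."""
    payload = json.dumps({"text": text, "token": token}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """REST Post step: send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def extract_prediction(response_body):
    """Mapper step: pull the prediction out of the response (assumed shape)."""
    document = json.loads(response_body)
    if isinstance(document, list):  # tolerate a one-element array response
        document = document[0]
    return document.get("prediction")
```

With a real endpoint configured, extract_prediction(send(build_request("The food was great"))) would return the predicted sentiment.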