Use Case: Predict customer churn
Overview
This use case demonstrates how to predict customer churn with machine learning so that service providers can identify at-risk customers and implement retention strategies.
Problem Scenario
Customer churn is a significant concern for service providers. High churn can lead to revenue loss and may signal service issues. By analyzing customer data, we aim to predict which customers are likely to leave and determine the primary factors influencing churn.
Dataset
We use a telecommunications dataset with 7043 customer records and 21 fields, covering demographics, service subscriptions, and a churn indicator. Of these records, 1869 (about 26.5%) are marked as churned.
Objectives
- Profiling: Use the Profile Snap from the ML Analytics Snap Pack to generate data statistics.
- Data Preparation: Prepare the dataset using Snaps from the ML Data Preparation Snap Pack.
- Cross Validation: Perform 10-fold cross-validation with various classification algorithms using Cross Validator (Classification).
- Model Building: Use the Trainer (Classification) Snap to build a logistic regression model.
- Model Hosting: Host the model as a REST API with Ultra Task.
- API Testing: Test the API with sample requests using REST Post.
- Visualization API: Create a visualization API to display selected data fields.
Profiling
The File Reader Snap loads the CSV dataset, and the CSV Parser Snap converts it into documents. A Type Converter Snap then assigns appropriate data types so that the Profile Snap can profile the data accurately; a policy converts the $SeniorCitizen field from numeric to categorical. The Profile Snap produces both data statistics and an interactive HTML profile.
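The profiling steps above can be illustrated outside SnapLogic with a minimal Python sketch. It is not how the Snaps are implemented; it only mirrors the flow of parsing, type conversion, and statistics. The three sample rows are made up, and the `Churn` column name is an assumption (the source only mentions a churn indicator).

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical three-row sample; the real dataset has 7043 rows and 21 fields.
SAMPLE_CSV = """customerID,SeniorCitizen,TotalCharges,Churn
0001-A,0,29.85,No
0002-B,1,1889.5,No
0003-C,0,,Yes
"""

# Fields parsed as numbers. SeniorCitizen is deliberately left out so that
# its 0/1 values are profiled as categories, mirroring the policy above.
NUMERIC_FIELDS = {"TotalCharges"}


def convert_types(row):
    """Convert numeric fields to float; blank values become None."""
    out = {}
    for key, value in row.items():
        if key in NUMERIC_FIELDS:
            out[key] = float(value) if value.strip() else None
        else:
            out[key] = value
    return out


def profile(rows):
    """Return per-field statistics: numeric summaries or category counts."""
    stats = {}
    for field in rows[0]:
        values = [r[field] for r in rows]
        if field in NUMERIC_FIELDS:
            present = [v for v in values if v is not None]
            stats[field] = {
                "type": "numeric",
                "missing": len(values) - len(present),
                "mean": statistics.mean(present),
                "min": min(present),
                "max": max(present),
            }
        else:
            stats[field] = {"type": "categorical", "counts": dict(Counter(values))}
    return stats


rows = [convert_types(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
profile_stats = profile(rows)
```

Here `profile_stats["SeniorCitizen"]` holds category counts rather than a mean, which is the effect the type policy is meant to achieve.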
Data preparation
Data preparation consists of removing the $customerID field and handling missing values in $TotalCharges. The Mapper Snap excludes $customerID, and the Clean Missing Values Snap replaces missing $TotalCharges values with the field's average. The statistics required for imputation are loaded from the profiling output.
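As a rough illustration of these two preparation steps, the following Python sketch drops the identifier and fills missing $TotalCharges values with a mean taken from previously computed statistics. The sample rows and the `profile_stats` layout are hypothetical; the Snaps' actual output format is not described in this document.

```python
# Hypothetical profiling output; in the pipeline this is loaded from the
# statistics produced during profiling.
profile_stats = {"TotalCharges": {"mean": 959.675}}

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"customerID": "0001-A", "SeniorCitizen": "0", "TotalCharges": 29.85},
    {"customerID": "0002-B", "SeniorCitizen": "1", "TotalCharges": 1889.5},
    {"customerID": "0003-C", "SeniorCitizen": "0", "TotalCharges": None},
]


def prepare(rows, stats):
    """Drop customerID and impute missing TotalCharges with the stored mean."""
    mean_total = stats["TotalCharges"]["mean"]
    prepared = []
    for row in rows:
        # Copy every field except the identifier, which carries no signal.
        clean = {k: v for k, v in row.items() if k != "customerID"}
        if clean.get("TotalCharges") is None:
            clean["TotalCharges"] = mean_total
        prepared.append(clean)
    return prepared


prepared = prepare(rows, profile_stats)
```

Taking the mean from stored statistics, rather than recomputing it from each incoming batch, keeps the imputed value identical at training time and at prediction time.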
Cross validation
The Cross Validator (Classification) Snap performs 10-fold cross-validation to measure the accuracy of multiple classification algorithms. A parent pipeline calls the child pipeline with different algorithm parameters, then captures and aggregates the accuracy results to identify the best algorithm. The baseline accuracy, obtained by always predicting no churn, is 73.5%, because 5174 of the 7043 customers did not churn.
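The baseline figure and the fold structure can be checked with a short Python sketch. The fold-splitting function is a generic illustration of k-fold partitioning, not the Snap's implementation, and it omits the shuffling that cross-validation typically applies first.

```python
TOTAL_RECORDS = 7043
CHURNED_RECORDS = 1869


def k_fold_indices(n_rows, k=10):
    """Yield (train_idx, test_idx) pairs; every row is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    indices = list(range(n_rows))
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Accuracy of a model that always predicts "no churn".
baseline_accuracy = (TOTAL_RECORDS - CHURNED_RECORDS) / TOTAL_RECORDS

folds = list(k_fold_indices(TOTAL_RECORDS, k=10))
test_sizes = [len(test_idx) for _, test_idx in folds]
```

An algorithm is only worth considering if its cross-validated accuracy exceeds `baseline_accuracy`, which rounds to 73.5%.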
Model building
A logistic regression model is built with the Trainer (Classification) Snap, using the processed dataset from the data preparation pipeline. The metadata and the serialized model are saved to the SnapLogic File System (SLFS).
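To show what building and serializing a logistic regression model involves, here is a minimal Python sketch that fits the model by batch gradient descent and stores it as JSON with some metadata. It is an assumption-laden illustration: the Trainer (Classification) Snap's training method and file format are not described here, and the toy data, feature name, and JSON layout are hypothetical.

```python
import json
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability of the positive class (churn) for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(features, labels, learning_rate=0.1, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_proba(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical toy data: one scaled feature; label 1 means churned.
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(X, y)

# Serialize the model together with metadata (hypothetical layout).
model_json = json.dumps({
    "algorithm": "logistic regression",
    "features": ["scaled_feature"],
    "weights": weights,
    "bias": bias,
})
restored = json.loads(model_json)
```

Storing the feature list next to the weights lets a later prediction step check that incoming data has the fields the model was trained on.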
Model hosting
The model is hosted as a REST API using Ultra Task. The Predictor (Classification) Snap applies the model to incoming data, with a Filter Snap for authentication and Mapper Snaps for request extraction and response preparation.
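The request flow described above (authenticate, extract, predict, respond) can be sketched as a plain Python function. This is only an analogy for the pipeline's Snaps: the token value, the request layout (`token`, `params`), the response field, and the stub predictor are all hypothetical.

```python
EXPECTED_TOKEN = "example-token"  # hypothetical shared secret


def stub_predictor(features):
    """Placeholder for the trained model; always answers 'No'."""
    return "No"


def handle_request(request, predict=stub_predictor):
    """Process one request document, or return None if it is rejected."""
    # Filter step: drop requests that do not carry the expected token.
    if request.get("token") != EXPECTED_TOKEN:
        return None
    # Mapper step: extract the model inputs from the request body.
    features = request.get("params", {})
    # Predictor step: apply the model to the extracted features.
    label = predict(features)
    # Mapper step: shape the response document.
    return {"prediction": label}


accepted = handle_request({"token": "example-token", "params": {"TotalCharges": 29.85}})
rejected = handle_request({"token": "wrong", "params": {}})
```

Keeping authentication ahead of extraction and prediction means unauthenticated requests never reach the model.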
API testing
The hosted API is tested by sending sample requests with the REST Post Snap.
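A sample request of the kind sent during API testing could also be built in Python, as sketched below. The URL, token, and request fields are placeholders rather than values from this use case, and the request is only constructed, not sent.

```python
import json
import urllib.request

API_URL = "https://example.com/api/churn"  # hypothetical endpoint


def build_request(url, token, params):
    """Build a JSON POST request carrying a token and the model inputs."""
    body = json.dumps({"token": token, "params": params}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(API_URL, "example-token", {"TotalCharges": 29.85})

# To send the request against a real endpoint:
# with urllib.request.urlopen(request, timeout=10) as response:
#     result = json.loads(response.read())
```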
Visualization API
A visualization API is implemented with the Remote Python Script Snap and presents selected data fields through custom visualizations. The script uses SnapLogic methods to process the data and build the visualization.
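The sketch below shows one way such a script could be organized, assuming the three-function layout (`snaplogic_init`, `snaplogic_process`, `snaplogic_final`) used by the Remote Python Script Snap's script template. The counting logic, the text-based chart, and the driver loop at the end are illustrative assumptions; the actual script for this use case is not shown in this document.

```python
from collections import Counter

# Running totals for the selected field across all processed rows.
counts = Counter()


def snaplogic_init():
    """Called once before any rows: reset the accumulated state."""
    counts.clear()
    return None


def snaplogic_process(row):
    """Called per input row: tally the value of the selected field."""
    counts[row["SeniorCitizen"]] += 1
    return None


def snaplogic_final():
    """Called once after all rows: emit counts plus a simple text chart."""
    ordered = sorted(counts.items())
    chart = "\n".join(f"{value} | {'#' * n}" for value, n in ordered)
    return {"counts": dict(ordered), "chart": chart}


# Local driver that imitates how the functions would be invoked in order;
# inside SnapLogic the Snap itself makes these calls.
snaplogic_init()
for sample_row in [{"SeniorCitizen": "0"}, {"SeniorCitizen": "1"}, {"SeniorCitizen": "0"}]:
    snaplogic_process(sample_row)
result = snaplogic_final()
```

A real visualization would likely replace the text chart with output from a plotting library, but the accumulate-then-emit structure stays the same.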