Use Case: Speech-to-Text Transcription Using DeepSpeech
Overview
This use case demonstrates the application of speech-to-text transcription using the DeepSpeech engine, enabling machines to transcribe spoken language accurately.
Problem Scenario
Natural Language Processing (NLP) has enabled advancements in human-machine communication, particularly through applications like chatbots and virtual assistants. Recent improvements in speech recognition, powered by deep neural networks, allow machines to accurately understand human speech, making real-time transcription possible.
Description
We use Mozilla’s DeepSpeech, an open-source speech-to-text engine with a 5.6% word error rate, to transcribe speech to text. This use case covers deploying the DeepSpeech model in SnapLogic to provide a transcription API: loading the pre-trained model, testing it, and hosting it.
Objectives
- Model Testing: Use the Remote Python Script Snap to test the pre-trained DeepSpeech model using sample audio data.
- Model Hosting: Host the DeepSpeech model via the Remote Python Script Snap and schedule an Ultra Task to create an API.
- API Testing: Use the REST Post Snap to verify API functionality by sending a sample request to the Ultra Task.
Model testing
In this pipeline, we use the File Reader Snap to read a sample audio file, then encode the binary audio stream to base64 using the Binary to Document Snap. The base64-encoded audio is sent to the Remote Python Script Snap, which applies the DeepSpeech model for transcription.
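The encode/decode step performed by the Binary to Document Snap is equivalent to the following sketch; the `content` field name is an assumption, not the Snap's documented output field:

```python
import base64

def to_document(audio_bytes):
    """Mimic the Binary to Document Snap: wrap raw audio bytes in a
    document with a base64-encoded, JSON-safe string field
    (the field name "content" is an illustrative assumption)."""
    return {"content": base64.b64encode(audio_bytes).decode("ascii")}

def from_document(doc):
    """Inverse step performed downstream: recover the original bytes."""
    return base64.b64decode(doc["content"])
```

Base64 encoding lets the binary audio stream travel through the document-oriented part of the pipeline as an ordinary string.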
The following Python script illustrates the setup in the Remote Python Script Snap:
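A minimal sketch of such a script, assuming the standard `snaplogic_init`/`snaplogic_process`/`snaplogic_final` callback convention and DeepSpeech's Python API (`Model`, `enableExternalScorer`, `stt`); the model paths, the `content` field name, and the `wav_to_int16` helper are illustrative assumptions, not the exact production script:

```python
import base64
import io
import wave

import numpy as np

MODEL_PATH = "models/deepspeech-models.pbmm"    # illustrative path
SCORER_PATH = "models/deepspeech-models.scorer"  # illustrative path

model = None

def wav_to_int16(wav_bytes):
    """Decode 16-bit PCM WAV bytes into the int16 sample array
    that DeepSpeech's stt() expects."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    return np.frombuffer(frames, dtype=np.int16)

def snaplogic_init():
    """Load the pre-trained DeepSpeech model once, before any documents arrive."""
    global model
    from deepspeech import Model  # assumed available on the Snaplex node
    model = Model(MODEL_PATH)
    model.enableExternalScorer(SCORER_PATH)
    return None

def snaplogic_process(row):
    """Decode the base64 audio in each incoming document and transcribe it."""
    audio = wav_to_int16(base64.b64decode(row["content"]))
    return {"transcription": model.stt(audio)}

def snaplogic_final():
    """Called after all documents are consumed; nothing to clean up here."""
    return None
```

The three callbacks map directly onto the functions described below: initialization loads the model, per-document processing decodes and transcribes, and finalization runs after the stream ends.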
The main functions are:
- snaplogic_init: Initializes and loads the DeepSpeech model.
- snaplogic_process: Processes each document, decodes audio, converts it to the appropriate format, and performs transcription.
- snaplogic_final: Finalizes processing after all documents are consumed.
Model hosting
This pipeline is scheduled as an Ultra Task, exposing a REST API for external transcription requests. The core component is again the Remote Python Script Snap, which in this pipeline extracts the $audio field from incoming API requests. A Filter Snap authenticates each request by checking its token against the one stored in the pipeline parameters.
Fields are extracted via the Extract Params Snap, and the response is mapped in the Prepare Response Snap. CORS headers are added to support cross-origin requests.
Building the API
To deploy this pipeline as a REST API, click the calendar icon in the toolbar and select either a Triggered Task or an Ultra Task; the Ultra Task is preferable for low-latency requirements. Because authentication is handled internally by the Filter Snap, no bearer token is required when calling the API.
To access the API URL, navigate to Show tasks in this project in the Manager, locate the task, and view its details.
API testing
In this pipeline, a sample request is generated with the JSON Generator Snap and sent to the Ultra Task via the REST Post Snap; the Mapper Snap then extracts the transcription result from $response.entity.
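A request equivalent to what the JSON Generator and REST Post Snaps produce can be sketched with the standard library; the Ultra Task URL, token value, and payload field names are assumptions for illustration:

```python
import base64
import json
import urllib.request

def build_request(url, token, audio_bytes):
    """Build the POST request that the REST Post Snap would send
    to the Ultra Task (field names are illustrative assumptions)."""
    body = json.dumps({
        "token": token,  # verified by the Filter Snap against pipeline parameters
        "audio": base64.b64encode(audio_bytes).decode("ascii"),
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it requires a live Ultra Task endpoint, e.g.:
# with urllib.request.urlopen(build_request(ULTRA_TASK_URL, TOKEN, audio)) as resp:
#     result = json.load(resp)  # transcription arrives in the response entity
```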
The sample response shows the transcribed text of the audio, with information on the transcription time and audio length.
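An illustrative response shape is given below; the field names and values are assumptions, not output from a real run:

```json
{
  "transcription": "example transcribed text",
  "transcription_time": 1.8,
  "audio_length": 2.1
}
```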