Introducing DeepSpeech

Written by Sue Gee

Wednesday, 29 April 2020

DeepSpeech 0.7.0 is the latest version of Mozilla's open source speech-to-text engine. It was released this week together with new acoustic models trained on American English and a new format for training data that should make training faster.

DeepSpeech 0.7.0, a TensorFlow implementation of Baidu's DeepSpeech architecture, is at the cutting edge of automatic speech recognition technology and yet it has gone largely under the radar.

In fact it is an open source project that Mozilla has been working on since 2016. Its 0.1.0 release was in November 2017, and by the time we first reported on it, when version 0.6.0 was released in December 2019, it had already seen five updates that, in accord with semantic versioning, were backward incompatible, as is the latest release.

So where did DeepSpeech spring from and how does it fit into the ongoing efforts of Mozilla Research into Speech & Machine Learning?

According to the project's documentation, its aim is to create a simple, open, and ubiquitous speech recognition engine.

Simple, in that the engine should not require server-class hardware to execute.
Open, in that the code and models are released under the Mozilla Public License.
Ubiquitous, in that the engine should run on many platforms and have bindings to many different languages.

The architecture of the engine was originally based on the one developed by Baidu and presented in a 2014 paper, Deep Speech: Scaling up end-to-end speech recognition. It has since diverged in many respects from the engine that motivated it. The core of the engine is a recurrent neural network (RNN) trained to ingest speech spectrograms and generate English text transcriptions.
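
To make "speech spectrograms" more concrete, here is a minimal Python sketch of how raw audio samples can be turned into a magnitude spectrogram: the signal is cut into overlapping frames, each frame is tapered with a Hann window, and a discrete Fourier transform gives the strength of each frequency bin. The function names, the assumed 16 kHz sample rate and the 20 ms frame / 10 ms hop are illustrative assumptions, not DeepSpeech's actual feature settings; the engine's own front end computes its features inside TensorFlow and differs in detail.

```python
import cmath
import math


def hann_window(n):
    """Symmetric Hann window of length n; tapers frame edges to reduce spectral leakage."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrogram(signal, frame_len=320, hop=160):
    """Return one row per frame, each row holding |DFT| for bins 0..frame_len // 2.

    Assuming a 16 kHz sample rate, frame_len=320 is 20 ms and hop=160 is 10 ms,
    so bin k corresponds to k * 16000 / 320 = 50 * k Hz.
    """
    window = hann_window(frame_len)
    rows = []
    # Slide a window over the signal; an incomplete trailing frame is dropped.
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [x * w for x, w in zip(signal[start:start + frame_len], window)]
        row = []
        for k in range(frame_len // 2 + 1):
            # Naive O(N^2) DFT for clarity; a real front end would use an FFT.
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            row.append(abs(coeff))
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample_rate = 16000
    # 50 ms of a pure 1 kHz tone (800 samples).
    tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(800)]
    spec = magnitude_spectrogram(tone)
    peak_bin = max(range(len(spec[0])), key=spec[0].__getitem__)
    print(len(spec), "frames; peak at", peak_bin * sample_rate // 320, "Hz")
```

Each row of the result is one time slice; a network such as DeepSpeech's acoustic model consumes a sequence of feature rows like these rather than raw samples.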

DeepSpeech is composed of two main subsystems: an acoustic model and a decoder. The acoustic model is a deep neural network that receives audio features as inputs, and outputs character probabilities. The decoder uses a beam search algorithm to transform the character probabilities into textual transcripts that are then returned by the system.
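
The decoding step can be illustrated with a CTC prefix beam search, the standard way of turning per-frame character probabilities from a CTC-trained acoustic model into text. This is a simplified sketch under stated assumptions, not DeepSpeech's decoder: the function name is hypothetical, index 0 of the alphabet is assumed to be the CTC "blank" symbol, probabilities are kept in linear space (a production decoder works in log space to avoid underflow on long utterances), and no language model is consulted. DeepSpeech's own decoder is implemented natively and can additionally use an external scorer; none of its API is used here.

```python
from collections import defaultdict


def ctc_prefix_beam_search(probs, alphabet, beam_width=8, blank=0):
    """Decode a T x V list of per-frame label probabilities into a string.

    Each beam maps a prefix (tuple of label indices) to two scores: the total
    probability of paths producing that prefix that end in a blank, and of
    those that end in a non-blank label. Keeping them apart is what lets
    "a_a" decode to "aa" while "aa" collapses to "a".
    """
    beams = {(): (1.0, 0.0)}  # start with the empty prefix, "ending" in blank
    for frame in probs:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for c, p in enumerate(frame):
                if c == blank:
                    # A blank never changes the prefix.
                    nxt[prefix][0] += (p_b + p_nb) * p
                    continue
                extended = prefix + (c,)
                if prefix and prefix[-1] == c:
                    # A repeated label starts a new character only after a blank...
                    nxt[extended][1] += p_b * p
                    # ...otherwise CTC merges it into the previous character.
                    nxt[prefix][1] += p_nb * p
                else:
                    nxt[extended][1] += (p_b + p_nb) * p
        # Prune: keep only the beam_width most probable prefixes.
        ranked = sorted(nxt.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(scores) for prefix, scores in ranked[:beam_width]}
    best = max(beams, key=lambda pre: sum(beams[pre]))
    return "".join(alphabet[i] for i in best)


if __name__ == "__main__":
    alphabet = ["_", "a", "b"]  # index 0 is the CTC blank
    frames = [[0.05, 0.90, 0.05], [0.90, 0.05, 0.05], [0.05, 0.90, 0.05]]
    print(ctc_prefix_beam_search(frames, alphabet, beam_width=3))
```

With three frames that favour "a", blank, "a", the search returns "aa": the blank in the middle keeps the two a's from being merged into one, which is exactly the bookkeeping the separate blank and non-blank scores exist for.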

The speech samples used come from a Mozilla project that we encountered at the beginning of 2019: Common Voice, described as a "voice donation" project to improve virtual assistants.

Firefox needs technologies like DeepSpeech to keep up with the likes of Google Chrome, Google Home and Alexa. One of the most common reasons for not using Firefox is that Chrome has services such as translation. Mozilla really does need to get on top of open source AI.

Details of DeepSpeech 0.7.0 and notable changes from the previous release can be found on its GitHub repo, along with its source code and its two acoustic models.

More Information

DeepSpeech On GitHub

DeepSpeech In NuGet Gallery

Mozilla DeepSpeech Gets Smaller

Mozilla Labs Quietly Relaunched

Adversarial Attacks On Voice Input

The State Of Voice As UI

Mozilla Layoffs Raise Questions

Why Mozilla Matters

Last Updated ( Wednesday, 29 April 2020 )
