Python Speech Recognition

The Ultimate Guide To Speech Recognition With Python

Table of Contents

  • How Speech Recognition Works – An Overview
  • Picking a Python Speech Recognition Package
  • Installing SpeechRecognition
  • The Recognizer Class
  • Supported File Types
  • Using record() to Capture Data From a File
  • Capturing Segments With offset and duration
  • The Effect of Noise on Speech Recognition
  • Installing PyAudio
  • The Microphone Class
  • Using listen() to Capture Microphone Input
  • Handling Unrecognizable Speech
  • Putting It All Together: A “Guess the Word” Game
  • Recap and Additional Resources
  • Appendix: Recognizing Speech in Languages Other Than English


Have you ever wondered how to add speech recognition to your Python project? If so, then keep reading! It’s easier than you might think.

Far from being a fad, the overwhelming success of speech-enabled products like Amazon Alexa has proven that some degree of speech support will be an essential aspect of household tech for the foreseeable future. If you think about it, the reasons why are pretty obvious. Incorporating speech recognition into your Python application offers a level of interactivity and accessibility that few technologies can match.

The accessibility improvements alone are worth considering. Speech recognition allows the elderly and the physically and visually impaired to interact with state-of-the-art products and services quickly and naturally—no GUI needed!

Best of all, including speech recognition in a Python project is really simple. In this guide, you’ll find out how. You’ll learn:

  • How speech recognition works
  • What packages are available on PyPI
  • How to install and use the SpeechRecognition package—a full-featured and easy-to-use Python speech recognition library

In the end, you’ll apply what you’ve learned to a simple “Guess the Word” game and see how it all comes together.


Before we get to the nitty-gritty of doing speech recognition in Python, let’s take a moment to talk about how speech recognition works. A full discussion would fill a book, so I won’t bore you with all of the technical details here. In fact, this section isn’t a prerequisite for the rest of the tutorial. If you’d like to get straight to the point, then feel free to skip ahead.

Speech recognition has its roots in research done at Bell Labs in the early 1950s. Early systems were limited to a single speaker and had limited vocabularies of about a dozen words. Modern speech recognition systems have come a long way since their ancient counterparts. They can recognize speech from multiple speakers and have enormous vocabularies in numerous languages.

The first component of speech recognition is, of course, speech. Speech must be converted from physical sound to an electrical signal with a microphone, and then to digital data with an analog-to-digital converter. Once digitized, several models can be used to transcribe the audio to text.

Most modern speech recognition systems rely on what is known as a Hidden Markov Model (HMM). This approach works on the assumption that a speech signal, when viewed on a short enough timescale (say, ten milliseconds), can be reasonably approximated as a stationary process—that is, a process in which statistical properties do not change over time.

In a typical HMM, the speech signal is divided into 10-millisecond fragments. The power spectrum of each fragment, which is essentially a plot of the signal’s power as a function of frequency, is mapped to a vector of real numbers known as cepstral coefficients. The dimension of this vector is usually small—sometimes as low as 10, although more accurate systems may have dimension 32 or more. The final output of the HMM is a sequence of these vectors.

To decode the speech into text, groups of vectors are matched to one or more phonemes—fundamental units of speech. This calculation requires training, since the sound of a phoneme varies from speaker to speaker, and even varies from one utterance to another by the same speaker. A special algorithm is then applied to determine the most likely word (or words) that produce the given sequence of phonemes.

One can imagine that this whole process may be computationally expensive. In many modern speech recognition systems, neural networks are used to simplify the speech signal using techniques for feature transformation and dimensionality reduction before HMM recognition. Voice activity detectors (VADs) are also used to reduce an audio signal to only the portions that are likely to contain speech. This prevents the recognizer from wasting time analyzing unnecessary parts of the signal.

Fortunately, as a Python programmer, you don’t have to worry about any of this. A number of speech recognition services are available for use online through an API, and many of these services offer Python SDKs.

A handful of packages for speech recognition exist on PyPI. A few of them include:

  • apiai
  • google-cloud-speech
  • pocketsphinx
  • SpeechRecognition
  • watson-developer-cloud
  • wit

Some of these packages—such as wit and apiai—offer built-in features, like natural language processing for identifying a speaker’s intent, which go beyond basic speech recognition. Others, like google-cloud-speech, focus solely on speech-to-text conversion.

There is one package that stands out in terms of ease-of-use: SpeechRecognition.

Recognizing speech requires audio input, and SpeechRecognition makes retrieving this input really easy. Instead of having to build scripts for accessing microphones and processing audio files from scratch, SpeechRecognition will have you up and running in just a few minutes.

The SpeechRecognition library acts as a wrapper for several popular speech APIs and is thus extremely flexible. One of these—the Google Web Speech API—supports a default API key that is hard-coded into the SpeechRecognition library. That means you can get started without having to sign up for a service.

The flexibility and ease-of-use of the SpeechRecognition package make it an excellent choice for any Python project. However, support for every feature of each API it wraps is not guaranteed. You will need to spend some time researching the available options to find out if SpeechRecognition will work in your particular case.

So, now that you’re convinced you should try out SpeechRecognition, the next step is getting it installed in your environment.

SpeechRecognition is compatible with Python 2.6, 2.7 and 3.3+, but requires some additional installation steps for Python 2. For this tutorial, I’ll assume you are using Python 3.3+.

You can install SpeechRecognition from a terminal with pip:
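    $ pip install SpeechRecognition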

Once installed, you should verify the installation by opening an interpreter session and typing:
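    >>> import speech_recognition as sr
    >>> sr.__version__
    '3.8.1'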

Note: The version number you get might vary. Version 3.8.1 was the latest at the time of writing.

Go ahead and keep this session open. You’ll start to work with it in just a bit.

SpeechRecognition will work out of the box if all you need to do is work with existing audio files. Specific use cases, however, require a few dependencies. Notably, the PyAudio package is needed for capturing microphone input.

You’ll see which dependencies you need as you read further. For now, let’s dive in and explore the basics of the package.

All of the magic in SpeechRecognition happens with the Recognizer class.

The primary purpose of a Recognizer instance is, of course, to recognize speech. Each instance comes with a variety of settings and functionality for recognizing speech from an audio source.

Creating a Recognizer instance is easy. In your current interpreter session, just type:
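    >>> r = sr.Recognizer()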

Each Recognizer instance has seven methods for recognizing speech from an audio source using various APIs. These are:

  • recognize_bing(): Microsoft Bing Speech
  • recognize_google(): Google Web Speech API
  • recognize_google_cloud(): Google Cloud Speech - requires installation of the google-cloud-speech package
  • recognize_houndify(): Houndify by SoundHound
  • recognize_ibm(): IBM Speech to Text
  • recognize_sphinx(): CMU Sphinx - requires installing PocketSphinx
  • recognize_wit(): Wit.ai

Of the seven, only recognize_sphinx() works offline with the CMU Sphinx engine. The other six all require an internet connection.

A full discussion of the features and benefits of each API is beyond the scope of this tutorial. Since SpeechRecognition ships with a default API key for the Google Web Speech API, you can get started with it right away. For this reason, we’ll use the Web Speech API in this guide. The other six APIs all require authentication with either an API key or a username/password combination. For more information, consult the SpeechRecognition docs.

Caution: The default key provided by SpeechRecognition is for testing purposes only, and Google may revoke it at any time. It is not a good idea to use the Google Web Speech API in production. Even with a valid API key, you’ll be limited to only 50 requests per day, and there is no way to raise this quota. Fortunately, SpeechRecognition’s interface is nearly identical for each API, so what you learn today will be easy to translate to a real-world project.

Each recognize_*() method will throw a speech_recognition.RequestError exception if the API is unreachable. For recognize_sphinx(), this could happen as the result of a missing, corrupt or incompatible Sphinx installation. For the other six methods, RequestError may be thrown if quota limits are met, the server is unavailable, or there is no internet connection.

Ok, enough chit-chat. Let’s get our hands dirty. Go ahead and try to call recognize_google() in your interpreter session.
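    >>> r.recognize_google()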

What happened?

You probably got something that looks like this:
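    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: recognize_google() missing 1 required positional argument: 'audio_data'

(The exact wording of the message may vary with your Python and SpeechRecognition versions.)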

You might have guessed this would happen. How could something be recognized from nothing?

All seven recognize_*() methods of the Recognizer class require an audio_data argument. In each case, audio_data must be an instance of SpeechRecognition’s AudioData class.

There are two ways to create an AudioData instance: from an audio file or audio recorded by a microphone. Audio files are a little easier to get started with, so let’s take a look at that first.

Working With Audio Files

Before you continue, you’ll need to download an audio file. The one I used to get started, “harvard.wav,” can be found here. Make sure you save it to the same directory in which your Python interpreter session is running.

SpeechRecognition makes working with audio files easy thanks to its handy AudioFile class. This class can be initialized with the path to an audio file and provides a context manager interface for reading and working with the file’s contents.

Currently, SpeechRecognition supports the following file formats:

  • WAV: must be in PCM/LPCM format
  • FLAC: must be native FLAC format; OGG-FLAC is not supported

If you are working on x86-based Linux, macOS or Windows, you should be able to work with FLAC files without a problem. On other platforms, you will need to install a FLAC encoder and ensure you have access to the flac command line tool. You can find more information here if this applies to you.

Type the following into your interpreter session to process the contents of the “harvard.wav” file:
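    >>> harvard = sr.AudioFile('harvard.wav')
    >>> with harvard as source:
    ...     audio = r.record(source)
    ...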

The context manager opens the file and reads its contents, storing the data in an AudioFile instance called source. Then the record() method records the data from the entire file into an AudioData instance. You can confirm this by checking the type of audio:
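    >>> type(audio)
    <class 'speech_recognition.AudioData'>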

You can now invoke recognize_google() to attempt to recognize any speech in the audio. Depending on your internet connection speed, you may have to wait several seconds before seeing the result.
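    >>> r.recognize_google(audio)
    'the stale smell of old beer lingers it takes heat to bring out the odor a cold dip restores health and zest a salt pickle taste fine with ham tacos al pastore are my favorite a zestful food is the hot cross bun'

(Your transcription may differ slightly; the Web Speech API’s output isn’t deterministic.)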

Congratulations! You’ve just transcribed your first audio file!

If you’re wondering where the phrases in the “harvard.wav” file come from, they are examples of Harvard Sentences. These phrases were published by the IEEE in 1965 for use in speech intelligibility testing of telephone lines. They are still used in VoIP and cellular testing today.

The Harvard Sentences are comprised of 72 lists of ten phrases. You can find freely available recordings of these phrases on the Open Speech Repository website. Recordings are available in English, Mandarin Chinese, French, and Hindi. They provide an excellent source of free material for testing your code.

What if you only want to capture a portion of the speech in a file? The record() method accepts a duration keyword argument that stops the recording after a specified number of seconds.

For example, the following captures any speech in the first four seconds of the file:
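    >>> with harvard as source:
    ...     audio = r.record(source, duration=4)
    ...
    >>> r.recognize_google(audio)
    'the stale smell of old beer lingers'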

The record() method, when used inside a with block, always moves ahead in the file stream. This means that if you record once for four seconds and then record again for four seconds, the second time returns the four seconds of audio after the first four seconds.
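For example (the transcriptions shown are illustrative):

    >>> with harvard as source:
    ...     audio1 = r.record(source, duration=4)
    ...     audio2 = r.record(source, duration=4)
    ...
    >>> r.recognize_google(audio1)
    'the stale smell of old beer lingers'
    >>> r.recognize_google(audio2)
    'it takes heat to bring out the odor a cold dip restores health and zest'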

Notice that audio2 contains a portion of the third phrase in the file. When specifying a duration, the recording might stop mid-phrase—or even mid-word—which can hurt the accuracy of the transcription. More on this in a bit.

In addition to specifying a recording duration, the record() method can be given a specific starting point using the offset keyword argument. This value represents the number of seconds from the beginning of the file to ignore before starting to record.

To capture only the second phrase in the file, you could start with an offset of four seconds and record for, say, three seconds.
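For example:

    >>> with harvard as source:
    ...     audio = r.record(source, offset=4, duration=3)
    ...
    >>> r.recognize_google(audio)
    'it takes heat to bring out the odor'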

The offset and duration keyword arguments are useful for segmenting an audio file if you have prior knowledge of the structure of the speech in the file. However, using them hastily can result in poor transcriptions. To see this effect, try the following in your interpreter:
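    >>> with harvard as source:
    ...     audio = r.record(source, offset=4.7, duration=2.8)
    ...
    >>> r.recognize_google(audio)
    'Mesquite to the odor Aiko'

(The duration of 2.8 seconds here is just one choice that happens to cut off mid-word on both ends; the transcription shown is illustrative.)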

By starting the recording at 4.7 seconds, you miss the “it t” portion at the beginning of the phrase “it takes heat to bring out the odor,” so the API only got “akes heat,” which it matched to “Mesquite.”

Similarly, at the end of the recording, you captured “a co,” which is the beginning of the third phrase “a cold dip restores health and zest.” This was matched to “Aiko” by the API.

There is another reason you may get inaccurate transcriptions. Noise! The above examples worked well because the audio file is reasonably clean. In the real world, unless you have the opportunity to process audio files beforehand, you cannot expect the audio to be noise-free.

Noise is a fact of life. All audio recordings have some degree of noise in them, and unhandled noise can wreck the accuracy of speech recognition apps.

To get a feel for how noise can affect speech recognition, download the “jackhammer.wav” file here. As always, make sure you save this to your interpreter session’s working directory.

This file has the phrase “the stale smell of old beer lingers” spoken with a loud jackhammer in the background.

What happens when you try to transcribe this file?
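Try it (the transcription shown is illustrative; yours will differ, but it will be similarly garbled):

    >>> jackhammer = sr.AudioFile('jackhammer.wav')
    >>> with jackhammer as source:
    ...     audio = r.record(source)
    ...
    >>> r.recognize_google(audio)
    'the snail smell of old gear vendors'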

So how do you deal with this? One thing you can try is using the adjust_for_ambient_noise() method of the Recognizer class.
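For example (output illustrative):

    >>> with jackhammer as source:
    ...     r.adjust_for_ambient_noise(source)
    ...     audio = r.record(source)
    ...
    >>> r.recognize_google(audio)
    'still smell of old beer vendors'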

That got you a little closer to the actual phrase, but it still isn’t perfect. Also, “the” is missing from the beginning of the phrase. Why is that?

The adjust_for_ambient_noise() method reads the first second of the file stream and calibrates the recognizer to the noise level of the audio. Hence, that portion of the stream is consumed before you call record() to capture the data.

You can adjust the time-frame that adjust_for_ambient_noise() uses for analysis with the duration keyword argument. This argument takes a numerical value in seconds and is set to 1 by default. Try lowering this value to 0.5.
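For example (output illustrative):

    >>> with jackhammer as source:
    ...     r.adjust_for_ambient_noise(source, duration=0.5)
    ...     audio = r.record(source)
    ...
    >>> r.recognize_google(audio)
    'the snail smell like old Beer Mongers'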

Well, that got you “the” at the beginning of the phrase, but now you have some new issues! Sometimes it isn’t possible to remove the effect of the noise—the signal is just too noisy to be dealt with successfully. That’s the case with this file.

If you find yourself running up against these issues frequently, you may have to resort to some pre-processing of the audio. This can be done with audio editing software or a Python package (such as SciPy) that can apply filters to the files. A detailed discussion of this is beyond the scope of this tutorial—check out Allen Downey’s Think DSP book if you are interested. For now, just be aware that ambient noise in an audio file can cause problems and must be addressed in order to maximize the accuracy of speech recognition.

When working with noisy files, it can be helpful to see the actual API response. Most APIs return a JSON string containing many possible transcriptions. The recognize_google() method will always return the most likely transcription unless you force it to give you the full response.

You can do this by setting the show_all keyword argument of the recognize_google() method to True.
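    >>> r.recognize_google(audio, show_all=True)
    {'alternative': [
      {'transcript': 'the snail smell like old Beer Mongers'},
      {'transcript': 'the still smell of old beer vendors'},
      {'transcript': 'the snail smell like old beer vendors'}
    ], 'final': True}

(The transcripts shown are illustrative; you will likely see a different set of alternatives.)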

As you can see, recognize_google() returns a dictionary with the key 'alternative' that points to a list of possible transcripts. The structure of this response may vary from API to API and is mainly useful for debugging.

By now, you have a pretty good idea of the basics of the SpeechRecognition package. You’ve seen how to create an AudioFile instance from an audio file and use the record() method to capture data from the file. You learned how to record segments of a file using the offset and duration keyword arguments of record(), and you experienced the detrimental effect noise can have on transcription accuracy.

Now for the fun part. Let’s transition from transcribing static audio files to making your project interactive by accepting input from a microphone.

Working With Microphones

To access your microphone with SpeechRecognition, you’ll have to install the PyAudio package. Go ahead and close your current interpreter session, and let’s do that.

The process for installing PyAudio will vary depending on your operating system.

Debian Linux

If you’re on Debian-based Linux (like Ubuntu), you can install PyAudio with apt:
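    $ sudo apt-get install python-pyaudio python3-pyaudio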

Once installed, you may still need to run pip install pyaudio, especially if you are working in a virtual environment.

macOS

For macOS, first you will need to install PortAudio with Homebrew, and then install PyAudio with pip:
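    $ brew install portaudio
    $ pip install pyaudio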

Windows

On Windows, you can install PyAudio with pip:
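    $ pip install pyaudio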

Testing the Installation

Once you’ve got PyAudio installed, you can test the installation from the console.
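    $ python -m speech_recognition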

Make sure your default microphone is on and unmuted. If the installation worked, you should see something like this:

    A moment of silence, please...
    Set minimum energy threshold to 600.4452854381937
    Say something!

Go ahead and play around with it a little bit by speaking into your microphone and seeing how well SpeechRecognition transcribes your speech.

Note: If you are on Ubuntu and get some funky output like ‘ALSA lib … Unknown PCM’, refer to this page for tips on suppressing these messages. This output comes from the ALSA package installed with Ubuntu—not SpeechRecognition or PyAudio. In all reality, these messages may indicate a problem with your ALSA configuration, but in my experience, they do not impact the functionality of your code. They are mostly a nuisance.

Open up another interpreter session and create an instance of the Recognizer class.
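    >>> import speech_recognition as sr
    >>> r = sr.Recognizer()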

Now, instead of using an audio file as the source, you will use the default system microphone. You can access this by creating an instance of the Microphone class.
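    >>> mic = sr.Microphone()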

If your system has no default microphone (such as on a Raspberry Pi), or you want to use a microphone other than the default, you will need to specify which one to use by supplying a device index. You can get a list of microphone names by calling the list_microphone_names() static method of the Microphone class.
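    >>> sr.Microphone.list_microphone_names()
    ['HDA Intel PCH: ALC272 Analog (hw:0,0)',
     'HDA Intel PCH: HDMI 0 (hw:0,3)',
     'sysdefault',
     'front',
     'surround40',
     'pulse',
     'default']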

Note that your output may differ from the above example.

The device index of the microphone is the index of its name in the list returned by list_microphone_names(). For example, given the above output, if you want to use the microphone called “front,” which has index 3 in the list, you would create a microphone instance like this:
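    >>> mic = sr.Microphone(device_index=3)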

For most projects, though, you’ll probably want to use the default system microphone.

Now that you’ve got a Microphone instance ready to go, it’s time to capture some input.

Just like the AudioFile class, Microphone is a context manager. You can capture input from the microphone using the listen() method of the Recognizer class inside of the with block. This method takes an audio source as its first argument and records input from the source until silence is detected.
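    >>> with mic as source:
    ...     audio = r.listen(source)
    ...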

Once you execute the with block, try speaking “hello” into your microphone. Wait a moment for the interpreter prompt to display again. Once the “>>>” prompt returns, you’re ready to recognize the speech.
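    >>> r.recognize_google(audio)
    'hello'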

If the prompt never returns, your microphone is most likely picking up too much ambient noise. You can interrupt the process with Ctrl + C to get your prompt back.

To handle ambient noise, you’ll need to use the adjust_for_ambient_noise() method of the Recognizer class, just like you did when trying to make sense of the noisy audio file. Since input from a microphone is far less predictable than input from an audio file, it is a good idea to do this anytime you listen for microphone input.
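    >>> with mic as source:
    ...     r.adjust_for_ambient_noise(source)
    ...     audio = r.listen(source)
    ...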

After running the above code, wait a second for adjust_for_ambient_noise() to do its thing, then try speaking “hello” into the microphone. Again, you will have to wait a moment for the interpreter prompt to return before trying to recognize the speech.

Recall that adjust_for_ambient_noise() analyzes the audio source for one second. If this seems too long to you, feel free to adjust this with the duration keyword argument.

The SpeechRecognition documentation recommends using a duration no less than 0.5 seconds. In some cases, you may find that durations longer than the default of one second generate better results. The minimum value you need depends on the microphone’s ambient environment. Unfortunately, this information is typically unknown during development. In my experience, the default duration of one second is adequate for most applications.

Try typing the previous code example into the interpreter and making some unintelligible noises into the microphone. You should get something like this in response:
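    >>> r.recognize_google(audio)
    Traceback (most recent call last):
      ...
    speech_recognition.UnknownValueError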

Audio that cannot be matched to text by the API raises an UnknownValueError exception. You should always wrap calls to the API with try and except blocks to handle this exception.

Note: You may have to try harder than you expect to get the exception thrown. The API works very hard to transcribe any vocal sounds. Even short grunts were transcribed as words like “how” for me. Coughing, hand claps, and tongue clicks would consistently raise the exception.

Now that you’ve seen the basics of recognizing speech with the SpeechRecognition package, let’s put your newfound knowledge to use and write a small game that picks a random word from a list and gives the user three attempts to guess the word.

Here is the full script (the word list, guess limits, and messages below are illustrative choices):
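    # guessing_game.py
    import random
    import time

    import speech_recognition as sr


    def recognize_speech_from_mic(recognizer, microphone):
        """Transcribe speech recorded from `microphone`."""
        # check that recognizer and microphone arguments are the correct type
        if not isinstance(recognizer, sr.Recognizer):
            raise TypeError("`recognizer` must be `Recognizer` instance")

        if not isinstance(microphone, sr.Microphone):
            raise TypeError("`microphone` must be `Microphone` instance")

        # adjust the recognizer sensitivity to ambient noise and record
        # audio from the microphone
        with microphone as source:
            recognizer.adjust_for_ambient_noise(source)
            audio = recognizer.listen(source)

        # set up the response object
        response = {
            "success": True,
            "error": None,
            "transcription": None,
        }

        # try recognizing the speech in the recording; if a RequestError or
        # UnknownValueError is caught, update the response accordingly
        try:
            response["transcription"] = recognizer.recognize_google(audio)
        except sr.RequestError:
            # API was unreachable or unresponsive
            response["success"] = False
            response["error"] = "API unavailable"
        except sr.UnknownValueError:
            # speech was unintelligible
            response["error"] = "Unable to recognize speech"

        return response


    if __name__ == "__main__":
        # set the list of words, maximum number of guesses, and prompt limit
        WORDS = ["apple", "banana", "grape", "orange", "mango", "lemon"]
        NUM_GUESSES = 3
        PROMPT_LIMIT = 5

        # create recognizer and mic instances
        recognizer = sr.Recognizer()
        microphone = sr.Microphone()

        # get a random word from the list
        word = random.choice(WORDS)

        # show instructions and wait three seconds before starting the game
        print("I'm thinking of one of these words:")
        print(", ".join(WORDS))
        print("You have {} tries to guess which one.\n".format(NUM_GUESSES))
        time.sleep(3)

        for i in range(NUM_GUESSES):
            # prompt the user to say a guess, re-prompting on
            # unrecognizable speech up to PROMPT_LIMIT times
            for j in range(PROMPT_LIMIT):
                print("Guess {}. Speak!".format(i + 1))
                guess = recognize_speech_from_mic(recognizer, microphone)
                if guess["transcription"]:
                    break
                if not guess["success"]:
                    break
                print("I didn't catch that. What did you say?\n")

            # if there was an error, stop the game
            if guess["error"]:
                print("ERROR: {}".format(guess["error"]))
                break

            # show the user the transcription
            print("You said: {}".format(guess["transcription"]))

            # determine if the guess is correct and if attempts remain
            guess_is_correct = guess["transcription"].lower() == word.lower()
            user_has_more_attempts = i < NUM_GUESSES - 1

            if guess_is_correct:
                print("Correct! You win!")
                break
            elif user_has_more_attempts:
                print("Incorrect. Try again.\n")
            else:
                print("Sorry, you lose!\nI was thinking of '{}'.".format(word))
                break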

Let’s break that down a little bit.

The recognize_speech_from_mic() function takes a Recognizer and Microphone instance as arguments and returns a dictionary with three keys. The first key, "success", is a boolean that indicates whether or not the API request was successful. The second key, "error", is either None or an error message indicating that the API is unavailable or the speech was unintelligible. Finally, the "transcription" key contains the transcription of the audio recorded by the microphone.

The function first checks that the recognizer and microphone arguments are of the correct type, and raises a TypeError if either is invalid:
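    if not isinstance(recognizer, sr.Recognizer):
        raise TypeError("`recognizer` must be `Recognizer` instance")

    if not isinstance(microphone, sr.Microphone):
        raise TypeError("`microphone` must be `Microphone` instance")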

The listen() method is then used to record microphone input:
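    with microphone as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)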

The adjust_for_ambient_noise() method is used to calibrate the recognizer for changing noise conditions each time the recognize_speech_from_mic() function is called.

Next, recognize_google() is called to transcribe any speech in the recording. A try...except block is used to catch the RequestError and UnknownValueError exceptions and handle them accordingly. The success of the API request, any error messages, and the transcribed speech are stored in the success, error and transcription keys of the response dictionary, which is returned by the recognize_speech_from_mic() function.

You can test the recognize_speech_from_mic() function by saving the above script to a file called “guessing_game.py” and running the following in an interpreter session:
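    >>> import speech_recognition as sr
    >>> from guessing_game import recognize_speech_from_mic
    >>> r = sr.Recognizer()
    >>> m = sr.Microphone()
    >>> recognize_speech_from_mic(r, m)
    {'success': True, 'error': None, 'transcription': 'hello'}

(Say something into the microphone after the call starts; the transcription shown is just an example.)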

The game itself is pretty simple. First, a list of words, a maximum number of allowed guesses and a prompt limit are declared:
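    WORDS = ["apple", "banana", "grape", "orange", "mango", "lemon"]
    NUM_GUESSES = 3
    PROMPT_LIMIT = 5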

Next, a Recognizer and Microphone instance is created and a random word is chosen from WORDS:
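    recognizer = sr.Recognizer()
    microphone = sr.Microphone()
    word = random.choice(WORDS)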

After printing some instructions and waiting for three seconds, a for loop is used to manage each user attempt at guessing the chosen word. The first thing inside the for loop is another for loop that prompts the user at most PROMPT_LIMIT times for a guess, attempting to recognize the input each time with the recognize_speech_from_mic() function and storing the dictionary returned to the local variable guess.

If the "transcription" key of guess is not None, then the user’s speech was transcribed and the inner loop is terminated with break. If the speech was not transcribed and the "success" key is set to False, then an API error occurred and the loop is again terminated with break. Otherwise, the API request was successful but the speech was unrecognizable. The user is warned and the for loop repeats, giving the user another chance at the current attempt.

Once the inner for loop terminates, the guess dictionary is checked for errors. If any occurred, the error message is displayed and the outer for loop is terminated with break, which will end the program execution.

If there weren’t any errors, the transcription is compared to the randomly selected word. The lower() method for string objects is used to ensure better matching of the guess to the chosen word. The API may return speech matched to the word “apple” as “Apple” or “apple,” and either response should count as a correct answer.

If the guess was correct, the user wins and the game is terminated. If the user was incorrect and has any remaining attempts, the outer for loop repeats and a new guess is retrieved. Otherwise, the user loses the game.

When run, the output will look something like this:
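    I'm thinking of one of these words:
    apple, banana, grape, orange, mango, lemon
    You have 3 tries to guess which one.

    Guess 1. Speak!
    You said: banana
    Incorrect. Try again.

    Guess 2. Speak!
    You said: lemon
    Incorrect. Try again.

    Guess 3. Speak!
    You said: orange
    Sorry, you lose!
    I was thinking of 'apple'.

(This run is illustrative; your guesses and the chosen word will differ.)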

In this tutorial, you’ve seen how to install the SpeechRecognition package and use its Recognizer class to easily recognize speech from both a file—using record()—and microphone input—using listen(). You also saw how to process segments of an audio file using the offset and duration keyword arguments of the record() method.

You’ve seen the effect noise can have on the accuracy of transcriptions, and have learned how to adjust a Recognizer instance’s sensitivity to ambient noise with adjust_for_ambient_noise(). You have also learned which exceptions a Recognizer instance may throw—RequestError for bad API requests and UnknownValueError for unintelligible speech—and how to handle these with try...except blocks.

Speech recognition is a deep subject, and what you have learned here barely scratches the surface. If you’re interested in learning more, here are some additional resources.

For more information on the SpeechRecognition package:

  • Library reference
  • Troubleshooting page

A few interesting internet resources:

  • Behind the Mic: The Science of Talking with Computers. A short film about speech processing by Google.
  • A Historical Perspective of Speech Recognition by Huang, Baker and Reddy. Communications of the ACM (2014). This article provides an in-depth and scholarly look at the evolution of speech recognition technology.
  • The Past, Present and Future of Speech Recognition Technology by Clark Boyd at The Startup. This blog post presents an overview of speech recognition technology, with some thoughts about the future.

Some good books about speech recognition:

  • The Voice in the Machine: Building Computers That Understand Speech, Pieraccini, MIT Press (2012). An accessible general-audience book covering the history of, as well as modern advances in, speech processing.
  • Fundamentals of Speech Recognition, Rabiner and Juang, Prentice Hall (1993). Rabiner, a researcher at Bell Labs, was instrumental in designing some of the first commercially viable speech recognizers. This book is now over 20 years old, but a lot of the fundamentals remain the same.
  • Automatic Speech Recognition: A Deep Learning Approach, Yu and Deng, Springer (2014). Yu and Deng are researchers at Microsoft and both very active in the field of speech processing. This book covers a lot of modern approaches and cutting-edge research but is not for the mathematically faint-of-heart.

Throughout this tutorial, we’ve been recognizing speech in English, which is the default language for each recognize_*() method of the SpeechRecognition package. However, it is absolutely possible to recognize speech in other languages, and is quite simple to accomplish.

To recognize speech in a different language, set the language keyword argument of the recognize_*() method to a string corresponding to the desired language. Most of the methods accept a BCP-47 language tag, such as 'en-US' for American English, or 'fr-FR' for French. For example, the following recognizes French speech in an audio file:
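    >>> import speech_recognition as sr
    >>> r = sr.Recognizer()
    >>> with sr.AudioFile('path/to/audiofile.wav') as source:
    ...     audio = r.record(source)
    ...
    >>> r.recognize_google(audio, language='fr-FR')

(Replace 'path/to/audiofile.wav' with the path to a recording of French speech.)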

Only the following methods accept a language keyword argument:

  • recognize_bing()
  • recognize_google()
  • recognize_google_cloud()
  • recognize_ibm()
  • recognize_sphinx()

To find out which language tags are supported by the API you are using, you’ll have to consult the corresponding documentation. A list of tags accepted by recognize_google() can be found in this Stack Overflow answer.


Speech recognition is an important feature in several applications, such as home automation and artificial intelligence. This article aims to provide an introduction to using the SpeechRecognition and pyttsx3 libraries in Python. Installation required (example commands follow this list):

  • Python SpeechRecognition module
  • PyAudio (for microphone input); Linux users can install it with their package manager
  • Windows users can install PyAudio with pip
  • Python pyttsx3 module
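A plausible set of install commands (the apt package name may vary by distribution):

    $ pip install SpeechRecognition
    $ sudo apt-get install python3-pyaudio    # Linux
    $ pip install PyAudio                     # Windows
    $ pip install pyttsx3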

Speech Input Using a Microphone and Translation of Speech to Text    

  • Allow Adjusting for Ambient Noise: Since the surrounding noise varies, we must allow the program a second or two to adjust the energy threshold of recording so that it matches the external noise level.
  • Speech to text translation: This is done with the help of Google Speech Recognition. This requires an active internet connection to work. There are offline recognition systems, such as PocketSphinx, but these have a very rigorous installation process that requires several dependencies. Google Speech Recognition is one of the easiest to use. A minimal sketch combining these two steps follows this list.
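A minimal sketch of these two steps, using the SpeechRecognition API covered earlier in this guide:

    import speech_recognition as sr

    r = sr.Recognizer()

    # capture speech from the default system microphone
    with sr.Microphone() as source:
        # give the recognizer a moment to calibrate to ambient noise
        r.adjust_for_ambient_noise(source, duration=0.5)
        print("Say something!")
        audio = r.listen(source)

    # transcribe with Google Speech Recognition (requires internet access)
    try:
        print("You said: " + r.recognize_google(audio))
    except sr.UnknownValueError:
        print("Could not understand the audio.")
    except sr.RequestError as e:
        print("Request failed: {}".format(e))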

Translation of Text to Speech: First, we need to import the pyttsx3 library and then initialize it using the init() function. This function may take two arguments:

  • drivername: [name of an available driver] sapi5 on Windows | nsss on macOS
  • debug: to enable or disable debug output

After initialization, we will make the program speak the text using the say() function. This method may also take two arguments:

  • text: Any text you wish to hear.   
  • name: To set a name for this speech. (optional)   

Finally, to run the speech, we use runAndWait(). None of the say() texts will be spoken until the interpreter encounters runAndWait(). Below is a sketch of the implementation.
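    import pyttsx3

    # initialize the engine; pyttsx3 picks an available driver
    # (sapi5 on Windows, nsss on macOS, espeak on Linux)
    engine = pyttsx3.init()

    # queue a phrase to be spoken
    engine.say("Hello, this text is being converted to speech.")

    # block while all queued phrases are spoken
    engine.runAndWait()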

                   


SpeechRecognition 3.10.4

pip install SpeechRecognition

Released: May 5, 2024

Library for performing speech recognition, with support for several engines and APIs, online and offline.


License: BSD License (BSD)

Author: Anthony Zhang (Uberi)

Tags: speech, recognition, voice, sphinx, google, wit, bing, api, houndify, ibm, snowboy

Requires: Python >=3.8


Project description


UPDATE 2022-02-09 : Hey everyone! This project started as a tech demo, but these days it needs more time than I have to keep up with all the PRs and issues. Therefore, I’d like to put out an open invite for collaborators - just reach out at me @ anthonyz . ca if you’re interested!

Speech recognition engine/API support includes CMU Sphinx (works offline), Google Speech Recognition, Google Cloud Speech API, Wit.ai, Microsoft Bing Voice Recognition, Houndify, IBM Speech to Text, Snowboy hotword detection, Vosk, and OpenAI Whisper.

Quickstart: pip install SpeechRecognition. See the “Installing” section for more details.

To quickly try it out, run python -m speech_recognition after installing.


Library Reference

The library reference documents every publicly accessible object in the library. This document is also included under reference/library-reference.rst.

See Notes on using PocketSphinx for information about installing languages, compiling PocketSphinx, and building language packs from online resources. This document is also included under reference/pocketsphinx.rst.

You have to install Vosk models to use Vosk. Models are available for download; place them in the models folder of your project, like “your-project-folder/models/your-vosk-model”.

See the examples/ directory in the repository root for usage examples.

First, make sure you have all the requirements listed in the “Requirements” section.

The easiest way to install this is using pip install SpeechRecognition.

Otherwise, download the source distribution from PyPI, and extract the archive.

In the folder, run python setup.py install.

Requirements

To use all of the functionality of the library, you should have:

  • Python 3.8+ (required)
  • PyAudio 0.2.11+ (required only if you need to use microphone input, Microphone)
  • PocketSphinx (required only if you need to use the Sphinx recognizer, recognizer_instance.recognize_sphinx)
  • Google API Client Library for Python (required only if you need to use the Google Cloud Speech API, recognizer_instance.recognize_google_cloud)
  • FLAC encoder (required only if the system is not x86-based Windows/Linux/OS X)

The following requirements are optional, but can improve or extend functionality in some situations:

  • Vosk (required only if you need to use the Vosk recognizer, recognizer_instance.recognize_vosk)
  • Whisper (required only if you need to use Whisper, recognizer_instance.recognize_whisper)
  • openai (required only if you need to use the Whisper API, recognizer_instance.recognize_whisper_api)

The following sections go over the details of each requirement.

The first software requirement is Python 3.8+. This is required to use the library.

PyAudio (for microphone users)

PyAudio is required if and only if you want to use microphone input (Microphone). PyAudio version 0.2.11+ is required, as earlier versions have known memory management bugs when recording from microphones in certain situations.

If not installed, everything in the library will still work, except attempting to instantiate a Microphone object will raise an AttributeError.

The installation instructions on the PyAudio website are quite good - for convenience, they are summarized below:

PyAudio wheel packages for common 64-bit Python versions on Windows and Linux are included for convenience, under the third-party/ directory in the repository root. To install, simply run pip install wheel followed by pip install ./third-party/WHEEL_FILENAME (replace pip with pip3 if using Python 3) in the repository root directory.

PocketSphinx-Python (for Sphinx users)

PocketSphinx-Python is required if and only if you want to use the Sphinx recognizer (recognizer_instance.recognize_sphinx).

PocketSphinx-Python wheel packages for 64-bit Python 3.4 and 3.5 on Windows are included for convenience, under the third-party/ directory. To install, simply run pip install wheel followed by pip install ./third-party/WHEEL_FILENAME (replace pip with pip3 if using Python 3) in the SpeechRecognition folder.

On Linux and other POSIX systems (such as OS X), follow the instructions under “Building PocketSphinx-Python from source” in Notes on using PocketSphinx for installation instructions.

Note that the versions available in most package repositories are outdated and will not work with the bundled language data. Using the bundled wheel packages or building from source is recommended.

Vosk (for Vosk users)

Vosk API is required if and only if you want to use the Vosk recognizer (recognizer_instance.recognize_vosk).

You can install it with python3 -m pip install vosk.

You also have to install Vosk models. Models are available for download; place them in the models folder of your project, like “your-project-folder/models/your-vosk-model”.

Google Cloud Speech Library for Python (for Google Cloud Speech API users)

Google Cloud Speech library for Python is required if and only if you want to use the Google Cloud Speech API (recognizer_instance.recognize_google_cloud).

If not installed, everything in the library will still work, except calling recognizer_instance.recognize_google_cloud will raise a RequestError.

According to the official installation instructions, the recommended way to install this is using pip: execute pip install google-cloud-speech (replace pip with pip3 if using Python 3).

FLAC (for some systems)

A FLAC encoder is required to encode the audio data to send to the API. If using Windows (x86 or x86-64), OS X (Intel Macs only, OS X 10.6 or higher), or Linux (x86 or x86-64), this is already bundled with this library - you do not need to install anything.

Otherwise, ensure that you have the flac command line tool, which is often available through the system package manager. For example, this would usually be sudo apt-get install flac on Debian-derivatives, or brew install flac on OS X with Homebrew.

Whisper (for Whisper users)

Whisper is required if and only if you want to use Whisper (recognizer_instance.recognize_whisper).

You can install it with python3 -m pip install SpeechRecognition[whisper-local].

Whisper API (for Whisper API users)

The library openai is required if and only if you want to use the Whisper API (recognizer_instance.recognize_whisper_api).

If not installed, everything in the library will still work, except calling recognizer_instance.recognize_whisper_api will raise a RequestError.

You can install it with python3 -m pip install SpeechRecognition[whisper-api].

Troubleshooting

The recognizer tries to recognize speech even when I’m not speaking, or after I’m done speaking.

Try increasing the recognizer_instance.energy_threshold property. This is basically how sensitive the recognizer is to when recognition should start. Higher values mean that it will be less sensitive, which is useful if you are in a loud room.

This value depends entirely on your microphone or audio data. There is no one-size-fits-all value, but good values typically range from 50 to 4000.

Also, check on your microphone volume settings. If it is too sensitive, the microphone may be picking up a lot of ambient noise. If it is too insensitive, the microphone may be rejecting speech as just noise.

The recognizer can’t recognize speech right after it starts listening for the first time.

The recognizer_instance.energy_threshold property is probably set to a value that is too high to start off with, and then being adjusted lower automatically by dynamic energy threshold adjustment. Before it is at a good level, the energy threshold is so high that speech is just considered ambient noise.

The solution is to decrease this threshold, or call recognizer_instance.adjust_for_ambient_noise beforehand, which will set the threshold to a good value automatically.

The recognizer doesn’t understand my particular language/dialect.

Try setting the recognition language to your language/dialect. To do this, see the documentation for recognizer_instance.recognize_sphinx, recognizer_instance.recognize_google, recognizer_instance.recognize_wit, recognizer_instance.recognize_bing, recognizer_instance.recognize_api, recognizer_instance.recognize_houndify, and recognizer_instance.recognize_ibm.

For example, if your language/dialect is British English, it is better to use "en-GB" as the language rather than "en-US".

The recognizer hangs on recognizer_instance.listen; specifically, when it’s calling Microphone.MicrophoneStream.read.

This usually happens when you’re using a Raspberry Pi board, which doesn’t have audio input capabilities by itself. This causes the default microphone used by PyAudio to simply block when we try to read it. If you happen to be using a Raspberry Pi, you’ll need a USB sound card (or USB microphone).

Once you do this, change all instances of Microphone() to Microphone(device_index=MICROPHONE_INDEX), where MICROPHONE_INDEX is the hardware-specific index of the microphone.

To figure out what the value of MICROPHONE_INDEX should be, run the following code:
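    import speech_recognition as sr

    for index, name in enumerate(sr.Microphone.list_microphone_names()):
        print("Microphone with name \"{1}\" found for `Microphone(device_index={0})`".format(index, name))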

This will print out something like the following:
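    Microphone with name "HDA Intel HDMI: 0 (hw:0,3)" found for `Microphone(device_index=0)`
    Microphone with name "HDA Intel HDMI: 1 (hw:0,7)" found for `Microphone(device_index=1)`
    Microphone with name "HDA Intel HDMI: 2 (hw:0,8)" found for `Microphone(device_index=2)`
    Microphone with name "Blue Snowball: USB Audio (hw:1,0)" found for `Microphone(device_index=3)`

(The device names shown are illustrative; yours will reflect your own hardware.)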

Now, to use the Snowball microphone, you would change Microphone() to Microphone(device_index=3).

Calling Microphone() gives the error IOError: No Default Input Device Available.

As the error says, the program doesn’t know which microphone to use.

To proceed, either use Microphone(device_index=MICROPHONE_INDEX, ...) instead of Microphone(...), or set a default microphone in your OS. You can obtain possible values of MICROPHONE_INDEX using the code in the troubleshooting entry right above this one.

The program doesn’t run when compiled with PyInstaller.

As of PyInstaller version 3.0, SpeechRecognition is supported out of the box. If you’re getting weird issues when compiling your program using PyInstaller, simply update PyInstaller.

You can easily do this by running pip install --upgrade pyinstaller.

On Ubuntu/Debian, I get annoying output in the terminal saying things like “bt_audio_service_open: [...] Connection refused” and various others.

The “bt_audio_service_open” error means that you have a Bluetooth audio device, but as a physical device is not currently connected, we can’t actually use it - if you’re not using a Bluetooth microphone, then this can be safely ignored. If you are, and audio isn’t working, then double check to make sure your microphone is actually connected. There does not seem to be a simple way to disable these messages.

For errors of the form “ALSA lib [...] Unknown PCM”, see this StackOverflow answer. Basically, to get rid of an error of the form “Unknown PCM cards.pcm.rear”, simply comment out pcm.rear cards.pcm.rear in /usr/share/alsa/alsa.conf, ~/.asoundrc, and /etc/asound.conf.

For “jack server is not running or cannot be started” or “connect(2) call to /dev/shm/jack-1000/default/jack_0 failed (err=No such file or directory)” or “attempt to connect to server failed”, these are caused by ALSA trying to connect to JACK, and can be safely ignored. I’m not aware of any simple way to turn those messages off at this time, besides entirely disabling printing while starting the microphone.

On OS X, I get a ChildProcessError saying that it couldn’t find the system FLAC converter, even though it’s installed.

Installing FLAC for OS X directly from the source code will not work, since it doesn’t correctly add the executables to the search path.

Installing FLAC using Homebrew ensures that the search path is correctly updated. First, ensure you have Homebrew, then run brew install flac to install the necessary files.

To hack on this library, first make sure you have all the requirements listed in the “Requirements” section.

To install/reinstall the library locally, run python -m pip install -e .[dev] in the project root directory.

Before a release, the version number is bumped in README.rst and speech_recognition/__init__.py . Version tags are then created using git config gpg.program gpg2 && git config user.signingkey DB45F6C431DE7C2DCD99FF7904882258A4063489 && git tag -s VERSION_GOES_HERE -m "Version VERSION_GOES_HERE" .

Releases are done by running make-release.sh VERSION_GOES_HERE to build the Python source packages, sign them, and upload them to PyPI.

To run all the tests:
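    $ python -m unittest discover --verbose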

To run static analysis:

To ensure RST is well-formed:

Testing is also done automatically by GitHub Actions, upon every push.

FLAC Executables

The included flac-win32 executable is the official FLAC 1.3.2 32-bit Windows binary.

The included flac-linux-x86 and flac-linux-x86_64 executables are built from the FLAC 1.3.2 source code with Manylinux to ensure that it’s compatible with a wide variety of distributions.

The built FLAC executables should be bit-for-bit reproducible. To rebuild them, run the following inside the project directory on a Debian-like system:

The included flac-mac executable is extracted from xACT 2.39, which is a frontend for FLAC 1.3.2 that conveniently includes binaries for all of its encoders. Specifically, it is a copy of xACT 2.39/xACT.app/Contents/Resources/flac in xACT2.39.zip.

Please report bugs and suggestions at the issue tracker!

How to cite this library (APA style):

Zhang, A. (2017). Speech Recognition (Version 3.8) [Software]. Available from https://github.com/Uberi/speech_recognition#readme.

How to cite this library (Chicago style):

Zhang, Anthony. 2017. Speech Recognition (version 3.8).

Also check out the Python Baidu Yuyin API, which is based on an older version of this project, and adds support for Baidu Yuyin. Note that Baidu Yuyin is only available inside China.

Copyright 2014-2017 Anthony Zhang (Uberi) . The source code for this library is available online at GitHub .

SpeechRecognition is made available under the 3-clause BSD license. See LICENSE.txt in the project’s root directory for more information.

For convenience, all the official distributions of SpeechRecognition already include a copy of the necessary copyright notices and licenses. In your project, you can simply state that licensing information for SpeechRecognition can be found within the SpeechRecognition README, and make sure SpeechRecognition is visible to users if they wish to see it.

SpeechRecognition distributes source code, binaries, and language files from CMU Sphinx . These files are BSD-licensed and redistributable as long as copyright notices are correctly retained. See speech_recognition/pocketsphinx-data/*/LICENSE*.txt and third-party/LICENSE-Sphinx.txt for license details for individual parts.

SpeechRecognition distributes source code and binaries from PyAudio . These files are MIT-licensed and redistributable as long as copyright notices are correctly retained. See third-party/LICENSE-PyAudio.txt for license details.

SpeechRecognition distributes binaries from FLAC - speech_recognition/flac-win32.exe , speech_recognition/flac-linux-x86 , and speech_recognition/flac-mac . These files are GPLv2-licensed and redistributable, as long as the terms of the GPL are satisfied. The FLAC binaries are an aggregate of separate programs , so these GPL restrictions do not apply to the library or your programs that use the library, only to FLAC itself. See LICENSE-FLAC.txt for license details.



RealtimeSTT ( KoljaB/RealtimeSTT )

A robust, efficient, low-latency speech-to-text library with advanced voice activity detection, wake word activation and instant transcription.

Easy-to-use, low-latency speech-to-text library for realtime applications

About the Project

RealtimeSTT listens to the microphone and transcribes voice into text.

Hint: Check out Linguflex , the original project from which RealtimeSTT is spun off. It lets you control your environment by speaking and is one of the most capable and sophisticated open-source assistants currently available.

It's ideal for:

  • Voice Assistants
  • Applications requiring fast and precise speech-to-text conversion

Latest Version: v0.1.15

See release history .

Hint: Since we use the multiprocessing module now, ensure to include the if __name__ == '__main__': protection in your code to prevent unexpected behavior, especially on platforms like Windows. For a detailed explanation on why this is important, visit the official Python documentation on multiprocessing .
  • Voice Activity Detection : Automatically detects when you start and stop speaking.
  • Realtime Transcription : Transforms speech to text in real-time.
  • Wake Word Activation : Can activate upon detecting a designated wake word.

Hint : Check out RealtimeTTS , the output counterpart of this library, for text-to-voice capabilities. Together, they form a powerful realtime audio wrapper around large language models.

This library uses:

  • WebRTCVAD for initial voice activity detection.
  • SileroVAD for more accurate verification.
  • Faster_Whisper for instant (GPU-accelerated) transcription.
  • Porcupine for wake word detection.

These components represent the "industry standard" for cutting-edge applications, providing the most modern and effective foundation for building high-end solutions.

Installation
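RealtimeSTT is published on PyPI, so the basic installation is a single pip command:

    pip install RealtimeSTT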

This will install all the necessary dependencies, including a CPU-only version of PyTorch.

Although it is possible to run RealtimeSTT with a CPU-only installation (use a small model like "tiny" or "base" in that case), you will get a much better experience using:

GPU Support with CUDA (recommended)

Additional steps are needed for a GPU-optimized installation. These steps are recommended for those who require better performance and have a compatible NVIDIA GPU.

Note : To check if your NVIDIA GPU supports CUDA, visit the official CUDA GPUs list .

To use RealtimeSTT with GPU support via CUDA please follow these steps:

Install NVIDIA CUDA Toolkit 11.8 :

  • Visit NVIDIA CUDA Toolkit Archive .
  • Select operating system and version.
  • Download and install the software.

Install NVIDIA cuDNN 8.7.0 for CUDA 11.x :

  • Visit NVIDIA cuDNN Archive .
  • Click on "Download cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x".

Install ffmpeg :

Note : Installation of ffmpeg might not actually be needed to operate RealtimeSTT (thanks to jgilbert2017 for pointing this out).

You can download an installer for your OS from the ffmpeg Website .

Or use a package manager (typical commands for each are sketched after this list):

On Ubuntu or Debian :

On Arch Linux :

On MacOS using Homebrew ( https://brew.sh/ ):

On Windows using Winget (see the official documentation ):

On Windows using Chocolatey ( https://chocolatey.org/ ):

On Windows using Scoop ( https://scoop.sh/ ):
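The usual commands are as follows (package identifiers may vary slightly, particularly for Winget):

    sudo apt install ffmpeg      # Ubuntu or Debian
    sudo pacman -S ffmpeg        # Arch Linux
    brew install ffmpeg          # macOS (Homebrew)
    winget install ffmpeg        # Windows (Winget)
    choco install ffmpeg         # Windows (Chocolatey)
    scoop install ffmpeg         # Windows (Scoop)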

Install PyTorch with CUDA support :
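The exact command depends on your platform and CUDA version; one documented pattern for the CUDA 11.8 wheels (check the PyTorch website for the variant matching your setup) is:

    pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118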

Quick Start

Basic usage:
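A minimal sketch, including the if __name__ == '__main__': guard mentioned above (the blocking recorder.text() call returns a transcription once you stop speaking):

    from RealtimeSTT import AudioToTextRecorder

    if __name__ == '__main__':
        recorder = AudioToTextRecorder()
        print("Say something...")
        # text() blocks until a complete utterance is recorded and transcribed
        print(recorder.text())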

Manual Recording

Start and stop of recording are manually triggered.
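A sketch of the manual pattern, assuming the recorder's start() and stop() methods:

    from RealtimeSTT import AudioToTextRecorder

    if __name__ == '__main__':
        recorder = AudioToTextRecorder()
        input("Press Enter to start recording...")
        recorder.start()
        input("Press Enter to stop recording...")
        recorder.stop()
        print("Transcription:", recorder.text())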

Automatic Recording

Recording based on voice activity detection.

When running recorder.text in a loop it is recommended to use a callback, allowing the transcription to be run asynchronously:
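A sketch of the callback pattern (the name process_text is chosen here for illustration):

    from RealtimeSTT import AudioToTextRecorder

    def process_text(text):
        print(text)

    if __name__ == '__main__':
        recorder = AudioToTextRecorder()
        while True:
            # Passing a callback lets recording continue while transcription runs
            recorder.text(process_text)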

Wake Words

Keyword activation before detecting voice. Write a comma-separated list of your desired activation keywords into the wake_words parameter (a sketch follows below). You can choose wake words from this list: alexa, americano, blueberry, bumblebee, computer, grapefruits, grasshopper, hey google, hey siri, jarvis, ok google, picovoice, porcupine, terminator.
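A sketch (the wake word "jarvis" is just an example choice):

    from RealtimeSTT import AudioToTextRecorder

    if __name__ == '__main__':
        recorder = AudioToTextRecorder(wake_words="jarvis")
        print('Say "Jarvis", then speak.')
        print(recorder.text())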

You can set callback functions to be executed on different events (see Configuration ) :
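For instance, using the on_recording_start and on_recording_stop parameters documented in the Configuration section (the callback names here are illustrative):

    def my_start_callback():
        print("Recording started!")

    def my_stop_callback():
        print("Recording stopped!")

    recorder = AudioToTextRecorder(on_recording_start=my_start_callback,
                                   on_recording_stop=my_stop_callback)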

Feed chunks

If you don't want to use the local microphone set use_microphone parameter to false and provide raw PCM audiochunks in 16-bit mono (samplerate 16000) with this method:
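A sketch of feeding chunks, where chunk is assumed to be a bytes object of raw 16-bit mono PCM at 16 kHz (e.g. read from a file or a network stream):

    recorder = AudioToTextRecorder(use_microphone=False)
    recorder.feed_audio(chunk)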

You can shutdown the recorder safely by using the context manager protocol:
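For example:

    with AudioToTextRecorder() as recorder:
        print(recorder.text())
    # the recorder is shut down automatically on exiting the block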

Or you can call the shutdown method manually (if using "with" is not feasible):
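For example:

    recorder.shutdown()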

Testing the Library

The test subdirectory contains a set of scripts to help you evaluate and understand the capabilities of the RealtimeSTT library.

Test scripts that depend on the RealtimeTTS library may require you to enter your Azure service region within the script. When using OpenAI-, Azure-, or Elevenlabs-related demo scripts, the API keys should be provided in the environment variables OPENAI_API_KEY, AZURE_SPEECH_KEY, and ELEVENLABS_API_KEY (see RealtimeTTS ).

simple_test.py

  • Description : A "hello world" styled demonstration of the library's simplest usage.

realtimestt_test.py

  • Description : Showcasing live-transcription.

wakeword_test.py

  • Description : A demonstration of the wakeword activation.

translator.py

  • Dependencies : Run pip install openai realtimetts .
  • Description : Real-time translations into six different languages.

openai_voice_interface.py

  • Description : Wake word activated and voice based user interface to the OpenAI API.

advanced_talk.py

  • Dependencies : Run pip install openai keyboard realtimetts .
  • Description : Choose TTS engine and voice before starting AI conversation.

minimalistic_talkbot.py

  • Description : A basic talkbot in 20 lines of code.

The example_app subdirectory contains a polished user interface application for the OpenAI API based on PyQt5.

Configuration

Initialization parameters for AudioToTextRecorder

When you initialize the AudioToTextRecorder class, you have various options to customize its behavior.
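For instance, a recorder combining several of the parameters documented below might be constructed like this (the values are illustrative):

    from RealtimeSTT import AudioToTextRecorder

    recorder = AudioToTextRecorder(
        model="base.en",                     # transcription model size
        language="en",                       # skip language auto-detection
        enable_realtime_transcription=True,  # stream partial results
        wake_words="jarvis",                 # require a wake word
        silero_sensitivity=0.6,              # voice activity detection sensitivity
    )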

General Parameters

model (str, default="tiny"): Model size or path for transcription.

  • Options: 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2'.
  • Note: If a size is provided, the model will be downloaded from the Hugging Face Hub.

language (str, default=""): Language code for transcription. If left empty, the model will try to auto-detect the language. Supported language codes are listed in Whisper Tokenizer library .

compute_type (str, default="default"): Specifies the type of computation to be used for transcription. See Whisper Quantization

input_device_index (int, default=0): Audio Input Device Index to use.

gpu_device_index (int, default=0): GPU Device Index to use. The model can also be loaded on multiple GPUs by passing a list of IDs (e.g. [0, 1, 2, 3]).

device (str, default="cuda"): Device for model to use. Can either be "cuda" or "cpu".

on_recording_start : A callable function triggered when recording starts.

on_recording_stop : A callable function triggered when recording ends.

on_transcription_start : A callable function triggered when transcription starts.

ensure_sentence_starting_uppercase (bool, default=True): Ensures that every sentence detected by the algorithm starts with an uppercase letter.

ensure_sentence_ends_with_period (bool, default=True): Ensures that every sentence that doesn't already end with punctuation such as "?" or "!" ends with a period.

use_microphone (bool, default=True): Usage of local microphone for transcription. Set to False if you want to provide chunks with feed_audio method.

spinner (bool, default=True): Provides a spinner animation text with information about the current recorder state.

level (int, default=logging.WARNING): Logging level.

handle_buffer_overflow (bool, default=True): If set, the system will log a warning when an input overflow occurs during recording and remove the data from the buffer.

beam_size (int, default=5): The beam size to use for beam search decoding.

initial_prompt (str or iterable of int, default=None): Initial prompt to be fed to the transcription models.

suppress_tokens (list of int, default=[-1]): Tokens to be suppressed from the transcription output.

on_recorded_chunk : A callback function that is triggered when a chunk of audio is recorded. Submits the chunk data as parameter.

debug_mode (bool, default=False): If set, the system prints additional debug information to the console.

Real-time Transcription Parameters

Note : When enabling realtime transcription, a GPU installation is strongly advised. Using realtime transcription may create high GPU loads.

enable_realtime_transcription (bool, default=False): Enables or disables real-time transcription of audio. When set to True, the audio will be transcribed continuously as it is being recorded.

realtime_model_type (str, default="tiny"): Specifies the size or path of the machine learning model to be used for real-time transcription.

  • Valid options: 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2'.

realtime_processing_pause (float, default=0.2): Specifies the time interval in seconds after a chunk of audio gets transcribed. Lower values will result in more "real-time" (frequent) transcription updates but may increase computational load.

on_realtime_transcription_update : A callback function that is triggered whenever there's an update in the real-time transcription. The function is called with the newly transcribed text as its argument.

on_realtime_transcription_stabilized : A callback function that is triggered whenever there's an update in the real-time transcription; it is called with a higher-quality, stabilized text as its argument.

beam_size_realtime (int, default=3): The beam size to use for real-time transcription beam search decoding.

Voice Activation Parameters

silero_sensitivity (float, default=0.6): Sensitivity for Silero's voice activity detection ranging from 0 (least sensitive) to 1 (most sensitive). Default is 0.6.

silero_use_onnx (bool, default=False): Enables usage of the pre-trained model from Silero in the ONNX (Open Neural Network Exchange) format instead of the PyTorch format. Default is False. Recommended for faster performance.

webrtc_sensitivity (int, default=3): Sensitivity for the WebRTC Voice Activity Detection engine ranging from 0 (least aggressive / most sensitive) to 3 (most aggressive, least sensitive). Default is 3.

post_speech_silence_duration (float, default=0.2): Duration in seconds of silence that must follow speech before the recording is considered to be completed. This ensures that any brief pauses during speech don't prematurely end the recording.

min_gap_between_recordings (float, default=1.0): Specifies the minimum time interval in seconds that should exist between the end of one recording session and the beginning of another to prevent rapid consecutive recordings.

min_length_of_recording (float, default=1.0): Specifies the minimum duration in seconds that a recording session should last to ensure meaningful audio capture, preventing excessively short or fragmented recordings.

pre_recording_buffer_duration (float, default=0.2): The time span, in seconds, during which audio is buffered prior to formal recording. This helps counterbalance the latency inherent in speech activity detection, ensuring no initial audio is missed.

on_vad_detect_start : A callable function triggered when the system starts listening for voice activity.

on_vad_detect_stop : A callable function triggered when the system stops listening for voice activity.

Wake Word Parameters

wake_words (str, default=""): Wake words for initiating the recording. Multiple wake words can be provided as a comma-separated string. Supported wake words are: alexa, americano, blueberry, bumblebee, computer, grapefruits, grasshopper, hey google, hey siri, jarvis, ok google, picovoice, porcupine, terminator

wake_words_sensitivity (float, default=0.6): Sensitivity level for wake word detection (0 for least sensitive, 1 for most sensitive).

wake_word_activation_delay (float, default=0): Duration in seconds after the start of monitoring before the system switches to wake word activation if no voice is initially detected. If set to zero, the system uses wake word activation immediately.

wake_word_timeout (float, default=5): Duration in seconds after a wake word is recognized. If no subsequent voice activity is detected within this window, the system transitions back to an inactive state, awaiting the next wake word or voice activation.

on_wakeword_detected : A callable function triggered when a wake word is detected.

on_wakeword_timeout : A callable function triggered when the system goes back to an inactive state because no speech was detected after wake word activation.

on_wakeword_detection_start : A callable function triggered when the system starts listening for wake words.

on_wakeword_detection_end : A callable function triggered when the system stops listening for wake words (e.g. because of a timeout or a detected wake word).

Contribution

Contributions are always welcome!

Shoutout to Steven Linn for providing docker support.

Kolja Beigel · Email: [email protected] · GitHub


Using the Speech-to-Text API with Python

1. Overview

The Speech-to-Text API enables developers to convert audio to text in over 125 languages and variants, by applying powerful neural network models in an easy to use API.

In this tutorial, you will focus on using the Speech-to-Text API with Python.

What you'll learn

  • How to set up your environment
  • How to transcribe audio files in English
  • How to transcribe audio files with word timestamps
  • How to transcribe audio files in different languages

What you'll need

  • A Google Cloud project
  • A browser, such as Chrome or Firefox
  • Familiarity using Python

2. Setup and Requirements

Self-paced environment setup

  • Sign-in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one .


  • The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can always update it.
  • The Project ID is unique across all Google Cloud projects and is immutable (cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference your Project ID (typically identified as PROJECT_ID ). If you don't like the generated ID, you might generate another random one. Alternatively, you can try your own, and see if it's available. It can't be changed after this step and remains for the duration of the project.
  • For your information, there is a third value, a Project Number , which some APIs use. Learn more about all three of these values in the documentation .
  • Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab won't cost much, if anything at all. To shut down resources to avoid incurring billing beyond this tutorial, you can delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Cloud Shell , a command line environment running in the Cloud.

Activate Cloud Shell


If this is your first time starting Cloud Shell, you're presented with an intermediate screen describing what it is; click Continue to proceed.


It should only take a few moments to provision and connect to Cloud Shell.


This virtual machine is loaded with all the development tools needed. It offers a persistent 5 GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with a browser.

Once connected to Cloud Shell, you should see that you are authenticated and that the project is set to your project ID.

  • Run the following command in Cloud Shell to confirm that you are authenticated:
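The command in question (standard gcloud usage) is:

    gcloud auth list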


  • Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
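That command, again standard gcloud usage, is:

    gcloud config list project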

If it is not, you can set it with this command:
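With your own project ID substituted for the placeholder:

    gcloud config set project <PROJECT_ID>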

3. Environment setup

Before you can begin using the Speech-to-Text API, run the following command in Cloud Shell to enable the API:
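The enabling command (the API's service name is speech.googleapis.com) is:

    gcloud services enable speech.googleapis.com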

After a few moments, the operation completes. Now, you can use the Speech-to-Text API!

Navigate to your home directory:

Create a Python virtual environment to isolate the dependencies:

Activate the virtual environment:

Install IPython and the Speech-to-Text API client library:
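The commands for these four steps are presumably as follows (the virtual environment name venv-speech matches the cleanup step at the end of this codelab):

    cd ~
    virtualenv venv-speech
    source venv-speech/bin/activate
    pip install ipython google-cloud-speech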

Now, you're ready to use the Speech-to-Text API client library!

In the next steps, you'll use an interactive Python interpreter called IPython , which you installed in the previous step. Start a session by running ipython in Cloud Shell:

You're ready to make your first request...

4. Transcribe audio files

In this section, you will transcribe an English audio file.

Copy the following code into your IPython session:
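The original snippet isn't reproduced in this capture; a minimal equivalent using the google-cloud-speech client library is sketched below, assuming the Google-hosted sample gs://cloud-samples-data/speech/brooklyn_bridge.flac as the input:

    from google.cloud import speech

    def speech_to_text(config, audio):
        # Synchronous recognition: suitable for short audio clips
        client = speech.SpeechClient()
        response = client.recognize(config=config, audio=audio)
        for result in response.results:
            best_alternative = result.alternatives[0]
            print(f"Transcript: {best_alternative.transcript}")

    config = speech.RecognitionConfig(language_code="en-US")
    audio = speech.RecognitionAudio(
        uri="gs://cloud-samples-data/speech/brooklyn_bridge.flac"
    )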

Take a moment to study the code and see how it uses the recognize client library method to transcribe an audio file. The config parameter indicates how to process the request and the audio parameter specifies the audio data to be recognized.

Send a request:
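Using the sketch above:

    speech_to_text(config, audio)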

You should see the transcript of the sample audio as output.

Update the configuration to enable automatic punctuation and send a new request:
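Continuing with the sketch above, that would look like:

    config = speech.RecognitionConfig(
        language_code="en-US",
        enable_automatic_punctuation=True,
    )
    speech_to_text(config, audio)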

In this step, you were able to transcribe an audio file in English, using different parameters, and print out the result. You can read more about transcribing audio files .

5. Get word timestamps

Speech-to-Text can detect time offsets (timestamps) for the transcribed audio. Time offsets show the beginning and end of each spoken word in the supplied audio. A time offset value represents the amount of time that has elapsed from the beginning of the audio, in increments of 100ms.

To transcribe an audio file with word timestamps, update your code by copying the following into your IPython session:
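A sketch consistent with the earlier example (in recent client library versions, start_time and end_time are datetime.timedelta values, hence total_seconds()):

    config = speech.RecognitionConfig(
        language_code="en-US",
        enable_word_time_offsets=True,
    )

    client = speech.SpeechClient()
    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        best_alternative = result.alternatives[0]
        print(f"Transcript: {best_alternative.transcript}")
        for word in best_alternative.words:
            start = word.start_time.total_seconds()
            end = word.end_time.total_seconds()
            print(f"{start:7.3f}s - {end:7.3f}s : {word.word}")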

Take a moment to study the code and see how it transcribes an audio file with word timestamps. The enable_word_time_offsets parameter tells the API to return the time offsets for each word (see the doc for more details).

In this step, you were able to transcribe an audio file in English with word timestamps and print the result. Read more about getting word timestamps .

6. Transcribe different languages

The Speech-to-Text API recognizes more than 125 languages and variants! You can find a list of supported languages here .

In this section, you will transcribe a French audio file.

To transcribe the French audio file, update your code by copying the following into your IPython session:
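A sketch reusing the speech_to_text helper from earlier (the sample URI gs://cloud-samples-data/speech/corbeau_renard.flac is an assumption based on Google's public sample bucket; substitute any French recording):

    config = speech.RecognitionConfig(language_code="fr-FR")
    audio = speech.RecognitionAudio(
        uri="gs://cloud-samples-data/speech/corbeau_renard.flac"
    )
    speech_to_text(config, audio)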

In this step, you were able to transcribe a French audio file and print the result. You can read more about the supported languages .

7. Congratulations!

You learned how to use the Speech-to-Text API using Python to perform different kinds of transcription on audio files!

To clean up your development environment, from Cloud Shell:

  • If you're still in your IPython session, go back to the shell: exit
  • Stop using the Python virtual environment: deactivate
  • Delete your virtual environment folder: cd ~ ; rm -rf ./venv-speech

To delete your Google Cloud project, from Cloud Shell:

  • Retrieve your current project ID: PROJECT_ID=$(gcloud config get-value core/project)
  • Make sure this is the project you want to delete: echo $PROJECT_ID
  • Delete the project: gcloud projects delete $PROJECT_ID
Learn more:

  • Test the demo in your browser: https://cloud.google.com/speech-to-text
  • Speech-to-Text documentation: https://cloud.google.com/speech-to-text/docs
  • Python on Google Cloud: https://cloud.google.com/python
  • Cloud Client Libraries for Python: https://github.com/googleapis/google-cloud-python

This work is licensed under a Creative Commons Attribution 2.0 Generic License.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License , and code samples are licensed under the Apache 2.0 License . For details, see the Google Developers Site Policies . Java is a registered trademark of Oracle and/or its affiliates.
