
Translating Speech to Text


Translating Language with Azure Speech Service

In this course, you’ll learn about the Azure Speech and Translator services. You’ll learn about the key features and what’s possible with each of these services. You’ll find out how these services can be used with SDKs, REST APIs, and the command-line interface. You’ll see demos of the Translator, Speech-to-text, and Text-to-speech services in action and understand how to use them.

Learning Objectives

  • Understand the main capabilities of the Azure Speech service
  • Learn available options for using the Azure Speech service
  • Translate text using the Translator service
  • Translate speech to text
  • Translate text to speech

Intended Audience

This course is intended for developers or software architects who want to learn more about the Azure Speech and Translator services.


Prerequisites

Intermediate knowledge of C# and coding techniques is required for this course. You’ll also need to be familiar with Azure concepts such as App Service and resources. An understanding of APIs is also required. We’ll be writing a little code and will be using Visual Studio Community Edition, so you’ll need a copy of that too.


Let's look at translating speech to text using the Azure Speech service. In this lecture, we'll take a closer look at the Azure speech-to-text service. We'll look at some of the key features the service provides. We'll explore some of the use cases for speech-to-text and finally, we'll see a demo of speech-to-text in action. So, what is the Azure speech-to-text service? Speech-to-text is a feature that belongs to the Azure Speech service. It can be used to automatically create high-quality transcriptions from audio clips.

Transcribed text can then be used as commands in your software projects or tools, for example, issuing a search to a smart device. To consume the speech-to-text feature, you need an Azure subscription, a dedicated Azure Speech resource, and an API key. There are three input source channels that the speech-to-text feature can listen to for audio. You can take the audio directly from a microphone, you can load audio from a storage container in Azure, or you can load audio stored in a memory stream, for example, in a byte array. The option you select will depend entirely on your use case or business requirements.

There are two recognition options when performing a speech-to-text conversion. These are known as single-shot recognition and continuous recognition. Single-shot recognition is typically used to perform a speech-to-text conversion against a single utterance. The end of a single utterance is identified by listening for silence at the end of the audio. The end of a single utterance is also reached when a maximum of 15 seconds of audio has been processed. Continuous recognition is an event-based recognition mechanism. Using this approach gives you more control in terms of starting and stopping the recognition process. I've personally used continuous recognition in the past to intercept and parse spoken audio from telephone calls being made on Twilio.

Using continuous recognition requires you to subscribe to the recognizing, recognized, and canceled events. The recognizing event signals that initial recognition results are available. The recognized event signals that a successful recognition attempt has occurred. The recognition results are also included as an object. The canceled event is raised when a recognition attempt was canceled or if an error has occurred. After configuring the speech resource in Azure and obtaining your API keys, you can consume the speech-to-text feature in one of three ways: you can use a dedicated Speech SDK, you can use the Speech CLI, or you can invoke the API by using REST requests.
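The event subscriptions described above can be sketched in Python, one of the languages the Speech SDK supports. This is a minimal sketch, not the code from the lecture: the function names attach_handlers and transcribe_for are made up for illustration, and it assumes the azure-cognitiveservices-speech package is installed and that you supply your own key and region.

```python
import time


def attach_handlers(recognizer, transcript):
    """Subscribe to the recognizing, recognized, and canceled events.

    `recognizer` is expected to be an SDK SpeechRecognizer. Final texts are
    appended to the `transcript` list. (Function name is illustrative.)
    """
    # recognizing: partial results while the speaker is still talking
    recognizer.recognizing.connect(lambda evt: print("partial:", evt.result.text))
    # recognized: a final result for one utterance, so keep its text
    recognizer.recognized.connect(lambda evt: transcript.append(evt.result.text))
    # canceled: the attempt was canceled or an error occurred
    recognizer.canceled.connect(lambda evt: print("canceled:", evt))


def transcribe_for(seconds, key, region):
    """Listen on the default microphone for `seconds`, return the final texts."""
    # Imported here so the helper above can be used without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    # With no audio config, the recognizer uses the default microphone.
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
    transcript = []
    attach_handlers(recognizer, transcript)

    recognizer.start_continuous_recognition()  # you decide when to start...
    time.sleep(seconds)
    recognizer.stop_continuous_recognition()   # ...and when to stop
    return transcript
```

Calling transcribe_for(10, "YOUR-KEY", "YOUR-REGION") would listen for ten seconds and return whatever utterances were recognized in that window. The explicit start and stop calls are what give continuous recognition the extra control mentioned above.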

The option you select depends entirely on your business requirements or use case. You can use the speech-to-text feature in the SDK using a language of your choice, such as C#, Go, C++, JavaScript, or Python. Calling the speech-to-text feature via the SDK involves three main steps. The first step is to create an instance of the SpeechRecognizer, passing in the SpeechConfig and AudioConfig settings. To perform a single capture, call the RecognizeOnceAsync method.

After a few moments, the speech-to-text service will return the SpeechRecognitionResult object. The recognition result object contains the speech-to-text conversion. You can also consume the speech-to-text feature by using the CLI. This is sometimes referred to as spx. The CLI lets you quickly test out the speech-to-text feature without having to write any code. The first thing you must do is set your API key and Azure region. Calling spx recognize and passing the microphone parameter will perform a speech-to-text recognition using your system default microphone. Calling spx recognize and setting the file parameter performs speech-to-text from an audio file. That's what we can see here.

One of the main reasons to use the REST API for speech-to-text conversions is to implement batch transcriptions for large amounts of audio in storage. Some of the available REST API methods include, but are not limited to: creating new transcriptions, fetching a list of transcriptions for the authenticated Azure subscription, deleting a specified transcription task, or fetching a transcription using a given ID. Your speech resource in Azure must be configured to use the standard subscription, or S0. Free subscription keys such as F0 will not work. You can find out more information about this on the pricing page in the Azure portal.
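To make the REST option more concrete, here is a hedged sketch in Python that only builds the HTTP requests for creating a transcription and for fetching or deleting one by ID. It does not send anything. The function names are illustrative, and the v3.1 path segment is an assumption: the API version available to you may differ, so check the current REST reference before relying on it.

```python
import json
import urllib.request


def _base_url(region, api_version):
    # Assumed URL shape for the batch transcription REST API.
    return f"https://{region}.api.cognitive.microsoft.com/speechtotext/{api_version}/transcriptions"


def build_create_request(key, region, audio_urls, locale="en-US",
                         display_name="Batch transcription", api_version="v3.1"):
    """Build (but do not send) a request that creates a new transcription."""
    body = {
        "contentUrls": list(audio_urls),  # audio file URLs, e.g. with a SAS token
        "locale": locale,                 # language the audio is spoken in
        "displayName": display_name,      # label shown when listing transcriptions
    }
    return urllib.request.Request(
        _base_url(region, api_version),
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,  # key of an S0 speech resource
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_by_id_request(key, region, transcription_id, method="GET", api_version="v3.1"):
    """Build a request that fetches (GET) or deletes (DELETE) one transcription."""
    return urllib.request.Request(
        f"{_base_url(region, api_version)}/{transcription_id}",
        headers={"Ocp-Apim-Subscription-Key": key},
        method=method,
    )
```

Passing the returned object to urllib.request.urlopen would submit it. Because batch transcription runs asynchronously, you would then fetch the transcription by its ID until the service reports that it has finished.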

Let's take a closer look at the key features of speech-to-text. The main capabilities of the speech-to-text feature within the Azure Speech service include: real-time speech-to-text conversion, batch speech-to-text conversion, and finally custom speech conversion. We'll explore each of these now. When performing speech-to-text conversion against real-time audio sources, the main audio sources you can use are a microphone, audio files, or data stored in memory such as an array of bytes. The main components in use when performing real-time speech-to-text translation include SpeechConfig, SourceLanguageConfig, and AudioConfig.

The SpeechConfig object accepts the Azure subscription key and Azure service region as parameters in the constructor. This object can also be used to let you set what is known as the SourceLanguageConfig. The source language lets the speech-to-text service know which language to expect. The AudioConfig object is used to define the audio input method that you wish to use. The audio input method can include: default microphone input, memory stream input, or wave file input.
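As a rough illustration of those two objects, the Python sketch below picks one of the three audio input methods and sets the expected language on the speech config. The helper names are made up for this example, and it assumes the azure-cognitiveservices-speech package. For stream input, the SDK expects one of its own audio input stream objects rather than a raw byte array.

```python
def audio_config_kwargs(wav_path=None, stream=None):
    """Choose AudioConfig keyword arguments for one of the three input methods."""
    if wav_path is not None:
        return {"filename": wav_path}          # wave file input
    if stream is not None:
        return {"stream": stream}              # memory stream input (SDK stream object)
    return {"use_default_microphone": True}    # default microphone input


def build_configs(key, region, language="en-US", wav_path=None, stream=None):
    """Create the SpeechConfig and AudioConfig objects described above."""
    # Imported here so audio_config_kwargs can be used without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    # Subscription key and service region go into the constructor.
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    # Tell the service which language to expect.
    speech_config.speech_recognition_language = language
    # Define where the audio comes from.
    audio_config = speechsdk.audio.AudioConfig(**audio_config_kwargs(wav_path, stream))
    return speech_config, audio_config
```

For example, build_configs("YOUR-KEY", "YOUR-REGION", wav_path="sample.wav") would prepare both objects for recognizing speech from a local WAV file, where sample.wav is a placeholder file name.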

After defining the SpeechConfig and selecting the AudioConfig, each of these objects can be used to create an instance of the SpeechRecognizer. The SpeechRecognizer accepts the SpeechConfig and AudioConfig objects as constructor parameters. The SpeechRecognizer includes methods that activate the speech-to-text conversion. If you have large amounts of audio in storage, you can use the batch transcription REST APIs. The REST APIs can be pointed to audio files stored at a URL with a shared access signature. Support for formats such as WAV, MP3, and OGG is available.

Batch transcription can be used asynchronously to receive transcription results. This makes it simple for you to perform speech-to-text translation at scale. You also have the option of creating a custom speech model. A custom speech model can be used to augment the Azure speech-to-text base model. The custom speech model makes it possible for you to improve the recognition of industry- or domain-specific vocabulary.

There are six main steps involved if you wish to use a custom speech model. The first step is to create a custom speech project using the Azure Speech Studio web portal. The second step involves gathering and uploading your test data. The third step is to use Azure Speech Studio to test the recognition quality of your sample data. During this phase, you'll find out if you need to gather additional data to help improve the speech-to-text prediction results. Step four involves providing written transcripts, text, and related audio data. You can also choose to further evaluate your custom model during this phase. After you're happy with the results of your custom speech model, it can then be published to a custom endpoint. The custom endpoint can then be used programmatically by other software applications or services to perform speech-to-text conversion using your custom model.

One of the most obvious applications of the speech-to-text service is for transcription. The speech-to-text service can be used to augment human translator capacity. This can help you drive increased business efficiencies. Use the generated text to help you create automated reporting. For example, aggregate data from transcribed call center recordings, then apply text analytics. This can help you identify popular topics being discussed during phone calls. Accessibility is another use case where speech-to-text can be leveraged. Use speech-to-text to help you increase accessibility for people with specific needs, for example, those that have sustained injuries or have physical impairments.

Use speech-to-text to help support people with dyslexia by implementing dictation features in your software application. Natural language understanding is a key component in conversational AI and chatbot development. This is normally performed using LUIS in the Microsoft cloud. Pair speech-to-text with LUIS to help your application understand what is really being said on voice channels. For example, take inbound audio data and use the output text to surface the main specific commands for using your application. Take this further and augment LUIS's understanding by using a custom speech model.

Speech-to-text lets you perform real-time or offline transcription of audio to text. Use the transcriptions in your software applications, tools, or devices as commands. Use it to introduce business efficiencies or to build more inclusive software products. It's now time for a demo. In this demo, we'll see speech-to-text in action. So, we are back in Visual Studio, and what we're looking at here is another console application. In this console application, what we're going to do is write some code that uses the Speech SDK to interpret speech and output text.

The first thing that we have to do is add a reference to the Speech SDK, and we can do that just by right-clicking on the project and adding the NuGet reference. So, we'll do that just now. We'll search for it, and this is the SDK that we're looking for here. So, we'll go ahead and install that. With the SDK installed, we can now start to write some code that uses it. So, we'll close the NuGet package window and just minimize the Error List pane.

So, what we can see here is we have two variables: one to store the speech API key and another to store the location that our speech resource in Azure resides in. If we expand the main method for this console application, we can see here that we already have some existing code. So, in this first section here, we set up the speechConfig object, passing in the speech key and the location. We also set the speech recognition language, and that tells the service what language to expect.

In section two here, what we do is set up the audioConfig object, and what this does is it tells the speech service where to accept input from. So, here we are using the default microphone on my laptop. With the speechConfig and audioConfig objects instantiated, we can then create an instance of the SpeechRecognizer, and as we've seen earlier in the slides, this object accepts the speechConfig object and audioConfig object as part of the constructor. We create a prompt to give us a nudge that we can speak into the microphone. And then in section four here, we call the speechRecognizer object's RecognizeOnceAsync method. So, those are the four main steps that are involved at a high level. When the recognizer has done its thing, the output of that parsing from speech-to-text is stored in the variable SpeechRecognitionResult.
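The demo itself is a C# console application whose source is not reproduced here. As a rough Python counterpart of the four sections just described, a sketch might look like the following. The function name and the optional sdk parameter are assumptions made for illustration, and the sketch relies on the azure-cognitiveservices-speech package being installed.

```python
def recognize_once(key, region, language="en-US", sdk=None):
    """Run one single-shot recognition from the default microphone."""
    if sdk is None:
        # Imported lazily; pass `sdk` explicitly to substitute a stand-in module.
        import azure.cognitiveservices.speech as sdk

    # Section 1: speech config from key and location, plus the expected language.
    speech_config = sdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_recognition_language = language

    # Section 2: audio config that takes input from the default microphone.
    audio_config = sdk.audio.AudioConfig(use_default_microphone=True)

    # Section 3: recognizer built from both config objects.
    recognizer = sdk.SpeechRecognizer(speech_config=speech_config,
                                      audio_config=audio_config)

    # Section 4: prompt, then wait until a single utterance has been processed.
    print("Speak into your microphone.")
    return recognizer.recognize_once_async().get()
```

The returned object plays the role of the SpeechRecognitionResult in the demo: it carries the recognized text along with the reason the recognition ended.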

Now, we also have another private method that handles this. We'll un-comment this code, and what we can see here is we have a method called OutputRecognitionResultDetails. This accepts the SpeechRecognitionResult object as a parameter, and one of the properties of this object is the reason. Now that's an enum, and we can see here that we have a switch statement to handle this. So, the first part of the switch statement is a successful parsing of speech-to-text. When that happens, we will output the recognized text. If we don't get a match between lines 41 and 43, we will output that into the console and say that speech could not be recognized.

If for one reason or another the parsing of speech-to-text is canceled, we output the cancelation details along with a reason, error code, and error details. We'll reinstate this method. And what we can now do is we can run the application. When we run this console application, we're going to get a prompt and we can speak into the microphone, and what we'll see is the speech being parsed to text. We'll do that just now. Speaking to your microphone. So, we can see here that the enum value Recognized is being picked in that switch statement. We can just pull the console application over, and we can see here that the RecognitionResult object contains the property text, and that text is the text that we just spoke. So, that was it. In this demo, we've seen a console application that uses the Azure Speech SDK and speech-to-text API.
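The switch on the result's reason that the demo walks through could be approximated in Python as below. This is a sketch under assumptions, not the demo's OutputRecognitionResultDetails method: the function name, the message wording, and the optional sdk parameter are illustrative, and it relies on the azure-cognitiveservices-speech package.

```python
def describe_result(result, sdk=None):
    """Turn a recognition result into a one-line message based on its reason."""
    if sdk is None:
        # Imported lazily; pass `sdk` explicitly to substitute a stand-in module.
        import azure.cognitiveservices.speech as sdk

    # Successful conversion: output the recognized text.
    if result.reason == sdk.ResultReason.RecognizedSpeech:
        return f"RECOGNIZED: {result.text}"

    # Audio was received but nothing in it could be matched to speech.
    if result.reason == sdk.ResultReason.NoMatch:
        return "NOMATCH: speech could not be recognized."

    # The attempt was canceled, possibly because of an error such as a bad key.
    if result.reason == sdk.ResultReason.Canceled:
        details = result.cancellation_details
        message = f"CANCELED: reason={details.reason}"
        if details.reason == sdk.CancellationReason.Error:
            message += f", details={details.error_details}"
        return message

    return f"UNHANDLED: {result.reason}"
```

Printing describe_result(result) for the object returned by a single-shot recognition would give console output similar in spirit to what the demo shows.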


About the Author

Jamie Maguire is a Software Architect, Developer, Microsoft MVP (AI), and lifelong tech enthusiast with over 20 years of professional experience.

Jamie is passionate about using AI technologies to help advance systems in a wide range of organizations. 

He has collaborated on many projects including working with Twitter, National Geographic, and the University of Michigan. Jamie is a keen contributor to the technology community and has gained global recognition for articles he has written and software he has built. 

He is a STEM Ambassador and Code Club volunteer, inspiring interest at grassroots level. Jamie shares his story and expertise at speaking events, on social media, and through podcast interviews. 

He has co-authored a book with 16 fellow MVPs demonstrating how Microsoft AI can be used in the real world and regularly publishes material to encourage and promote the use of AI and .NET technologies.