
Translating Text to Speech


Translating Language with Azure Speech Service

In this course, you’ll learn about the Azure Speech and Translator services. You’ll learn about the key features and what’s possible with each of these services. You’ll find out how these services can be used with SDKs, REST APIs, and the command-line interface. You’ll see demos of the Translator, Speech-to-text, and Text-to-speech services in action and understand how to use them.

Learning Objectives

  • Understand the main capabilities of the Azure Speech service
  • Learn available options for using the Azure Speech service
  • Translate text using the Translator service
  • Translate speech to text
  • Translate text to speech

Intended Audience

This course is intended for developers or software architects who want to learn more about the Azure Speech and Translator services.


Prerequisites

Intermediate knowledge of C# and coding techniques is required for this course. You’ll also need to be familiar with Azure concepts such as App Service and resources. An understanding of APIs is also required. We’ll be writing a little code and will be using Visual Studio Community Edition, so you’ll need a copy of that too.


Let's look at translating text to speech using the Azure Speech service. In this lecture, we're going to take a closer look at the Azure text to speech service. We'll look at some of the key features the service provides, we'll explore some of the use cases for text to speech, and finally we'll see a demo of text to speech in action. So, what is the Azure text to speech service? Text to speech is a feature of the Azure Speech service. It can be used to automatically convert text to human-like speech. This process is known as speech synthesis. At the time of this course, 270 voices are available with the text to speech service. Text to speech makes it simple for you to automatically generate synthetic speech using text as an input source. To use the text to speech API, you need an Azure subscription. You also need an Azure Speech resource configured in the Azure portal. A set of API keys is also needed.

The two voice types available to you when using the text to speech service are prebuilt neural voices and custom neural voices. Prebuilt neural voices are the default, out-of-the-box voices that ship with the text to speech service. Custom neural voices are voices that sound more natural. These offer more flexibility and are the preferred option if you want to fine-tune a synthetic voice or create something unique to your business or brand. There are three main speech synthesis options. These are: sending speech to a file, sending speech to an output device such as a speaker, and sending speech to an in-memory stream. Opting to send speech to a file will result in a synthesized WAV file being written to the location that you specify. You may want to send synthesized speech directly to a speaker. This is one of the simplest ways to use the text to speech service and can be done in just three lines of code.

Quite often, you'll want to manipulate the data in memory rather than working with data in the file system. This is where the in-memory stream option is useful. Audio located in memory streams is held in a byte array. This makes it easier to integrate audio data generated by the text to speech service in your software application code. After configuring the Speech resource in Azure and obtaining your API key, you can consume the text to speech service in one of three ways: you can use a dedicated Speech SDK, you can use the Speech CLI or command-line interface, or you can invoke the API by using REST requests. The option that you select depends entirely on your business requirements or use case. You can use the text to speech service with the SDK in a language of your choice. This can include C#, Go, C++, JavaScript, or Python. Coding the text to speech feature via the Speech SDK involves three main steps. The first step is to create the SpeechConfig object, passing in your Azure subscription key and Azure region. The second step involves selecting the language and type of voice that you want any text to be spoken in. The third step involves creating an instance of the SpeechSynthesizer, passing in the SpeechConfig settings. The SpeechSynthesizer object can then be used to send speech directly to a file, an output device such as a speaker, or an in-memory stream. You can also use the text to speech feature using the CLI. The CLI lets you quickly test the text to speech feature without writing any code.
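The three SDK steps described above can be sketched in Python, one of the SDK languages mentioned. This is a minimal sketch, not the course's own C# code: it assumes the azure-cognitiveservices-speech package is installed, and the helper names, the default file name, and the en-US-JennyNeural voice are illustrative choices rather than anything taken from the lecture.

```python
def audio_output_kwargs(target, filename="speech.wav"):
    """Map a synthesis target to AudioOutputConfig keyword arguments.

    Returns None for "memory": with no audio config, the audio is not
    played or saved and is read from the result object instead.
    (Hypothetical helper, named for this sketch only.)
    """
    if target == "speaker":
        return {"use_default_speaker": True}
    if target == "file":
        return {"filename": filename}
    if target == "memory":
        return None
    raise ValueError(f"unknown target: {target!r}")


def synthesize(text, key, region, language="en-US",
               voice="en-US-JennyNeural", target="speaker",
               filename="speech.wav"):
    """Speak `text` to a speaker, a WAV file, or memory; return the audio bytes."""
    # Imported lazily so the helper above works without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    # Step 1: build the speech configuration from the key and region.
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)

    # Step 2: select the language and voice (the voice name is an assumption).
    speech_config.speech_synthesis_language = language
    speech_config.speech_synthesis_voice_name = voice

    # Step 3: create the synthesizer, choosing where the audio goes.
    kwargs = audio_output_kwargs(target, filename)
    audio_config = (
        speechsdk.audio.AudioOutputConfig(**kwargs) if kwargs is not None else None
    )
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config, audio_config=audio_config
    )

    result = synthesizer.speak_text_async(text).get()
    # The synthesized audio is also available in memory as a byte array.
    return result.audio_data
```

Calling synthesize("Hello", key, region, target="memory") would return the audio as bytes without playing it, which is the in-memory option discussed above.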

The first thing you must do is set your API key and Azure region. Using the spx synthesize command, you set the text and voice parameters. These define the text you want spoken and the voice that you want to use. A third parameter lets you set where you want the synthesized speech to be sent. This can be set to the speakers or an output file. One thing to note is that when performing text to speech with the CLI, you can't output synthesized speech to a memory stream. The REST API lets you perform a text to speech conversion using HTTP REST requests. Use this if the dedicated SDKs aren't an option for you. To use the REST API, you must complete a token exchange during authentication. This can be done by sending a request to the Azure token endpoint and supplying your API key. After successfully authenticating, you can use the v1 text to speech endpoint. Consuming the endpoint and performing text to speech synthesis involves supplying data using Speech Synthesis Markup Language, or SSML.
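The REST flow described above, a token exchange followed by a call to the v1 endpoint with an SSML body, might look roughly like the following in Python using only the standard library. Treat it as a sketch under assumptions: the function names, the User-Agent value, and the chosen output format are illustrative, and the endpoint host names follow the regional pattern used by the Speech service, so check them against your own resource.

```python
import urllib.request


def token_request(key, region):
    """Build the POST request that exchanges an API key for an access token."""
    url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
    return urllib.request.Request(
        url, data=b"", method="POST",
        headers={"Ocp-Apim-Subscription-Key": key},
    )


def synthesis_request(token, region, ssml,
                      output_format="riff-24khz-16bit-mono-pcm"):
    """Build the POST request that sends SSML to the v1 text to speech endpoint."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": output_format,
            "User-Agent": "tts-rest-sketch",  # placeholder application name
        },
    )


def synthesize_via_rest(key, region, ssml):
    """Run both requests and return the synthesized audio as bytes."""
    with urllib.request.urlopen(token_request(key, region)) as resp:
        token = resp.read().decode("utf-8")
    with urllib.request.urlopen(synthesis_request(token, region, ssml)) as resp:
        return resp.read()
```

Building the requests separately from sending them keeps the header and URL logic easy to inspect before any network call is made.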

Let's take a closer look at the key features you can find in the text to speech service. The text to speech service contains many features to help you create innovative solutions. If you need to perform text to speech conversion against large bodies of text, you can use the Long Audio API. For example, you may need to create an audio version of a book that you've just written. Synthetic speech can be fine-tuned using Speech Synthesis Markup Language, or SSML. In addition to prebuilt voices, text to speech also lets you create and fine-tune custom neural voices. Use these to help you implement voice styles that match your brand or business. To implement a custom neural voice, you need to supply a collection of audio files and associated transcripts. Find out more about these features at the URL at the bottom of this slide.

When working with the Long Audio API, you can supply your content in plain text or SSML format. Any content you supply to the Long Audio API must be encoded as UTF-8 and sent in a single file. The content cannot be stored in a compressed zip file. One thing to be aware of is that each text file must contain more than 400 characters and fewer than 10,000 paragraphs.
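The input rules above lend themselves to a quick local check before anything is uploaded. The following Python sketch is only an illustration: the function name is hypothetical, and it assumes that a paragraph is a non-empty line of text, which the lecture does not spell out.

```python
MIN_CHARS = 400          # content must contain more than this many characters
MAX_PARAGRAPHS = 10_000  # content must contain fewer than this many paragraphs

# Leading bytes of a zip archive (regular and empty archives).
ZIP_SIGNATURES = (b"PK\x03\x04", b"PK\x05\x06")


def check_long_audio_input(raw):
    """Return a list of problems found in `raw` (bytes); empty means it looks fine."""
    if raw.startswith(ZIP_SIGNATURES):
        return ["content is a zip archive; send a single uncompressed file"]
    try:
        # "utf-8-sig" also accepts UTF-8 that starts with a byte order mark.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return ["content is not valid UTF-8"]

    problems = []
    if len(text) <= MIN_CHARS:
        problems.append(f"content must contain more than {MIN_CHARS} characters")

    # Assumption: one paragraph per non-empty line.
    paragraphs = [line for line in text.splitlines() if line.strip()]
    if len(paragraphs) >= MAX_PARAGRAPHS:
        problems.append(f"content must contain fewer than {MAX_PARAGRAPHS} paragraphs")
    return problems
```

A caller would read the file in binary mode and pass the bytes in, for example check_long_audio_input(open(path, "rb").read()).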

Speech Synthesis Markup Language is an XML-based markup language that lets you specify how input text is converted to synthesized speech. This is commonly referred to as SSML. The two mandatory nodes for an SSML document are the speak node and the voice node. We can see an example of an SSML document in the green box here. In this example, we can see the voice name has been defined as JaneNeural, and we can also see the text that is to be spoken. Passing this SSML document to the text to speech service will result in the machine creating synthesized speech using the text contained within the voice node. SSML can also be used to fine-tune synthesized speech created by the text to speech service. Some examples of fine-tuning include, but are not limited to: adjusting the style of the language, for example, specifying a cheerful, empathetic, or angry tonality; adding pauses to give people more time to comprehend what has just been said by the machine; and increasing speed at certain points of a conversation. Fine-tuning SSML makes it possible for you to create more personalized interactions. Custom neural voices are created by providing your own audio samples as training data. These can then be used to create more natural-sounding custom neural voices.
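To make the SSML structure described above concrete, here is a small Python sketch that assembles a document with the two mandatory nodes and, optionally, a speaking style, a rate change, and a trailing pause. The function name and parameters are hypothetical, and the full voice name en-US-JaneNeural is an assumption based on the JaneNeural voice mentioned on the slide. Which styles are available depends on the voice you pick.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice, lang="en-US", style=None, rate=None, pause_ms=None):
    """Return an SSML string with a speak node wrapping a single voice node."""
    body = escape(text)  # protect &, < and > in the spoken text
    if rate is not None:
        # Change the speaking rate, e.g. "+10%".
        body = f"<prosody rate={quoteattr(rate)}>{body}</prosody>"
    if style is not None:
        # Speaking style, e.g. "cheerful" (an Azure-specific SSML extension).
        body = f"<mstts:express-as style={quoteattr(style)}>{body}</mstts:express-as>"
    if pause_ms is not None:
        # Pause after the text so listeners have time to take it in.
        body += f'<break time="{int(pause_ms)}ms"/>'
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


ssml = build_ssml("Welcome to lecture four.", "en-US-JaneNeural",
                  style="cheerful", pause_ms=500)
ET.fromstring(ssml)  # raises ParseError if the document is not well-formed XML
print(ssml)
```

The resulting string is the kind of payload that can be handed to the text to speech service in place of plain text.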

This can be useful if you want to use a specific voice that's unique to your company or for brand positioning. Custom neural voices can be created, maintained, and deployed using Azure Speech Studio. The process is similar to using speech to text for custom speech models and involves a number of steps. After creating a custom voice project using the Azure Speech Studio web portal, the first step involves collecting and uploading your training data. Training data includes audio recordings and a text file with associated transcriptions. It's important to point out that each audio file should contain a single utterance or sentence and must be less than 15 seconds long. After sourcing your training data, it can then be used to train the custom neural model. Training data for your custom voice must contain at least 300 utterances. You also need to specify a voice talent profile. The voice talent must provide consent for their audio to be used as part of the custom neural voice training process. The custom neural voice model can then be deployed. After it has been deployed, the custom neural voice model is accessible from a neural voice endpoint. This endpoint can then be used programmatically by other software applications or services to provide text to speech conversions using your custom neural voice.

Text to speech can be used in a variety of use cases. You can use text to speech to give people different ways to consume educational content. For example, build capabilities into software applications that let students and learners play back lecture notes. Use text to speech to provide consumers with different ways of interacting with their data. For example, automatically reading bank statements on cellphones. You can also use text to speech to help you implement applications that help people consume content on the go. Another area where text to speech can be leveraged is accessibility. Use text to speech to help increase the accessibility of software applications and products. This can be useful for people with visual impairments or disabilities. For example, implement features that give people the option of having the written words spoken to them. You may also choose to use text to speech to optimize the scalability of your software products or services. For example, use text to speech to help you repurpose content such as eBooks, web pages, or transcriptions as podcasts. Doing this can help you widen the reach of your content and help you amplify your brand, messaging, or product offering. Text to speech helps your applications and devices convert text to human-like synthesized speech. It's now time for a demo.

In this demo, we'll see text to speech in action. So, we're back in Visual Studio, and in this demo what we will do is create a console application that lets us supply some text. That text will then be spoken using the Speech SDK. So, the first thing that we can see here is we have two variables: one to store the speech API key and another to store the region that our Azure Speech service has been provisioned in. The next thing that we have to do is to add a reference to the Speech SDK via NuGet. So, we'll do that just now. And this is the SDK. So, with the SDK installed, we can now import that into our console application. So, we've imported the Speech SDK. So, we've got an entry point for this console application down here at line 13, so we can expand this. And what we'll see is some code that I have previously written. Now, we can uncomment this code. And we can see that we've got three main steps.

So, the first step is to get the API key, the Azure region, and the language and voice that we want our console application to use to generate the synthesized speech, and that's what we can see here. When we've got that information, we then create an instance of the SpeechSynthesizer object, passing in the speech config object that was previously configured. Next, we prompt the user for some text. And then at step three, we output that input text to the speaker. So, we can run this application. And here we can supply some text. So, we will say, "Welcome to lecture four." When we press 'Return', this text will be output to the laptop. "Welcome to lecture four." And the console application ends. So, in this demo, what we have seen is how to create a console application. We've also seen how to use the Azure Speech service text to speech feature.
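The demo itself is a C# console application whose code is not reproduced in this transcript. As a rough Python analogue of the same flow, the sketch below reads the key and region, prompts for text, speaks it through the default speaker, and reports what happened. It assumes the azure-cognitiveservices-speech package is installed. The SPEECH_KEY and SPEECH_REGION environment variable names, the function names, and the en-US-JennyNeural voice are assumptions made for this example, whereas the demo stores the key and region in two variables in code.

```python
import os


def read_settings(env=os.environ):
    """Return (key, region) from the environment (variable names are assumed)."""
    key = env.get("SPEECH_KEY")
    region = env.get("SPEECH_REGION")
    if not key or not region:
        raise RuntimeError("set SPEECH_KEY and SPEECH_REGION first")
    return key, region


def main():
    # Imported lazily so read_settings can be used without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    # Step 1: key, region, language and voice.
    key, region = read_settings()
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_synthesis_language = "en-US"
    speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"  # assumed voice

    # Step 2: create the synthesizer; with no audio config given,
    # output goes to the default speaker.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

    # Step 3: prompt for text and speak it.
    text = input("Enter some text to speak: ")
    result = synthesizer.speak_text_async(text).get()

    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        print("Speech synthesized.")
    elif result.reason == speechsdk.ResultReason.Canceled:
        details = result.cancellation_details
        print(f"Synthesis canceled: {details.reason}")
        if details.reason == speechsdk.CancellationReason.Error:
            print(details.error_details)
```

Calling main() runs the prompt once and then returns, much like the console application in the demo ends after speaking the text.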

About the Author

Jamie Maguire is a Software Architect, Developer, Microsoft MVP (AI), and lifelong tech enthusiast with over 20 years of professional experience.

Jamie is passionate about using AI technologies to help advance systems in a wide range of organizations. 

He has collaborated on many projects including working with Twitter, National Geographic, and the University of Michigan. Jamie is a keen contributor to the technology community and has gained global recognition for articles he has written and software he has built. 

He is a STEM Ambassador and Code Club volunteer, inspiring interest at grassroots level. Jamie shares his story and expertise at speaking events, on social media, and through podcast interviews. 

He has co-authored a book with 16 fellow MVPs demonstrating how Microsoft AI can be used in the real world and regularly publishes material to encourage and promote the use of AI and .NET technologies.