Overview of the Azure Speech Service

In this course, you’ll learn about the Azure Speech and Translator services. You’ll learn about the key features and what’s possible with each of these services. You’ll find out how these services can be used with SDKs, REST APIs, and the command-line interface. You’ll see demos of the Translator, Speech-to-text, and Text-to-speech services in action and understand how to use them.

Learning Objectives

  • Understand the main capabilities of the Azure Speech service
  • Learn available options for using the Azure Speech service
  • Translate text using the Translator service
  • Convert speech to text
  • Convert text to speech

Intended Audience

This course is intended for developers or software architects who want to learn more about the Azure Speech and Translator services.

Prerequisites

Intermediate knowledge of C# and coding techniques is required for this course. You’ll also need to be familiar with Azure concepts such as App Service and resources. An understanding of APIs is also required. We’ll be writing a little code and will be using Visual Studio Community Edition, so you’ll need a copy of that too.


Let's look at an overview of the Azure Speech Service. In this lecture, we're going to introduce what the Azure Speech Service is. We'll look at some of the key features the Azure speech service provides. We'll see some of the things that you can do with the Azure speech service. So, what is the Azure speech service? The Azure speech service brings together a common set of text, speech, and translation services that are available in Microsoft Azure. At a high level, these capabilities include speech to text, text to speech, and speech translation.

The speech to text service contains functionality that lets you transcribe and translate audio. Speech to text can work with audio streams. Alternatively, you can use it to process speech in files on the local machine. Speech to text makes it simple for you to create conversational transcripts from audio. You can then extract the text and perform additional processing or use it in other tools, devices, or software you may have built. For example, you may choose to combine speech to text with the NLP service LUIS to help identify customer intent.

Text to speech converts any input text you supply to near human-like synthesized speech. It does this by using the speech synthesis markup language or SSML. SSML lets you programmatically adjust the speaking style of any speech generated by the service. For example, you may decide to set the style to be empathetic for customer service reasons. Alternatively, you may be creating an automated breaking news service and decide to use a more professional tone. Text to speech also contains support for over 100 languages and multiple voice types. You can find out more about SSML, voice types, and supported languages by visiting the URL in this slide.
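To make the speaking-style idea concrete, here is a minimal sketch of building an SSML document in Python using only the standard library. The helper name build_ssml is hypothetical, and the default voice name and style value are assumptions for illustration; not every voice supports every style, so check the supported voices and styles before relying on them.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice: str = "en-US-AriaNeural",
               style: str = "empathetic", lang: str = "en-US") -> str:
    """Return an SSML document that requests `text` in a given speaking style.

    The voice and style defaults are illustrative assumptions, not a
    statement of what any particular voice supports.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        # express-as is the element that carries the speaking style
        f'<mstts:express-as style={quoteattr(style)}>'
        # escape() protects characters such as & and < in the input text
        f'{escape(text)}'
        '</mstts:express-as></voice></speak>'
    )


if __name__ == "__main__":
    print(build_ssml("I'm sorry to hear that. Let's get this fixed."))
```

The resulting string is what you would hand to the text to speech service in place of plain text. Switching the style argument is all it takes to move from a customer service tone to a news-reading tone, provided the chosen voice offers that style.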

The speech translation features of the Azure speech service let you identify spoken languages, perform speech to text translation, and perform speech to speech translation, with support for multiple languages. Support for audio files or streams is available when using speech translation. When performing a translation, you can choose to run a continuous translation and listen for events, or alternatively you can perform a single short translation. Speech translation capabilities let you create applications that can perform language identification and speech translation in real time.

Let's look at some of the key features of the Azure speech service. The Azure speech service contains many features to help you create innovative solutions using speech technology. Azure speech service gives you the option of creating a custom speech model. You can use this when performing speech to text transcription in unique environments. For example, this might be when creating solutions that must understand nuanced language with specific pronunciations or domain-specific language or vocabulary.

The pronunciation assessment features let you evaluate speech pronunciation. This feature gives speakers feedback on the accuracy of any spoken audio. The pronunciation assessment feature can be used in several use cases. For example, you can use this to create applications that help learners understand a new language. Speaker recognition is also available. This feature uses algorithms to help you identify who is speaking in spoken audio and can be useful when implementing security functionality. It does this by identifying unique voice characteristics.

To implement speaker recognition, each speaker you want to identify must initially record a passphrase from a predefined list of phrases. Voice features and signals are then extracted from the audio to create a unique voice signature. This also contains the passphrase. You can then send subsequent audio to the speaker verification API, which verifies whether the voice signature and passphrase in the audio match any previously enrolled speaker. The Azure speech service gives you multiple ways to interact with the features that the APIs provide. You can use dedicated client SDKs, the REST API, or the CLI (command line interface), and support for Docker containers is also available. A more recent addition is a citizen developer friendly web portal called Speech Studio. We'll take a closer look at some of these now.

The Azure speech service APIs can be accessed using a dedicated client SDK. The client SDK is available in multiple languages such as C#, C++, Go, Java, JavaScript, Objective-C, and Python. Client SDKs can run on multiple platforms such as Windows, Linux, macOS, iOS, or Xamarin. You can use the speech SDK in both real-time and non-real-time scenarios. For example, use the SDK to process audio from your local file system, Azure blob storage, or even audio streams.

One thing to be aware of is that the speech SDK exposes many features from the speech services in Azure, but not all of them. The REST API is another way to access the speech service capabilities in Azure. It contains some functionality that isn't supported in the client SDK libraries. The REST API is typically used for batch transcription and when working with custom speech models. You can use the REST API to copy speech models between Azure subscriptions. This can be useful when you need to share speech models between teams.
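As a rough illustration of the batch transcription use of the REST API mentioned above, the sketch below builds (but does not send) a request using only the Python standard library. The function name is hypothetical, and the endpoint path, API version segment, header name, and body fields reflect my understanding of the batch transcription API; treat them as assumptions and confirm them against the current REST API reference before use.

```python
import json
import urllib.request


def build_batch_transcription_request(region, subscription_key, content_urls,
                                      locale="en-US",
                                      display_name="Bulk transcription"):
    """Build a POST request that asks the service to transcribe several files.

    Assumptions: the URL shape and the "v3.1" version segment may differ
    for your resource; check the REST API reference.
    """
    endpoint = (
        f"https://{region}.api.cognitive.microsoft.com"
        "/speechtotext/v3.1/transcriptions"
    )
    body = {
        # One entry per audio file, for example blob URLs that the service can read
        "contentUrls": list(content_urls),
        "locale": locale,
        "displayName": display_name,
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_batch_transcription_request(
        "westeurope", "<your-key>", ["https://example.com/audio/call-001.wav"]
    )
    print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would submit the job. Building the request separately from sending it keeps the sketch easy to inspect and test without a live subscription.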

Bulk transcription of audio from multiple audio files or URLs is also supported. You can also upload data from Azure storage accounts using a shared access signature with the REST API. This can be useful if your application deposits many audio files into blob storage and you need to perform bulk transcription. Learn more about the REST API capabilities at the following URL. Speech CLI is a command line tool that lets you consume Azure speech services without having to write code. Using the Speech CLI requires minimal setup and you can get started in just a few minutes.

The CLI is a good way to quickly experiment with some of the main capabilities you can find in the Azure speech service. You can use it to run tests with simple batch files or shell scripts. Use the Speech CLI when you have simple requirements or want to quickly try out the Azure speech service without having to write code. One thing to note is that Speech CLI contains simplified versions of the client SDK and REST APIs. If security is a concern or you have specific data governance requirements, you can run some of the Azure speech service APIs in a Docker container.

It's important to point out that if you want to use the Azure speech service in a Docker container, you'll need to submit an online request to Microsoft and have it approved. We've introduced the Azure speech service and explored some of the key features that are available.

Let's explore some of the ways that you can use these APIs. So, what can you do with the Azure speech service? Use the text to speech capabilities to increase accessibility in your software applications. For example, implement functionality that helps people with visual impairments by providing text to speech. Widen the reach of your content by translating audio in real time. This could be the recording of a podcast or Twitter space. Use the neural voice capability to generate human-like speech. This can be useful when you want to add a consistent sounding voice and can also be useful for branding.

Use a mixture of the Azure speech APIs to fast track the development of intelligent and reliable voice assistants. Using the translation features in the Azure speech service can improve the accessibility of your software products. For example, you might only have capacity to handle English queries on your help desks, but receive queries about your SaaS product in multiple languages. You can use the translation features of Azure speech to automatically translate incoming queries for your team.

In the past, I had to implement a chatbot that could handle incoming audio from a telephone call being made to a Twilio phone number. The challenge at the time was that the chatbot could only handle text commands. I used the speech to text API to convert the audio to a text variant and sent this to the chatbot. The chatbot could then return the appropriate text response. Text to speech was then used to convert the chatbot's text response to audio which was sent back to the human on the phone.
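The flow described above (audio in, speech to text, chatbot, text to speech, audio out) can be sketched as a small pipeline. This is not the code from that project: the function name and the three callables are hypothetical placeholders, and in a real system each one would wrap a call to the relevant service.

```python
from typing import Callable


def handle_call_audio(audio: bytes,
                      speech_to_text: Callable[[bytes], str],
                      chatbot: Callable[[str], str],
                      text_to_speech: Callable[[str], bytes]) -> bytes:
    """Turn caller audio into reply audio by chaining three steps."""
    transcript = speech_to_text(audio)   # caller audio -> text
    reply_text = chatbot(transcript)     # text command -> text response
    return text_to_speech(reply_text)    # text response -> audio for the caller


if __name__ == "__main__":
    # Stand-ins so the sketch runs without any service: "audio" is just UTF-8 text.
    def fake_stt(audio: bytes) -> str:
        return audio.decode("utf-8")

    def fake_bot(text: str) -> str:
        return f"You said: {text}"

    def fake_tts(text: str) -> bytes:
        return text.encode("utf-8")

    print(handle_call_audio(b"check my balance", fake_stt, fake_bot, fake_tts))
```

Passing the three steps in as callables keeps the chatbot unaware of audio altogether, which matches the constraint in the story: the bot only ever sees and returns text.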

Use speech services to help you perform an in-depth analysis of audio. For example, transcribe speech to text, then process the transcribed audio using the Azure text analytics API. Perform sentiment analysis, or perform key phrase extraction to help you identify the main topics being discussed in the audio. Automate the identification of sensitive information and use this to flag potential data breaches. Implementing Azure speech services gives users different ways of interacting with data. Speech services can widen the footprint of your application, improve the accessibility of your product, and help you create innovative solutions. Next, let's take a look at the translator service.


About the Author

Jamie Maguire is a Software Architect, Developer, Microsoft MVP (AI), and lifelong tech enthusiast with over 20 years of professional experience.

Jamie is passionate about using AI technologies to help advance systems in a wide range of organizations. 

He has collaborated on many projects including working with Twitter, National Geographic, and the University of Michigan. Jamie is a keen contributor to the technology community and has gained global recognition for articles he has written and software he has built. 

He is a STEM Ambassador and Code Club volunteer, inspiring interest at grassroots level. Jamie shares his story and expertise at speaking events, on social media, and through podcast interviews. 

He has co-authored a book with 16 fellow MVPs demonstrating how Microsoft AI can be used in the real world and regularly publishes material to encourage and promote the use of AI and .NET technologies.