
Speech

Contents

Introduction & Overview
  1. Introduction (Preview, 1m 36s)

Cognitive Services Features
  4. Vision (10m 49s)
  6. Speech (9m 58s)

Course Summary
  12. Summary (1m 42s)

Overview

Difficulty: Intermediate
Duration: 54m
Students: 308
Rating: 4.3/5

Description

Artificial Intelligence is not a future or distant concept; it is here and now, and being used by many companies of various sizes and industries. The foundational theory for AI was actually developed several decades ago, but recent advancements in big data, computing power, cloud, and algorithms have made it affordable and widespread today. With AI and Machine Learning, computers are now able to start reasoning, understanding, and interacting in ways that were never possible before.

Microsoft has created a predefined set of AI models, called Cognitive Services, that companies of all sizes can use to get started, and best of all, they require little to no knowledge of data science. In this course, you will learn how to infuse your apps—on an architectural level—with the intelligence that Cognitive Services provide. We will cover what Cognitive Services are and how to use the various solutions they provide, including Vision, Speech, Language, Decision, and Web Search.

Learning Objectives

  • Understand the functionality provided by Azure Cognitive Services
  • Learn how to incorporate these services into your apps

Intended Audience

  • People who want to learn more about Azure Cognitive Services

Prerequisites

  • Knowledge of Azure
  • Knowledge of at least one programming language
  • Experience using REST APIs

Transcript

Now it's time for Speech. The Speech category is mostly composed of one API called Speech Services. It does three main things: Speech-to-Text (STT), Text-to-Speech (TTS), and Speech Translation. Historically, there were many Speech APIs, and some of them had the Bing branding, for example, the Bing Speech API. Now, all of these functionalities are merged into Speech Services, and Microsoft recommends migrating from the old APIs to this new one.

This technology is actually quite mature, as it has been used by Microsoft for many years. It is low latency, returning results shortly after the last byte of audio is received, and it is natively integrated with the Bot Framework, so your bots can use voice if preferred.

Language support varies by function. STT supports 40 languages, while TTS works with 45 languages and offers a total of 75 voices to choose from, including a choice of gender. Speech Translation can work with 60 languages, including Klingon, in case you're a Star Trek fan.

There's also a service called Speaker Recognition, which allows you to use speech to authenticate people or recognize them in audio. It works very similarly to face authentication. It has an enrollment phase, where a profile and voice signature are created by recording a specific passphrase. That voice signature can be used later on for authentication or identification.

Let's start talking about Speech-to-Text. As the name implies, this technology allows you to understand speech, and convert it to written text. The possibilities are endless, from creation of transcripts and subtitles, to customer service logging, to dictation. Let's see the characteristics of this service.

Same as with previous services, you can use the subscription key for authentication. However, STT inherits from other Azure services the capability to use tokens, which are generated from the keys. Because tokens expire after 10 minutes, they are considerably more secure than using keys, and they are the recommended access method.
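As a sketch, the token exchange is a simple POST of the subscription key to the region's issueToken endpoint (the region and key below are placeholders, and the actual network call is left as a comment):

```python
def build_token_request(region: str, subscription_key: str):
    """Build the URL and headers for exchanging a subscription key
    for a short-lived (10-minute) access token."""
    url = f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"
    headers = {"Ocp-Apim-Subscription-Key": subscription_key}
    return url, headers

url, headers = build_token_request("westus", "<your-key>")
# The actual exchange would be, e.g.:
#   import requests
#   token = requests.post(url, headers=headers).text
print(url)
```

The returned token is then sent on each request as an `Authorization: Bearer` header instead of the raw key.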

The service can handle profanity, and it does that in three ways: Raw, which shows the profanity without filtering; Masked, which replaces it with asterisks; and Removed, which deletes it.
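To illustrate the three options, here is a local simulation of what the service returns in each mode (this is not the service's own logic, and the word list is a stand-in):

```python
def apply_profanity_mode(text: str, profanity: set, mode: str) -> str:
    """Simulate the Raw, Masked, and Removed profanity options on recognized text."""
    words = text.split()
    if mode == "raw":
        return text  # no filtering at all
    if mode == "masked":
        # replace each flagged word with asterisks of the same length
        return " ".join("*" * len(w) if w.lower() in profanity else w for w in words)
    if mode == "removed":
        # drop flagged words entirely
        return " ".join(w for w in words if w.lower() not in profanity)
    raise ValueError(f"unknown mode: {mode}")

bad_words = {"darn"}  # hypothetical stand-in for a real profanity list
print(apply_profanity_mode("well darn it", bad_words, "masked"))   # well **** it
print(apply_profanity_mode("well darn it", bad_words, "removed"))  # well it
```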

STT has three main recognition modes:

  • Recognize Once, which is ideal for short, single utterances, such as when you're giving commands to a bot.
  • Continuous, which is better suited for longer conversations where you can control the recognition using start and stop functions.
  • Dictation, which interprets spoken descriptions of sentence structure, such as punctuation. For example, the phrase "Do you live in town, question mark" would be interpreted as the text "Do you live in town?" with a question mark at the end.
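A rough local sketch of the kind of substitution Dictation mode performs (the mapping below is illustrative; the service's actual rules are more sophisticated):

```python
# Hypothetical mapping from spoken punctuation phrases to marks
SPOKEN_PUNCTUATION = {
    "question mark": "?",
    "exclamation point": "!",
    "period": ".",
    "comma": ",",
}

def apply_dictation(text: str) -> str:
    """Replace spoken punctuation phrases with punctuation marks,
    absorbing the space that precedes the phrase."""
    for phrase, mark in SPOKEN_PUNCTUATION.items():
        text = text.replace(" " + phrase, mark)
    return text

print(apply_dictation("Do you live in town question mark"))  # Do you live in town?
```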

Also similar to other Cognitive Services, the response will be in JSON. There are two JSON response types in STT. Simple gives just a simple JSON payload with the status (for example, Success), the Display Text itself, the Offset, which is the time before someone started talking, and the Duration of the sentence. Detailed gives all of that plus an NBest array with additional guesses and their corresponding confidence levels. It also includes four forms of each guess:

  • Lexical, the original text that STT understood.
  • ITN (Inverse Text Normalization), which applies formatting rules to the raw lexical text; for example, it might replace the word "five" with the digit 5.
  • MaskedITN, which masks or removes profanity.
  • Display, the final result after normalization and profanity handling.
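For example, a Detailed response can be picked apart like this (the JSON below is a hand-made sample in the documented shape, not real service output):

```python
import json

sample = json.loads("""
{
  "RecognitionStatus": "Success",
  "Offset": 1200000,
  "Duration": 21300000,
  "NBest": [
    {
      "Confidence": 0.96,
      "Lexical": "do you live in town",
      "ITN": "do you live in town",
      "MaskedITN": "do you live in town",
      "Display": "Do you live in town?"
    }
  ]
}
""")

# Offset and Duration are expressed in 100-nanosecond units.
# Pick the guess with the highest confidence from the NBest array.
best = max(sample["NBest"], key=lambda g: g["Confidence"])
print(best["Display"])     # Do you live in town?
print(best["Confidence"])  # 0.96
```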

Okay, now let's talk about a bit of a messy subject with Speech Services: how to connect to the service. What makes things complicated here is that some features are only available through the SDK, whereas others are only available through REST. The documentation covers these details, but I'll highlight the most important aspects here.

There are four ways to connect to STT. The first option is to connect via the Speech-to-Text REST API. This API accepts only WAV and OGG formats; for other formats, it's better to use the SDKs. Also, this REST API is synchronous, which leads to two limitations. First, the audio being sent is limited to 60 seconds, which makes it more suitable for shorter conversations and bot commands. Second, it only returns final results, which means that the response will only arrive after the whole message is sent.
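A sketch of what a request to this short-audio REST endpoint looks like (the region and token are placeholders, and the actual POST of the WAV bytes is left as a comment):

```python
from urllib.parse import urlencode

def build_stt_request(region: str, language: str = "en-US",
                      fmt: str = "detailed", profanity: str = "masked"):
    """Build the URL and headers for the short-audio (max 60 s) STT REST endpoint."""
    base = (f"https://{region}.stt.speech.microsoft.com"
            "/speech/recognition/conversation/cognitiveservices/v1")
    query = urlencode({"language": language, "format": fmt, "profanity": profanity})
    headers = {
        "Authorization": "Bearer <token-from-issueToken>",  # placeholder token
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
    }
    return f"{base}?{query}", headers

url, headers = build_stt_request("westus")
# The actual call would be, e.g.:
#   import requests
#   result = requests.post(url, headers=headers,
#                          data=open("audio.wav", "rb").read()).json()
print(url)
```

Note how the query string carries the format (simple or detailed) and profanity options discussed earlier.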

Then we have the Batch Transcription API. The main difference with this REST endpoint is that it's asynchronous, which makes it ideal for batch processing of audio files, such as audiobook generation or call center logging. Note that there's no SLA for this service, which means jobs might not run immediately, but they tend to complete quickly once they start. The files need to reside in Azure Blob Storage.

Next, we have the Speech SDKs, which allow you to connect using functions in your preferred language. As they are asynchronous, they are better suited for longer conversations, dictation, or streaming audio. They can also return partial results, which allows you to start processing a longer message while it's still being sent.

Finally, if you want to develop solutions using specialized hardware, such as Amazon Alexa or Google Home, you can use the Speech Devices SDK, which is better suited for speech-enabled devices.

Now Text-to-Speech. Text-to-Speech allows you to use one of several Microsoft-provided voices to communicate, instead of using just text. That unlocks a lot of possibilities for your applications, from bots to better accessibility for people with visual impairments. Here are a few characteristics of this function.

TTS only uses tokens for authentication, not subscription keys. As I have mentioned in the STT section, tokens increase the security of your application as they expire after 10 minutes.

There are three kinds of voices in TTS. Standard Voices have high quality and are available in 75 voices across 45 languages. However, the intonation still reflects the fact that this is an AI. Neural Voices provide a far better experience in terms of stress, intonation, and fluidity, therefore sounding more natural and human-like. However, so far they're only available in a few languages: English (with male and female voices), German, Portuguese, Italian, and Chinese. You can also create your own voice models using Custom Speech, which can give even more personalization to your apps.

TTS uses an XML-based format called SSML, Speech Synthesis Markup Language, to define the speech settings, such as voice, language, speed, pitch, volume, pronunciation, pauses or even sentiment. Keep in mind that some of these options are only available for standard voices, whereas others are designed for neural voices.
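As a sketch, assembling a minimal SSML document with a voice name and prosody settings looks like this (the voice name is just an example; the available voices depend on the service's current catalog):

```python
def build_ssml(text: str, voice: str, rate: str = "medium",
               pitch: str = "default") -> str:
    """Assemble a minimal SSML document controlling voice, speed, and pitch."""
    return (
        "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
        "xml:lang='en-US'>"
        f"<voice name='{voice}'>"
        f"<prosody rate='{rate}' pitch='{pitch}'>{text}</prosody>"
        "</voice></speak>"
    )

ssml = build_ssml("Welcome aboard!", "en-US-JennyNeural", rate="slow")
print(ssml)
```

This string would then be sent as the request body to the TTS endpoint, with a Content-Type of application/ssml+xml.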

Speech Services also has a translation component. It leverages the same technology as the Translator Text API, which we'll cover in depth in the Language section, but here are a few notes that are particular to this function. Speech Translation is only available through the SDK, not the REST API. You can return the results as either text or voice, provided, of course, that the language you're translating to is one of the 45 that has a voice available. Same as with STT, you can handle profanity on the translated text, with the masked, raw, and removed options.

Finally, let's talk about custom solutions for Speech Services. There are several ways to customize Speech, and they used to be scattered all over many websites. Now, they are all consolidated in just one, called Speech Studio. Keep in mind that Speech Studio can only use REST, not SDKs.

Speech Studio can personalize speech models in the following ways:

  • You can create Acoustic models for certain environments, such as an airport or the inside of a car; for a specific speaker, perhaps with a strong accent or a speech impairment; or even for specialized recording devices. That allows for more precise speech recognition.
  • You can also create custom Language models, for a specific industry or jargon, for example, the auto or pharma business. Or, perhaps, you want to create a Star Wars model that can recognize words such as Millennium Falcon or Stormtroopers.
  • Since we brought up Star Wars, you can even create custom pronunciation models so that Speech Services knows how to pronounce C3PO or R2-D2.
  • Finally, and this is one of the coolest options in the Speech Studio, you can create your own voice fonts, instead of using the ones provided by Microsoft. Keep in mind that, although relatively simple in nature, this is a laborious task, as it requires at least eight hours of high-quality audio with the corresponding transcriptions for best results.

There are also more customization options available. You can check the full list at speech.microsoft.com.

About the Author

Emilio Melo has been involved in IT projects in over 15 countries, with roles ranging across support, consultancy, teaching, project and department management, and sales—mostly focused on Microsoft software. After 15 years of on-premises experience in infrastructure, data, and collaboration, he became fascinated by Cloud technologies and the incredible transformation potential they bring. His passion outside work is to travel and discover the wonderful things this world has to offer.