Getting Started with Google Cloud Speech API

Discover the Strengths and Weaknesses of Google Cloud Speech API in this Special Report by Cloud Academy’s Roberto Turrin

Google recently opened its brand new Cloud Speech API – announced at the NEXT event in San Francisco – for a limited preview.

This speech recognition technology has been under development for some time and is already used by several Google products, such as the Google search engine, which offers voice search.
The capability to convert voice to text is based on deep neural networks, state-of-the-art machine learning algorithms recently demonstrated to be particularly effective for pattern detection in video and audio signals. The neural network is updated as new speech samples are collected by Google, so that new terms are learned and the recognition accuracy keeps on increasing.

Speech-to-Text in the Cloud

Speech-to-text features are used in a multitude of use cases including voice-controlled smart assistants on mobile devices, home automation, audio transcription, and automatic classification of phone calls.
Now that such technology will be accessible as a cloud service to developers, it will allow any application to integrate speech-to-text recognition, representing a valuable alternative to the common Nuance technology (used by Apple’s Siri and Samsung’s S-Voice, for instance) and challenging other solutions such as the IBM Watson speech-to-text and the Microsoft Bing Speech API.

An Outline of the Google Cloud Speech API

The API, still in alpha, exposes a RESTful interface that can be accessed via standard HTTP POST requests.
Batch processing is very straightforward: given the audio file to process and a description of its format, the API returns the best-matching text together with a confidence score. Optionally, it can be asked to return multiple alternatives in addition to the best match, each with its own confidence estimate.

The file to recognise can be provided either by including the audio signal in the HTTP request payload (encoded with Base64) or by giving the URI of the file (currently, only Google Cloud Storage can be used). Supported formats are raw audio and FLAC, while MP3 and AAC are not accepted.
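
As a minimal sketch of the first option, the helper below checks that the audio looks like FLAC before Base64-encoding it, so that an MP3 or AAC file is rejected locally rather than by the API. The function name `encode_flac_for_request` is hypothetical and not part of any Google client; the only format fact it relies on is that a native FLAC stream starts with the four bytes `fLaC`.

```python
import base64

FLAC_MAGIC = b'fLaC'  # a native FLAC stream begins with these four bytes


def encode_flac_for_request(audio_bytes):
    """Return the Base64 text for a FLAC payload, or raise ValueError.

    Hypothetical helper: it only checks the FLAC signature and does not
    validate the rest of the stream or its sample rate.
    """
    if not audio_bytes.startswith(FLAC_MAGIC):
        raise ValueError('not a FLAC stream (MP3 and AAC are not accepted)')
    # b64encode returns bytes; a JSON payload needs text, so decode to str
    return base64.b64encode(audio_bytes).decode('ascii')


if __name__ == '__main__':
    sample = FLAC_MAGIC + b'\x00' * 8  # fake data carrying only the signature
    print(encode_flac_for_request(sample))
```

A FLAC file with extra data prepended (an ID3 tag, for example) would be rejected by this check, which is probably the safer behaviour for a pre-flight test.
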
In order to improve the accuracy of the system, words or sentences can be attached to the request as text. This is particularly useful in the case of noisy audio signals or when uncommon, domain-specific words are present.

Other interesting options are the profanity filter, which masks profanities with asterisks, and the ability to receive interim results, i.e., partial results marked as non-final.
A few clients are provided for common languages and platforms (e.g., Python, Java, iOS, Node.js), both for batch and real-time requests (with asynchronous responses).

My Initial Experience and Code Samples

My quick experience with the API revealed quite an accurate technology. Although the API does not accept MP3 as input audio, I took the chance to stress the system with an MP3 file containing an online English lesson. I converted the first 15 seconds of the file to a 200-KB FLAC file, which I submitted to the Google Speech API with the following Python script:

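import base64
import json

import requests

# NOTE: `speech_file_path` (path to the FLAC file) and `url` (the API
# endpoint, including your credentials) must be defined before this point.

# encoding audio file with Base64 (~200KB, 15 secs)
with open(speech_file_path, 'rb') as speech:
    speech_content = base64.b64encode(speech.read())
payload = {
    'initialRequest': {
        'encoding': 'FLAC',  # must be a string, not a bare name
        'sampleRate': 16000,
    },
    'audioRequest': {
        # b64encode returns bytes; JSON needs text
        'content': speech_content.decode('UTF-8'),
    },
}
# POST request to Google Speech API
r = requests.post(url, data=json.dumps(payload))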
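
The conversion step mentioned before the script is not part of it. One possible way to do it, sketched here under the assumption that the `ffmpeg` command-line tool is installed (the article does not say which tool was actually used), is to trim and convert the MP3 from Python. The function names are hypothetical, and the mono downmix is also an assumption; the 16000 Hz rate matches the `sampleRate` in the payload.

```python
import subprocess


def build_ffmpeg_command(src_path, dst_path, seconds=15, sample_rate=16000):
    """Build an ffmpeg command that trims and converts audio to mono FLAC."""
    return [
        'ffmpeg',
        '-y',                     # overwrite the output file if it exists
        '-i', src_path,           # input file (e.g. an MP3)
        '-t', str(seconds),       # keep only the first `seconds` seconds
        '-ac', '1',               # downmix to one channel (an assumption)
        '-ar', str(sample_rate),  # resample to match 'sampleRate' in the payload
        dst_path,                 # a .flac extension selects FLAC output
    ]


def convert_to_flac(src_path, dst_path, **kwargs):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.check_call(build_ffmpeg_command(src_path, dst_path, **kwargs))


if __name__ == '__main__':
    # Only prints the command; call convert_to_flac() to actually run it.
    print(' '.join(build_ffmpeg_command('lesson.mp3', 'lesson.flac')))
```

With the FLAC file in place, the script above can be run as is.
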

In a few seconds I obtained the response:

{
    "responses": [
        {
            "results": [
                {
                    "alternatives": [
                        {
                            "confidence": 0.90157032,
                            "transcript": "hi this is AJ with another effortless English podcast for today is Monday and I'm here in San Francisco I'm back in"
                        }
                    ],
                    "isFinal": true
                }
            ]
        }
    ]
}
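
When several alternatives are requested, each result carries more than one entry in `alternatives`. The sketch below, which assumes only the response shape shown above, picks the highest-confidence alternative of every final result. The function name `best_transcripts` is hypothetical.

```python
def best_transcripts(response):
    """Return (transcript, confidence) for the best alternative of each final result."""
    picked = []
    for resp in response.get('responses', []):
        for result in resp.get('results', []):
            if not result.get('isFinal'):
                continue  # skip interim (non-final) results
            alternatives = result.get('alternatives', [])
            if not alternatives:
                continue
            # a missing confidence is treated as 0.0 when ranking
            best = max(alternatives, key=lambda alt: alt.get('confidence', 0.0))
            picked.append((best['transcript'], best.get('confidence')))
    return picked


if __name__ == '__main__':
    example = {
        'responses': [{
            'results': [{
                'alternatives': [{
                    'confidence': 0.90157032,
                    'transcript': 'hi this is AJ with another effortless English podcast',
                }],
                'isFinal': True,
            }],
        }],
    }
    for text, confidence in best_transcripts(example):
        print('%.2f  %s' % (confidence, text))
```
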

The text recognition is quite accurate, even though the submitted audio does not follow the best practices for the speech API (such as using a native FLAC format).
However, the returned text is completely unstructured: it is a flat list of words. I actually expected some basic NLP (natural language processing) features, such as PoS (part-of-speech) recognition, in order to better describe the detected sentences. Obviously, this can be obtained by post-processing the text with third-party NLP tools and a few lines of code such as:

# requires NLTK with its tokenizer and tagger data (see nltk.download)
from nltk import sent_tokenize, word_tokenize, pos_tag

# keep the first alternative of each final (non-interim) result
results = r.json()['responses'][0]['results']
transcript = [
    resp['alternatives'][0]['transcript']
    for resp in results
    if resp['isFinal']
][0]
# split text into sentences
sentences = sent_tokenize(transcript)
for sentence in sentences:
    # split the sentence into tokens
    tokens = word_tokenize(sentence)
    # detect the PoS of each token
    for word, tag in pos_tag(tokens):
        print('%s: %s' % (word, tag))

The output begins:

hi: NN
this: DT
is: VBZ
AJ: NNP
with: IN
another: DT
effortless: NN
English: NNP

Anyway, you can immediately see that the lack of any punctuation (e.g., question marks, full stops, etc.) makes the work of the PoS tagger particularly challenging, as not even sentence boundaries are recognised.
In my opinion, integrating punctuation and PoS tagging directly into the speech-to-text service could take advantage of properties of the speech signal, such as intonation and pauses, which provide useful information for identifying particular elements of the whole speech and of single sentences.
Finally, the voice recognition does not yet identify the source of the voice; in our case, multiple subjects are speaking. This feature could be used, for instance, to transcribe dialogues.
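
The sentence-boundary problem described above can be illustrated without any NLP library. The sketch below uses a deliberately crude splitter (a stand-in for a real tokenizer such as NLTK's, not a replacement for it) on the raw transcript and on a variant hand-punctuated for this illustration. The function name `naive_sentences` is hypothetical.

```python
import re


def naive_sentences(text):
    """Split on '.', '!' or '?' followed by whitespace (crude stand-in for a tokenizer)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


if __name__ == '__main__':
    # transcript as returned by the API: no punctuation at all
    raw = ("hi this is AJ with another effortless English podcast "
           "for today is Monday and I'm here in San Francisco")
    # hand-punctuated variant, written only for this comparison
    punctuated = ("Hi, this is AJ with another effortless English podcast. "
                  "Today is Monday and I'm here in San Francisco.")
    print(len(naive_sentences(raw)))         # 1
    print(len(naive_sentences(punctuated)))  # 2
```

Without terminal punctuation the whole transcript collapses into a single "sentence", so the tagger has to work on one long run of tokens.
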

More Information

While waiting for the API to be publicly released to all developers, if your curiosity has been aroused and you wish to play with the speech-to-text service, just go to https://www.google.com/intl/it/chrome/demos/speech.html for a live demo (or enrol for the limited preview). You will be surprised by the number of supported languages: 80, far more than the 38 supported by Nuance, the 28 by Microsoft Bing Speech, and the 8 by IBM Watson.

Written by

Roberto Turrin

Sr. Data Scientist, Head of Technology at Cloud Academy.

