Getting Started with Google Cloud Speech API

Discover the Strengths and Weaknesses of Google Cloud Speech API in this Special Report by Cloud Academy’s Roberto Turrin

Google recently opened its brand new Cloud Speech API – announced at the NEXT event in San Francisco – for a limited preview.

This speech recognition technology has been in use for some time in several Google products, such as the Google search engine, which offers voice search.
The capability to convert voice to text is based on deep neural networks, state-of-the-art machine learning algorithms that have recently proved particularly effective for pattern detection in video and audio signals. The neural network is updated as new speech samples are collected by Google, so that new terms are learned and the recognition accuracy keeps increasing.

Speech-to-Text in the Cloud

Speech-to-text features are used in a multitude of use cases including voice-controlled smart assistants on mobile devices, home automation, audio transcription, and automatic classification of phone calls.
Now that such technology will be accessible as a cloud service to developers, it will allow any application to integrate speech-to-text recognition, representing a valuable alternative to the common Nuance technology (used by Apple’s Siri and Samsung’s S-Voice, for instance) and challenging other solutions such as the IBM Watson speech-to-text and the Microsoft Bing Speech API.

An Outline of the Google Cloud Speech API

The API, still in alpha, exposes a RESTful interface that can be accessed via standard HTTP POST requests.
Batch processing is straightforward: given the audio file to process and a description of its format, the API returns the best-matching text together with a confidence score. Optionally, it can return multiple alternatives in addition to the best match, each with its own estimated confidence.
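When several alternatives are returned, picking the most likely one is a matter of comparing confidence values. The following is a minimal sketch, assuming each result carries an `alternatives` list whose entries have `transcript` and `confidence` fields, as in the response shown later in this article. The function name and the sample data are made up for illustration and do not come from the API.

```python
def best_alternative(result):
    """Return the (transcript, confidence) pair with the highest confidence."""
    # Treat a missing confidence as 0.0 so such entries rank last.
    best = max(result['alternatives'],
               key=lambda alt: alt.get('confidence', 0.0))
    return best['transcript'], best.get('confidence', 0.0)


# Made-up sample result, shaped like the response shown later in this article.
sample_result = {
    'alternatives': [
        {'transcript': 'wreck a nice beach', 'confidence': 0.42},
        {'transcript': 'recognize speech', 'confidence': 0.91},
    ],
    'isFinal': True,
}

print(best_alternative(sample_result))
```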

The audio to recognise can be provided either by including it in the HTTP request payload (encoded with Base64) or by giving the URI of the file (currently, only Google Cloud Storage is supported). Supported formats are raw audio and FLAC; MP3 and AAC are not accepted.
In order to improve the accuracy of the system, words or sentences can be attached to the request as text. This is particularly useful in the case of noisy audio signals or when uncommon, domain-specific words are present.
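The two ways of supplying audio, plus the optional hint phrases, could be assembled into a request body roughly as follows. This is only a sketch: the `initialRequest`, `audioRequest`, `encoding`, `sampleRate` and `content` keys are taken from the script later in this article, while the `uri`, `speechContext` and `phrases` keys and the `build_payload` helper are assumptions made for illustration and should be checked against the API documentation.

```python
import base64


def build_payload(sample_rate, audio_bytes=None, gcs_uri=None, phrases=None):
    """Build a request body with either inline audio or a storage URI."""
    # Exactly one audio source must be given.
    if (audio_bytes is None) == (gcs_uri is None):
        raise ValueError('provide exactly one of audio_bytes or gcs_uri')

    if audio_bytes is not None:
        # Inline audio: Base64-encode the bytes, then decode to text for JSON.
        audio = {'content': base64.b64encode(audio_bytes).decode('utf-8')}
    else:
        if not gcs_uri.startswith('gs://'):
            raise ValueError('expected a Google Cloud Storage URI (gs://...)')
        audio = {'uri': gcs_uri}  # ASSUMED field name

    initial = {'encoding': 'FLAC', 'sampleRate': sample_rate}
    if phrases:
        # ASSUMED field names for the hint words or sentences.
        initial['speechContext'] = {'phrases': list(phrases)}

    return {'initialRequest': initial, 'audioRequest': audio}
```

For example, `build_payload(16000, gcs_uri='gs://my-bucket/lesson.flac', phrases=['Cloud Academy'])` would describe a file already stored in a (hypothetical) bucket, with one domain-specific phrase attached as a hint.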

Other interesting options are the profanity filter, which masks profanities with asterisks, and the possibility of receiving interim results, i.e., partial results marked as non-final.
A few clients are provided for common programming languages and platforms (e.g., Python, Java, iOS, Node.js), both for batch and real-time requests (with asynchronous responses).
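When interim results are enabled, a client has to tell partial transcripts apart from final ones. A minimal sketch of that step, assuming each result has an `isFinal` flag and an `alternatives` list as in the response shown later in this article (the helper name and the sample data are invented for illustration):

```python
def split_results(results):
    """Separate final transcripts from interim (non-final) ones."""
    final, interim = [], []
    for result in results:
        # The first alternative is taken as the best match.
        text = result['alternatives'][0]['transcript']
        if result.get('isFinal'):
            final.append(text)
        else:
            interim.append(text)
    return final, interim


# Made-up sample: one partial result followed by a final one.
sample_results = [
    {'alternatives': [{'transcript': 'hi this is'}], 'isFinal': False},
    {'alternatives': [{'transcript': 'hi this is AJ'}], 'isFinal': True},
]

final, interim = split_results(sample_results)
print(final)
print(interim)
```

A real-time client would typically display interim text as it arrives and replace it once the matching final result is received.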

My Initial Experience and Code Samples

My brief experience with the API has revealed quite an accurate technology. Although the API does not accept MP3 as input audio, I took the chance to stress the system by experimenting with an MP3 file containing an online English lesson. I converted the first 15 seconds of the file to a 200-KB FLAC file, which I submitted to the Google Speech API with the following Python script:

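import base64
import json

import requests

# Placeholders: set these to your own FLAC file and to the API endpoint URL
speech_file_path = 'speech.flac'
url = 'YOUR_SPEECH_API_ENDPOINT'

# encode the audio file with Base64 (~200KB, 15 secs)
with open(speech_file_path, 'rb') as speech:
    speech_content = base64.b64encode(speech.read())

payload = {
    'initialRequest': {
        'encoding': 'FLAC',
        'sampleRate': 16000,
    },
    'audioRequest': {
        # b64encode returns bytes; decode to text so it can be serialised as JSON
        'content': speech_content.decode('UTF-8'),
    },
}

# POST request to Google Speech API
r = requests.post(url, data=json.dumps(payload))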

In a few seconds I obtained the response:

{
    "responses": [
        {
            "results": [
                {
                    "alternatives": [
                        {
                            "confidence": 0.90157032,
                            "transcript": "hi this is AJ with another effortless English podcast for today is Monday and I'm here in San Francisco I'm back in"
                        }
                    ],
                    "isFinal": true
                }
            ]
        }
    ]
}

The text recognition is quite accurate, even though the submitted audio does not follow the best practices for the Speech API (such as using a native FLAC format).
However, the returned text is completely unstructured: it is a flat list of words. I actually expected some basic NLP (natural language processing) features, such as PoS (part-of-speech) recognition, in order to better describe the detected sentences. Obviously, this can be obtained by post-processing the text with third-party NLP tools and a few lines of code such as:

from nltk import sent_tokenize, word_tokenize, pos_tag
# NLTK needs its tokenizer and tagger data (installed via nltk.download())
results = r.json()['responses'][0]['results']
# keep the best alternative of the first final result
transcript = [
    resp['alternatives'][0]['transcript']
    for resp in results
    if resp['isFinal']
][0]
# split text into sentences
sentences = sent_tokenize(transcript)
for sentence in sentences:
    # split text into tokens
    tokens = word_tokenize(sentence)
    # detect the PoS of each token
    for tag in pos_tag(tokens):
        print('%s: %s' % (tag[0], tag[1]))

The output begins:

hi: NN
this: DT
is: VBZ
AJ: NNP
with: IN
another: DT
effortless: NN
English: NNP

Anyway, you can immediately see that the lack of any key punctuation (e.g., question marks, full stops, etc.) makes the work of the PoS tagger particularly challenging, as not even sentence boundaries are recognised.
In my opinion, integrating punctuation and PoS tagging directly into the speech-to-text service could take advantage of the properties of the speech signal, such as intonation and gaps, which provide useful information for identifying particular elements of the whole speech and of the single sentences.
Finally, the voice recognition does not yet provide the capability to identify the source of the voice, and in our case multiple subjects are speaking. This feature could be used, for instance, to transcribe dialogues.

More Information

While waiting for the API to be publicly released to all developers, if your curiosity has been aroused and you wish to play with the speech-to-text service, just go to
https://www.google.com/intl/it/chrome/demos/speech.html for a live demo (or enrol for the limited preview). You will be surprised by the number of supported languages: 80, far more than the 38 supported by Nuance, the 28 by Microsoft Bing Speech, and the 8 by IBM Watson.

Written by

Roberto Turrin

Sr. Data Scientist, currently leading the data-driven activities at Cloud Academy.

