Getting Started with Google Cloud Speech API

Discover the Strengths and Weaknesses of Google Cloud Speech API in this Special Report by Cloud Academy’s Roberto Turrin

Google recently opened its brand new Cloud Speech API – announced at the NEXT event in San Francisco – for a limited preview.
This speech recognition technology has been developed for, and used by, several Google products for some time, such as Google Search, where users can perform voice searches.
The capability to convert voice to text is based on deep neural networks, state-of-the-art machine learning algorithms that have recently proven particularly effective at detecting patterns in video and audio signals. The neural network is updated as Google collects new speech samples, so new terms are learned and the recognition accuracy keeps improving.

Speech-to-Text in the Cloud

Speech-to-text features are used in a multitude of use cases including voice-controlled smart assistants on mobile devices, home automation, audio transcription, and automatic classification of phone calls.
Now that this technology is becoming accessible to developers as a cloud service, any application will be able to integrate speech-to-text recognition. It represents a valuable alternative to the widespread Nuance technology (used by Apple’s Siri and Samsung’s S-Voice, for instance) and challenges other solutions such as IBM Watson speech-to-text and the Microsoft Bing Speech API.

An Outline of the Google Cloud Speech API

The API, still in alpha, exposes a RESTful interface that can be accessed via standard HTTP POST requests.
Batch processing is very straightforward: by providing the audio file to process and describing its format, the API returns the best-matching text together with its recognition confidence. Optionally, it can be asked to return multiple alternatives in addition to the best match, each with its own estimated confidence.
The file to recognise can be provided either by including the audio signal in the HTTP request payload (Base64-encoded) or by giving the URI of the file (currently, only Google Storage can be used). Supported formats are raw audio and FLAC, while MP3 and AAC are not accepted.
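As a sketch, a request referencing a file hosted on Google Storage – instead of embedding the Base64-encoded audio – could look like the following. Note that the 'uri' field name and the gs:// path are assumptions for illustration, based on the alpha request schema, not confirmed documentation:

```python
import json

# Hypothetical request body referencing a FLAC file already uploaded to
# Google Storage, instead of embedding the Base64-encoded audio.
# The 'uri' field name and the gs:// path are illustrative assumptions.
payload = {
    'initialRequest': {
        'encoding': 'FLAC',   # raw audio and FLAC are the accepted formats
        'sampleRate': 16000,
    },
    'audioRequest': {
        'uri': 'gs://my-bucket/my-audio.flac',
    },
}
print(json.dumps(payload, indent=2))
```

The rest of the request (endpoint and POST call) is identical to the inline-content version shown below.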
To improve the accuracy of the system, hint words or sentences can be attached to the request as text. This is particularly useful in the case of noisy audio signals or when uncommon, domain-specific words are present.
Additional, interesting options are the profanity filter – which masks profanities with asterisks – and the possibility of receiving interim results, i.e., partial results marked as non-final.
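A minimal sketch of a request combining these options – hint phrases, the profanity filter, and interim results – might look as follows. The field names ('speechContext', 'phrases', 'profanityFilter', 'interimResults') are assumptions based on the alpha request schema and may differ in the released API:

```python
import json

# Hypothetical request body enabling the optional features described above.
# All option field names below are illustrative assumptions, not confirmed docs.
payload = {
    'initialRequest': {
        'encoding': 'FLAC',
        'sampleRate': 16000,
        # hint phrases to bias recognition towards domain-specific terms
        'speechContext': {
            'phrases': ['effortless English', 'podcast'],
        },
        # mask recognised profanities with asterisks
        'profanityFilter': True,
        # also return partial results marked as non-final
        'interimResults': True,
    },
    'audioRequest': {
        'content': '<Base64-encoded audio>',
    },
}
print(json.dumps(payload, indent=2))
```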
Client libraries are provided for common programming languages and platforms (e.g., Python, Java, iOS, Node.js), for both batch and real-time requests (with asynchronous responses).

My Initial Experience and Code Samples

My quick experience with the API has revealed quite an accurate technology. Although the API does not accept MP3 as input audio, I took the chance to stress the system by experimenting with an MP3 file containing an online English lesson. I converted the first 15 seconds of the file to a 200-KB FLAC file, which I submitted to the Google Speech API with the following Python script:

import requests
import base64
import json

# endpoint and local file (placeholders: set these to the preview endpoint
# received with the limited-preview credentials and to your FLAC file)
url = '<SPEECH_API_ENDPOINT>'
speech_file_path = '<PATH_TO_FLAC_FILE>'

# encode the audio file with Base64 (~200 KB, 15 secs)
with open(speech_file_path, 'rb') as speech:
    speech_content = base64.b64encode(speech.read())

payload = {
    'initialRequest': {
        'encoding': 'FLAC',
        'sampleRate': 16000,
    },
    'audioRequest': {
        'content': speech_content.decode('UTF-8'),
    },
}

# POST request to the Google Speech API
r = requests.post(url, data=json.dumps(payload))

In a few seconds I obtained the response:

{
    "responses": [
        {
            "results": [
                {
                    "alternatives": [
                        {
                            "confidence": 0.90157032,
                            "transcript": "hi this is AJ with another effortless English
podcast for today is Monday and I'm here in San Francisco I'm back in"
                        }
                    ],
                    "isFinal": true
                }
            ]
        }
    ]
}

The text recognition is quite accurate, even though the submitted sound does not respect the best practices for the Speech API (such as using a native FLAC recording).
However, the returned text is completely unstructured: it is a flat list of words. I actually expected some basic natural language processing (NLP) features – such as part-of-speech (PoS) tagging – in order to better describe the detected sentences. Obviously, this can be obtained by post-processing the text with third-party NLP tools and a few lines of code, such as:

from nltk import sent_tokenize, word_tokenize, pos_tag

results = r.json()['responses'][0]['results']
# keep the best-matching transcript of the first final result
transcript = [
    res['alternatives'][0]['transcript']
    for res in results
    if res['isFinal']
][0]
# split the text into sentences
sentences = sent_tokenize(transcript)
for sentence in sentences:
    # split each sentence into tokens
    tokens = word_tokenize(sentence)
    # detect the PoS of each token
    for token, tag in pos_tag(tokens):
        print('%s: %s' % (token, tag))

The script prints the tagged tokens:
hi: NN
this: DT
is: VBZ
AJ: NNP
with: IN
another: DT
effortless: NN
English: NNP

In any case, you can immediately see that the lack of any key punctuation (e.g., question marks, full stops) makes the work of the PoS tagger particularly challenging, as not even sentence boundaries are recognised.
In my opinion, integrating punctuation and PoS tagging directly into the speech-to-text service could take advantage of properties of the speech signal – such as intonation and pauses – which provide useful information for identifying particular elements of the whole speech and of the single sentences.
Finally, the voice recognition still does not provide the capability to identify the source of the voice – in our sample, multiple subjects are speaking. This feature could be used, for instance, to transcribe dialogues.

More Information

While waiting for the API to be publicly released to all developers, if your curiosity has been aroused and you wish to play with the speech-to-text service, just go to https://www.google.com/intl/it/chrome/demos/speech.html for a live demo (or enrol in the limited preview). You will be surprised by the number of supported languages – 80 – far more than the 38 supported by Nuance, the 28 by Microsoft Bing Speech, and the 8 by IBM Watson.

Written by

Sr. Data Scientist, currently leading the data-driven activities at Cloud Academy.
