Getting Started with Google Cloud Speech API

Discover the Strengths and Weaknesses of Google Cloud Speech API in this Special Report by Cloud Academy’s Roberto Turrin

Google recently opened its brand new Cloud Speech API – announced at the NEXT event in San Francisco – for a limited preview.

This speech recognition technology has been under development for some time and is already used by several Google products, such as the Google search engine, which offers voice search.
The capability to convert voice to text is based on deep neural networks, state-of-the-art machine learning algorithms that have recently proved particularly effective for pattern detection in video and audio signals. The neural network is updated as Google collects new speech samples, so that new terms are learned and the recognition accuracy keeps increasing.

Speech-to-Text in the Cloud

Speech-to-text features are used in a multitude of use cases including voice-controlled smart assistants on mobile devices, home automation, audio transcription, and automatic classification of phone calls.
Once this technology is accessible to developers as a cloud service, any application will be able to integrate speech-to-text recognition. It represents a valuable alternative to the widespread Nuance technology (used by Apple’s Siri and Samsung’s S-Voice, for instance) and a challenge to other solutions such as IBM Watson speech-to-text and the Microsoft Bing Speech API.

An Outline of the Google Cloud Speech API

The API, still in alpha, exposes a RESTful interface that can be accessed via ordinary HTTP POST requests.
Batch processing is very straightforward: given the audio file to process and a description of its format, the API returns the best-matching text together with a confidence score for the recognition. Optionally, it can return multiple alternatives in addition to the best match, each with its own estimated confidence.

The audio to recognise can be provided either by including the audio signal in the HTTP request payload (encoded with Base64) or by giving the URI of the file (currently, only Google Storage can be used). Raw audio and FLAC are supported, while MP3 and AAC are not accepted.
To improve the accuracy of the system, words or sentences can be attached to the request as text. This is particularly useful for noisy audio signals or when uncommon, domain-specific words are present.
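
The two ways of supplying audio described above can be sketched as a small payload builder. This is only an illustration: the 'initialRequest', 'audioRequest' and 'content' keys are the ones used in the script later in this article, while the 'uri' key, the helper name build_payload and the bucket path are assumptions of mine and are not confirmed by the text.

```python
import base64
import json


def build_payload(audio_bytes=None, gcs_uri=None,
                  encoding='FLAC', sample_rate=16000):
    """Build a request body with either inline audio or a storage URI."""
    # Exactly one of the two audio sources must be given.
    if (audio_bytes is None) == (gcs_uri is None):
        raise ValueError('provide exactly one of audio_bytes or gcs_uri')
    if audio_bytes is not None:
        # Inline audio is Base64-encoded and sent as text.
        audio = {'content': base64.b64encode(audio_bytes).decode('ascii')}
    else:
        # ASSUMPTION: the 'uri' key name is a guess, not taken from the article.
        audio = {'uri': gcs_uri}
    return {
        'initialRequest': {'encoding': encoding, 'sampleRate': sample_rate},
        'audioRequest': audio,
    }


# Hypothetical inputs, for illustration only.
inline = build_payload(audio_bytes=b'fake-flac-bytes')
remote = build_payload(gcs_uri='gs://example-bucket/lesson.flac')
print(json.dumps(remote, indent=2))
```

Keeping the two modes behind one function makes it hard to send both (or neither) by mistake, since the check at the top rejects those calls.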

Additional, interesting options are the profanity filter, which masks profanities with asterisks, and the possibility to receive interim results, i.e., partial results marked as non-final.
A few clients are provided for common programming languages and platforms (e.g., Python, Java, iOS, Node.js), both for batch and real-time requests (with asynchronous responses).

My Initial Experience and Code Samples

My quick experience with the API has revealed quite an accurate technology. Although the API does not accept MP3 as input audio, I took the chance to stress the system by experimenting with an MP3 file containing an online English lesson. I converted the first 15 seconds of the file to a 200 KB FLAC file, which I submitted to the Google Speech API with the following Python script:

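import base64
import json

import requests

# Placeholders: the audio path and the endpoint URL are not given in the text.
speech_file_path = 'speech.flac'
speech_api_url = '<Cloud Speech API endpoint URL>'

# encode the audio file with Base64 (~200 KB, 15 secs)
with open(speech_file_path, 'rb') as speech:
    speech_content = base64.b64encode(speech.read())

payload = {
    'initialRequest': {
        'encoding': 'FLAC',
        'sampleRate': 16000,
    },
    'audioRequest': {
        'content': speech_content.decode('UTF-8'),
    },
}

# POST request to Google Speech API
r = requests.post(speech_api_url, data=json.dumps(payload))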
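
For reference, the MP3-to-FLAC conversion mentioned above could be done along these lines. This is a sketch under assumptions: the article does not say which tool was used, so ffmpeg is my choice, and the file names, the mono downmix and the helper names flac_clip_command and convert are hypothetical. The 16000 Hz sample rate matches the 'sampleRate' value in the script.

```python
import subprocess


def flac_clip_command(src_path, dst_path, seconds=15, sample_rate=16000):
    """Return an ffmpeg argument list that cuts and converts an audio file."""
    return [
        'ffmpeg',
        '-y',                     # overwrite the output file if it exists
        '-i', src_path,           # input file (e.g. an MP3)
        '-t', str(seconds),       # keep only the first N seconds
        '-ac', '1',               # ASSUMPTION: downmix to a single channel
        '-ar', str(sample_rate),  # resample to the rate declared in the payload
        dst_path,                 # the .flac extension selects FLAC output
    ]


def convert(src_path, dst_path):
    """Run the conversion; requires ffmpeg to be installed and on the PATH."""
    subprocess.check_call(flac_clip_command(src_path, dst_path))


# Only build and show the command here; call convert() to actually run it.
command = flac_clip_command('lesson.mp3', 'lesson.flac')
print(' '.join(command))
```

Building the argument list separately from running it keeps the command easy to inspect before any file is touched.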

In a few seconds I obtained the response:

{
    "responses": [{
        "results": [{
            "alternatives": [{
                "confidence": 0.90157032,
                "transcript": "hi this is AJ with another effortless English podcast for today is Monday and I'm here in San Francisco I'm back in"
            }],
            "isFinal": true
        }]
    }]
}

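The response above can be reduced to its best guess per final result with a few lines of Python. This is a minimal sketch that assumes the nesting shown in the response (responses, results, alternatives); the helper name best_transcripts is hypothetical, and the second alternative in the sample is made up purely to show the selection by confidence.

```python
def best_transcripts(response):
    """Return (transcript, confidence) for the top alternative of each final result."""
    best = []
    for item in response.get('responses', []):
        for result in item.get('results', []):
            if not result.get('isFinal'):
                continue  # skip interim (non-final) results
            alternatives = result.get('alternatives', [])
            if not alternatives:
                continue
            # Pick the alternative with the highest confidence score.
            top = max(alternatives, key=lambda alt: alt.get('confidence', 0.0))
            best.append((top['transcript'], top.get('confidence')))
    return best


# Sample shaped like the response above; the 0.5 alternative is invented.
sample = {
    'responses': [{
        'results': [{
            'alternatives': [
                {'confidence': 0.90157032, 'transcript': 'hi this is AJ'},
                {'confidence': 0.5, 'transcript': 'high this is a J'},
            ],
            'isFinal': True,
        }]
    }]
}
print(best_transcripts(sample))
```

With the script above, the same function could be applied to the parsed reply, i.e. best_transcripts(r.json()).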
The text recognition is quite accurate, even though the submitted sound does not respect the best practices for the speech API (such as using a native FLAC format).
However, the returned text is completely unstructured: it is a flat list of words. I actually expected some basic natural language processing (NLP) features, such as part-of-speech (PoS) recognition, in order to better describe the detected sentences. Obviously, this can be obtained by post-processing the text with third-party NLP tools and a few lines of code such as:

# requires the NLTK tokenizer and PoS tagger data to be downloaded
from nltk import sent_tokenize, word_tokenize, pos_tag

results = r.json()['responses'][0]['results']
# keep the top alternative of each final result
transcript = ' '.join(
    resp['alternatives'][0]['transcript']
    for resp in results
    if resp['isFinal']
)

# split text into sentences
sentences = sent_tokenize(transcript)
for sentence in sentences:
    # split text into tokens
    tokens = word_tokenize(sentence)
    # detect the PoS of each token
    for tag in pos_tag(tokens):
        print('%s: %s' % (tag[0], tag[1]))

Part of the output:

hi: NN
this: DT
is: VBZ
with: IN
another: DT
effortless: NN
English: NNP

Anyway, you can immediately note that the lack of any key punctuation (e.g., question marks, full stops, etc.) makes the work of the PoS tagger particularly challenging, as not even sentences are recognised.
In my opinion, integrating punctuation and PoS tagging directly into the speech-to-text service could take advantage of properties of the speech signal, such as intonation and pauses, which provide useful information for identifying particular elements of the whole speech and of individual sentences.
Finally, the voice recognition does not yet provide the capability to identify the source of the voice, and in our case multiple subjects are speaking. This feature could be used, for instance, to transcribe dialogues.

More Information

While waiting for the API to be publicly released to all developers, if your curiosity has been aroused and you wish to play with the speech-to-text service, you can try the live demo (or enrol for the limited preview). You will be surprised by the number of supported languages: 80, far more than the 38 supported by Nuance, the 28 by Microsoft Bing Speech, and the 8 by IBM Watson.
