Google Vision API: Image Analysis as a Service

Build powerful applications that see and understand the content of images with the Google Vision API

The Google Vision API was released last month, on December 2nd, 2015, and it’s still in limited preview. You can request access to the limited preview program here, and you should receive a follow-up email very quickly.
I recently requested access with my personal Google Cloud Platform account in order to understand which types of analysis are supported and to run a few tests of my own.
[Image: Google Vision API face detection on the Lena test image]

Image analysis and feature detection

The Google Vision API provides a RESTful interface that quickly analyzes image content. This interface hides the complexity of continuously evolving machine learning models and image processing algorithms.
These models should keep improving overall system accuracy, especially for object detection, since new concepts will almost certainly be introduced in the system over time.
In more detail, the API lets you annotate images with the following six features.

  1. LABEL_DETECTION: executes Image Content Analysis on the entire image and provides relevant labels (i.e. keywords & categories).
  2. TEXT_DETECTION: performs Optical Character Recognition (OCR) and provides the extracted text, if any.
  3. FACE_DETECTION: detects faces, provides facial key points, main orientation, emotional likelihood, and the like.
  4. LANDMARK_DETECTION: detects geographic landmarks.
  5. LOGO_DETECTION: detects company logos.
  6. SAFE_SEARCH_DETECTION: determines the safe-search properties of the image (i.e. the likelihood that it contains violence or nudity).

You can request all these features at once (i.e. with a single upload), although the API seems to respond slightly faster if you focus on one or two features at a time.

At this time, the API only accepts a series of base64-encoded images as input, but future releases will be integrated with Google Cloud Storage so that API calls won’t require uploading images at all, which should make invocations substantially faster.
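To make the request format concrete, here is a minimal sketch in Python of how an annotation call could be assembled during the preview, assuming an API-key-based setup and the third-party requests library; the endpoint, authentication scheme, and field names reflect my understanding of the preview documentation and may change before general availability.

import base64

import requests  # third-party HTTP client: pip install requests

# Placeholder credentials: preview access is tied to a whitelisted
# Google Cloud Platform project.
API_KEY = "YOUR_API_KEY"
ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def annotate_image(path, features):
    """Upload one base64-encoded image and request the given features."""
    with open(path, "rb") as f:
        content = base64.b64encode(f.read()).decode("utf-8")
    body = {"requests": [{
        "image": {"content": content},
        "features": features,
    }]}
    response = requests.post(ENDPOINT, params={"key": API_KEY}, json=body)
    response.raise_for_status()
    return response.json()["responses"][0]


# Ask for two features at once, with a single upload.
annotations = annotate_image("photo.jpg", [
    {"type": "LABEL_DETECTION", "maxResults": 3},
    {"type": "FACE_DETECTION", "maxResults": 10},
])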

Label Detection – Scenarios and examples

Label detection is definitely the most interesting annotation type. This feature adds semantics to any image or video stream by providing a set of relevant labels (i.e. keywords) for each uploaded image. Labels are selected from thousands of object categories and mapped to the official Google Knowledge Graph, which allows image classification and enhanced semantic analysis, understanding, and reasoning.

Technically, the actual detection is performed on the image as a whole, although an object extraction phase may be executed in advance on the client in order to obtain a set of labels for each single object. In this case, each object should be uploaded as an independent image. However, this may lead to lower-quality results if the resolution isn’t high enough, or if the object’s context is more relevant than the object itself for the application’s purpose.

So what do label annotations look like?

"labelAnnotations": [
    {
        "score": 0.99989069,
        "mid": "/m/0ds99lh",
        "description": "fun"
    },
    {
        "score": 0.99724227,
        "mid": "/m/02jwqh",
        "description": "vacation"
    },
    {
        "score": 0.63748151,
        "mid": "/m/02n6m5",
        "description": "sun tanning"
    }
]

The API returns something very similar to the JSON structure above for each uploaded image. Each label is basically a string (the description field) and comes with a relevance score (0 to 1) and a Knowledge Graph reference.
You can specify how many labels the API should return at request time (3 in this case), and the labels are sorted by relevance. I could have asked for 10 labels and then applied a 0.8 threshold on the relevance score, in order to consider only highly relevant labels in my application (in this case, only two labels would have passed).
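As a minimal sketch, this is how such a threshold could be applied to the parsed response; the dict literal below simply mirrors the JSON fragment above, so the field names come straight from it.

# The parsed response fragment from above, as a Python dict.
annotations = {
    "labelAnnotations": [
        {"score": 0.99989069, "mid": "/m/0ds99lh", "description": "fun"},
        {"score": 0.99724227, "mid": "/m/02jwqh", "description": "vacation"},
        {"score": 0.63748151, "mid": "/m/02n6m5", "description": "sun tanning"},
    ]
}

MIN_SCORE = 0.8  # keep only highly relevant labels
relevant = [label["description"]
            for label in annotations["labelAnnotations"]
            if label["score"] >= MIN_SCORE]
print(relevant)  # ['fun', 'vacation'] -- 'sun tanning' falls below the threshold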

Here is an example of the labels given by the Google Vision API for the corresponding image:
[Image: annotated photo of a desk]
The returned labels are: “desk, room, furniture, conference hall, multimedia, writing.”
The first label – “desk” – had a relevance score of 0.97, while the last one – “writing” – had a score of 0.54.
I have programmatically overlaid the annotations on the input image (with a simple Python script). You can find more Label Detection examples on this public gist.

Personally, I found the detection accurate for every image I uploaded. In some cases, though, no labels were returned at all, and a few labels were misleading even with a relevance score above 0.5.

Text Detection – OCR as a Service

Optical Character Recognition is not a new problem in the field of image analysis, but it often requires high-resolution images, very little perspective distortion and an incredibly precise text extraction algorithm. In my personal experience, the character classification step is actually the easiest one, and there are plenty of techniques and benchmarks in the literature.

In the case of the Google Vision API, everything is encapsulated in a RESTful API that simply returns a string and its bounding box. As I would expect from Google, the API is able to recognize multiple languages, and it returns the detected locale together with the extracted text.
Here is an example of a perfect extraction:
[Image: Google Vision API text detection (OCR) example]
The API response looks very similar to the following JSON structure:

"textAnnotations": [
    {
        "locale": "en",
        "description": "Sometimes\nwhen I Am\nAlone I Google\nMyself\n",
        "boundingPoly": {
            "vertices": [
                {"y": 208, "x": 184},
                {"y": 208, "x": 326},
                {"y": 314, "x": 326},
                {"y": 314, "x": 184}
            ]
        }
    }
]

Only one bounding box is detected, the locale is identified as English, and the extracted text even preserves the original line breaks.
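Here is a minimal sketch of how this annotation could be consumed, reusing the JSON fragment above as a Python dict; note that I’m assuming every vertex carries both coordinates, as in this example.

# The response fragment from above, as a Python dict.
annotation = {
    "locale": "en",
    "description": "Sometimes\nwhen I Am\nAlone I Google\nMyself\n",
    "boundingPoly": {"vertices": [
        {"y": 208, "x": 184}, {"y": 208, "x": 326},
        {"y": 314, "x": 326}, {"y": 314, "x": 184},
    ]},
}

lines = annotation["description"].splitlines()
# ['Sometimes', 'when I Am', 'Alone I Google', 'Myself']
xs = [v["x"] for v in annotation["boundingPoly"]["vertices"]]
ys = [v["y"] for v in annotation["boundingPoly"]["vertices"]]
box = (min(xs), min(ys), max(xs), max(ys))  # (184, 208, 326, 314)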

I have been running a few more examples in which much more text was detected in different areas of the image, but it was all collapsed into a single (large) bounding box, with each piece of text separated by a line break. This doesn’t make it easy to extract useful information, and the Google team is already gathering feedback about it on the official limited preview Google Group (of which I’m proud to be a part).

What about handwritten text and CAPTCHAs?

Apparently, the quality is not optimal for handwritten text, although I believe it’s more than adequate for qualitative analysis or generic tasks, such as document classification.
Here is an example:
[Image: Google Vision API OCR on handwritten text]
With the corresponding extracted text:

water ran down
her hair and clothes it ran
down into the toes Lof her
shoes and out again at the
heels. And het she said
that
Mwas a real princess

As I mentioned, it’s not perfect, but it would definitely help most of us a lot.
On the other hand, CAPTCHA recognition is not as easy. It seems that crowdsourcing is still a better option for now. 😉

Face Detection – Position, orientation, and emotions

Face detection aims at localizing human faces inside an image. It’s a well-known problem that can be categorized as a special case of a general object-class detection problem. You can find some interesting data sets here.

I would like to stress two important points:

  • It is NOT the same as Face Recognition, although the detection/localization task can be thought of as one of the first steps in the process of recognizing someone’s face. Recognition typically involves many more techniques, such as facial landmark extraction, 3D analysis, skin texture analysis, and others.
  • It usually targets human faces only (yes, I have tried primates and dogs with very poor results).

If you ask the Google Vision API to annotate your images with the FACE_DETECTION feature, you will obtain the following:

  • The face position (i.e. bounding boxes);
  • The landmark positions (i.e. eyes, eyebrows, pupils, nose, mouth, lips, ears, chin, etc.), which include more than 30 points;
  • The main face orientation (i.e. roll, pan, and tilt angles);
  • Emotional likelihoods (i.e. joy, sorrow, anger, surprise, etc.), plus some additional information (under-exposure likelihood, blur likelihood, headwear likelihood, etc.).

Here is an example of face detection, where I have programmatically rendered the extracted information on the original image. In particular, I am framing each face in its bounding box, rendering each landmark as a red dot, and highlighting the main face orientation with a green arrow (click for higher resolution).
[Image: Google Vision API face detection example]
As you can see, every face is correctly detected and well localized. The precision is pretty high even with numerous faces in the same picture, and the orientation is also accurate.
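For reference, here is a stripped-down sketch of that overlay logic, using the Pillow imaging library and assuming face annotations shaped like the preview responses (a boundingPoly polygon plus a list of named landmarks with positions); the green orientation arrow is omitted for brevity.

from PIL import Image, ImageDraw  # pip install Pillow


def draw_face_overlay(image_path, face_annotations, out_path):
    """Frame each detected face and mark its landmarks as red dots."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for face in face_annotations:
        # The face bounding box is returned as a polygon of four vertices;
        # coordinates equal to zero may be omitted, hence the .get() default.
        box = [(v.get("x", 0), v.get("y", 0))
               for v in face["boundingPoly"]["vertices"]]
        draw.polygon(box, outline=(0, 255, 0))
        # Each landmark (eyes, nose, mouth, ...) comes with a position.
        for landmark in face.get("landmarks", []):
            x, y = landmark["position"]["x"], landmark["position"]["y"]
            draw.ellipse([x - 2, y - 2, x + 2, y + 2], fill=(255, 0, 0))
    img.save(out_path)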

In this example, just by looking at the data, we might infer that the picture contains five happy people who are most likely facing something or someone around the center of the image. If you complement this analysis with label detection, you obtain “person” and “team” as the most relevant labels, which would give your software a pretty accurate understanding of what is going on.
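As a sketch of that inference, assuming the emotional likelihoods come back as enum-like strings (ranging from VERY_UNLIKELY up to VERY_LIKELY, in a per-face joyLikelihood field):

# Likelihood values are assumed to be enum strings, per the preview responses.
HAPPY = {"LIKELY", "VERY_LIKELY"}


def count_happy_faces(face_annotations):
    """Count the faces whose joy likelihood is at least LIKELY."""
    return sum(1 for face in face_annotations
               if face.get("joyLikelihood") in HAPPY)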

You can find more face detection examples on this public gist.

Alpha Testing – Conclusions

Although still in limited preview, the API is surprisingly accurate and fast: queries take just milliseconds to execute. Processing takes longer with larger images, mostly because of the upload time.
I am looking forward to the Google Cloud Storage integration, and to the many improvements already suggested on the Google Group by the many active alpha testers.

I haven’t focused much on the three remaining features yet (geographic landmark detection, logo detection, and safe search detection), but their usage will probably sound straightforward to most of you. Please feel free to reach out or drop a comment if you have doubts or test suggestions.
You can find all the Python utilities I used for these examples here.

If you are interested in Machine Learning technologies and Google Cloud Platform, you may want to have a look at my previous article about Google Prediction API as well.

Written by

Alex Casalboni

Alex is a Software Engineer with a great passion for music and web technologies. He's experienced in web development and software design, with a particular focus on frontend and UX.
