Build powerful applications that see and understand the content of images with the Google Vision API

The Google Vision API was released last month, on December 2nd 2015, and it’s still in limited preview. You can request access to the limited preview program here, and you should receive a quick follow-up email.

I recently requested access with my personal Google Cloud Platform account in order to understand what types of analysis are supported, and to run a few tests of my own.

Google Vision API Face Detection (Lena)

Image analysis and features detection

The Google Vision API provides a RESTful interface that quickly analyses image content. This interface hides the complexity of continuously evolving machine learning models and image processing algorithms.

These models will keep improving overall system accuracy – especially as far as object detection is concerned – since new concepts will almost certainly be introduced into the system over time.

In more detail, the API lets you annotate images with the following six features.

  1. LABEL_DETECTION: executes Image Content Analysis on the entire image and provides relevant labels (i.e. keywords & categories).
  2. TEXT_DETECTION: performs Optical Character Recognition (OCR) and provides the extracted text, if any.
  3. FACE_DETECTION: detects faces, provides facial key points, main orientation, emotional likelihood, and the like.
  4. LANDMARK_DETECTION: detects geographic landmarks.
  5. LOGO_DETECTION: detects company logos.
  6. SAFE_SEARCH_DETECTION: determines safe-search properties of the image (i.e. the likelihood that it contains violence or nudity).

You can annotate all these features at once (i.e. with a single upload), although the API seems to respond slightly faster if you focus on one or two features at a time.

At this time, the API only accepts a series of base64-encoded images as input, but future releases will be integrated with Google Cloud Storage so that API calls won’t require image uploads at all. This will offer substantially faster invocation.
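As a sketch of what an upload looks like today, here is how you might base64-encode an image and build the request body in Python. The `images:annotate` endpoint name and the field names are taken from the preview documentation and may change; the helper function below is mine, not part of any official client:

```python
import base64
import json

# Endpoint name as in the preview docs; may change before general availability.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"

def build_annotate_request(image_bytes, features, max_results=10):
    """Build the JSON body for a single-image annotate call."""
    return {
        "requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [{"type": f, "maxResults": max_results} for f in features],
        }]
    }

# One upload, two features annotated at once.
body = build_annotate_request(b"...raw JPEG bytes...",
                              ["LABEL_DETECTION", "FACE_DETECTION"])
print(json.dumps(body)[:60])
```

You would then POST this body to the endpoint with your OAuth credentials.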

Label Detection – Scenarios and examples

Label detection is definitely the most interesting annotation type. This feature adds semantics to any image or video stream by providing a set of relevant labels (i.e. keywords) for each uploaded image. Labels are selected from thousands of object categories and mapped to the official Google Knowledge Graph, which enables image classification and enhanced semantic analysis, understanding, and reasoning.

Technically, the actual detection is performed on the image as a whole, although you could run an object-extraction phase on the client beforehand and upload each extracted object as an independent image, obtaining a set of labels per object. However, this may lead to lower-quality results if the resolution isn’t high enough, or if the object’s context is more relevant to your application than the object itself.

So what do label annotations look like?
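As an illustration, a trimmed-down label response from my tests looks roughly like this (field names follow the preview API; the scores and the `mid` Knowledge Graph IDs below are placeholders):

```json
{
  "responses": [{
    "labelAnnotations": [
      {"mid": "/m/0000", "description": "desk",      "score": 0.97},
      {"mid": "/m/0001", "description": "room",      "score": 0.89},
      {"mid": "/m/0002", "description": "furniture", "score": 0.85}
    ]
  }]
}
```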

The API returns something very similar to the JSON structure above for each uploaded image. Each label is basically a string (the description field) and comes with a relevance score (0 to 1) and a Knowledge Graph reference.

You can specify how many labels the API should return at request time (3 in this case), and the labels will be sorted by relevance. I could have asked for 10 labels and then applied a 0.8 threshold to their relevance scores, in order to consider only highly relevant labels in my application (in this case, only two labels would have been used).
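That thresholding step can be sketched in a few lines of Python (assuming the `labelAnnotations` field name from the preview response; the sample data is made up):

```python
def relevant_labels(response, min_score=0.8):
    """Keep only labels whose relevance score passes the threshold."""
    annotations = response.get("labelAnnotations", [])
    return [a["description"] for a in annotations if a["score"] >= min_score]

response = {"labelAnnotations": [
    {"description": "desk", "score": 0.97},
    {"description": "room", "score": 0.89},
    {"description": "writing", "score": 0.54},
]}
print(relevant_labels(response))  # only the highly relevant labels survive
```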

Here is an example of the labels given by the Google Vision API for the corresponding image:


The returned labels are: “desk, room, furniture, conference hall, multimedia, writing.”

The first label – “desk” – had a relevance score of 0.97, while the last one – “writing” – had a score of 0.54.

I have programmatically appended the annotations to the input image (with a simple Python script). You can find more Label Detection examples on this public gist.

Personally, I found the detection accurate on every image I uploaded. In some cases, though, no labels were returned at all, and a few labels were misleading even with a relevance score above 0.5.

Text Detection – OCR as a Service

Optical Character Recognition is not a new problem in the field of image analysis, but it often requires high-resolution images, very little perspective distortion and an incredibly precise text extraction algorithm. In my personal experience, the character classification step is actually the easiest one, and there are plenty of techniques and benchmarks in the literature.

In the case of the Google Vision API, everything is encapsulated in a RESTful API that simply returns a string and its bounding box. As I would have expected from Google, the API is able to recognize multiple languages, and will return the detected locale together with the extracted text.

Here is an example of a perfect extraction:

Google Vision API - Text Detection (OCR)

The API response looks very similar to the following JSON structure:
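Here is an illustrative reconstruction of such a response (field names follow the preview API; the text and coordinates below are invented):

```json
{
  "responses": [{
    "textAnnotations": [{
      "locale": "en",
      "description": "PLEASE\nTURN OFF\nYOUR ENGINE",
      "boundingPoly": {
        "vertices": [
          {"x": 312, "y": 195}, {"x": 721, "y": 195},
          {"x": 721, "y": 612}, {"x": 312, "y": 612}
        ]
      }
    }]
  }]
}
```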

Only one bounding box is detected, in the English language, and the textual content is even split by line breaks.

I have been running a few more examples in which much more text was detected in different areas of the image, but it was all collapsed into a single (big) bounding box, with each piece of text separated by a line break. This doesn’t make it easy to extract useful information, and the Google team is already gathering feedback about this on the official limited preview Google Group (of which I’m proud to be a part).
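Until finer-grained boxes are available, a simple workaround is to split that single blob on line breaks; a minimal sketch (field names as in the preview response; the sample data is made up):

```python
def text_lines(response):
    """Split the single OCR blob into individual, non-empty lines."""
    annotations = response.get("textAnnotations", [])
    if not annotations:
        return []
    # The first annotation holds the whole extracted text, one line per break.
    return [line for line in annotations[0]["description"].split("\n")
            if line.strip()]

response = {"textAnnotations": [{"locale": "en",
                                 "description": "PLEASE\nTURN OFF\nYOUR ENGINE"}]}
print(text_lines(response))
```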

What about handwritten text and CAPTCHAs?

Apparently, the quality is not optimal for handwritten text, although I believe it’s more than adequate for qualitative analysis or generic tasks, such as document classification.

Here is an example:

Google Vision API - OCR handwritten

With the corresponding extracted text:

water ran down
her hair and clothes it ran
down into the toes Lof her
shoes and out again at the
heels. And het she said
Mwas a real princess

As I mentioned, it’s not perfect, but it would definitely help most of us a lot.

On the other hand, CAPTCHA recognition is not as easy. It seems that crowdsourcing is still a better option for now. ;)

Face Detection – Position, orientation, and emotions

Face detection aims at localizing human faces inside an image. It’s a well-known problem that can be categorized as a special case of a general object-class detection problem. You can find some interesting data sets here.

I would like to stress two important points:

  • It is NOT the same as Face Recognition, although the detection/localization task can be thought of as one of the first steps in the process of recognizing someone’s face. This typically involves many more techniques, such as facial landmarks extraction, 3D analysis, skin texture analysis, and others.
  • It usually targets human faces only (yes, I have tried primates and dogs with very poor results).

If you ask the Google Vision API to annotate your images with the FACE_DETECTION feature, you will obtain the following:

  • The face position (i.e. bounding boxes);
  • The landmark positions (i.e. eyes, eyebrows, pupils, nose, mouth, lips, ears, chin, etc.), which include more than 30 points;
  • The main face orientation (i.e. roll, pan, and tilt angles);
  • Emotional likelihoods (i.e. joy, sorrow, anger, surprise, etc.), plus some additional information (underexposure likelihood, blur likelihood, headwear likelihood, etc.).
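To give an idea of how these annotations can be consumed, here is a sketch that condenses one face annotation into the bits needed for rendering (field names follow the preview response; the sample face is made up):

```python
def face_summary(face):
    """Condense a faceAnnotation into box, orientation, and joy likelihood."""
    box = [(v.get("x", 0), v.get("y", 0))
           for v in face["boundingPoly"]["vertices"]]
    orientation = (face["rollAngle"], face["panAngle"], face["tiltAngle"])
    return {"box": box, "orientation": orientation, "joy": face["joyLikelihood"]}

face = {
    "boundingPoly": {"vertices": [{"x": 10, "y": 20}, {"x": 110, "y": 20},
                                  {"x": 110, "y": 140}, {"x": 10, "y": 140}]},
    "rollAngle": -2.5, "panAngle": 12.0, "tiltAngle": 0.8,
    "joyLikelihood": "VERY_LIKELY",
}
print(face_summary(face)["orientation"])
```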

Here is an example of face detection, where I have programmatically rendered the extracted information on the original image. In particular, I am framing each face in its bounding box, rendering each landmark as a red dot, and highlighting the main face orientation with a green arrow (click for higher resolution).

Google Vision API - Face Detection

As you can see, every face is correctly detected and well localized. The precision is pretty high even with numerous faces in the same picture, and the orientation is also accurate.

In this example, just looking at the data, we might infer that the picture contains 5 happy people who are most likely facing something or someone around the center of the image. If you complement this analysis with label detection, you would obtain “person” and “team” as most relevant labels, which would give your software a pretty accurate understanding of what is going on.
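That inference can be sketched in a couple of lines (likelihoods come back as enum strings such as "VERY_LIKELY"; the sample data is made up):

```python
HAPPY = {"LIKELY", "VERY_LIKELY"}

def happy_faces(face_annotations):
    """Count faces whose joy likelihood is at least LIKELY."""
    return sum(1 for f in face_annotations if f.get("joyLikelihood") in HAPPY)

faces = [{"joyLikelihood": "VERY_LIKELY"}] * 4 + [{"joyLikelihood": "LIKELY"},
                                                  {"joyLikelihood": "UNLIKELY"}]
print(happy_faces(faces))
```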

You can find more face detection examples on this public gist.

Alpha Testing – Conclusions

Although still in limited preview, the API is surprisingly accurate and fast: queries take just milliseconds to execute. The processing takes longer with larger images, mostly because of the upload time.

I am looking forward to the Google Cloud Storage integration, and to the many improvements already suggested on the Google Group by the many active alpha testers.

I didn’t focus much on the three remaining features yet – landmark detection, logo detection, and safe search detection – but their usage will probably be straightforward to most of you. Please feel free to reach out or drop a comment if you have doubts or test suggestions.

You can find all the Python utilities I used for these examples here.

If you are interested in Machine Learning technologies and Google Cloud Platform, you may want to have a look at my previous article about Google Prediction API as well.

  • Thanks for this post. For those who haven’t seen the official Google Cloud Vision API video – here it is – Also, if you haven’t already heard, the biggest gathering of Google Cloud developers is taking place in San Francisco on 23-24 March. The event is called GCP NEXT 2016. The two-day event has keynote talks from Diane Greene and Sundar Pichai (CEO), 30+ technical tracks, code labs, and more. Check out the agenda here –

  • Very informative and interesting

  • logical Octopus

    I have played around quite a bit with the Vision API, and it is pretty interesting what it can and can’t identify. The logo detection feature also identifies well-known images, such as paintings like the Mona Lisa or Starry Night. Google does have the technology to identify people by their faces as well (recognition, not just detection) – if you’ve ever used Picasa you might’ve seen that in action – but they don’t seem to have built it into this API.

    • Thank you @logical_octopus:disqus, I’m glad you had fun with this API as well, and I am quite impressed too. Now the API is officially in beta and I’m sure it will improve a lot during 2016.

      As far as face recognition, I don’t think Google Vision API will have this feature any time soon, probably due to privacy and security reasons.

      Google might implement something like face matching though, so that you could programmatically check whether faces extracted from different images correspond to the same person or not. That would be a nice feature, and I can see plenty of use cases for it. What do you think? CC @mvozzo:disqus

  • swetha marla

    This was very helpful in the project I am working on. Thanks a lot! It’s pretty interesting, and this is where my area of interest lies!

  • Raj Trivedi

    Can we group similar faces together like google photos app using the API?

    • Hi @raj_trivedi:disqus, the Google Vision API doesn’t provide this kind of feature yet.

      In order to determine whether two detected faces match or not, you could use the Google Vision API to extract the facial features (i.e. landmark positions) and then implement your own matching algorithm, possibly using an image processing library such as OpenCV.
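As a deliberately naive sketch of that idea – real face matching would need to normalize for scale and pose, and ideally use a trained model – you could compare raw landmark vectors. The `landmarks`/`position` field names follow the preview response; the two toy faces below use 2D points only:

```python
import math

def landmark_vector(face):
    """Flatten landmark positions into a single feature vector."""
    return [coord
            for lm in face["landmarks"]
            for coord in (lm["position"]["x"], lm["position"]["y"])]

def landmark_distance(face_a, face_b):
    """Euclidean distance between two faces' landmark vectors."""
    va, vb = landmark_vector(face_a), landmark_vector(face_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))

a = {"landmarks": [{"position": {"x": 0.0, "y": 0.0}},
                   {"position": {"x": 3.0, "y": 4.0}}]}
b = {"landmarks": [{"position": {"x": 0.0, "y": 0.0}},
                   {"position": {"x": 0.0, "y": 0.0}}]}
print(landmark_distance(a, b))
```

A small distance would suggest (but not prove) that the two faces belong to the same person.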

  • Dominic Carillo

    Can anyone walk me through Google Vision Api? We need it for our On-The-Job Training project. Our project focuses on the Crowd Sentiment. So what do we do next if we already have a video? How will we feed it to Google? PLEASE!

    • Hi @dominiccarillo:disqus,
      unfortunately you can’t directly feed a video to the Google Vision API for now, but the team is already working on it. I will keep you posted as soon as they publish an official ETA.

      In the meantime, you would need to manually extract frames from your video and annotate them as individual images. Note that you can annotate multiple images within the same API request, as a batch. I would also recommend developing some “smart” logic to extract interesting frames from your video – based on some differential technique, maybe? – since uploading every frame would probably be too much data to deal with.
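As a toy sketch of such a differential technique – here each frame is represented by a made-up grayscale histogram, which you would extract yourself (e.g. with OpenCV) – you could keep a frame only when it differs enough from the last kept one:

```python
def select_key_frames(histograms, threshold):
    """Keep a frame index only if its histogram differs enough from the last kept one."""
    kept = []
    last = None
    for i, hist in enumerate(histograms):
        if last is None or sum(abs(a - b) for a, b in zip(hist, last)) > threshold:
            kept.append(i)
            last = hist
    return kept

# Five toy frames: the 2nd and 4th barely differ from their predecessors.
frames = [[10, 0], [10, 1], [0, 10], [0, 10], [10, 10]]
print(select_key_frames(frames, threshold=5))
```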

      • Dominic Carillo

        Hi, I’m now currently working on it especially on Facial detection. I’m using javascript. Is there a way for it to return an image aside from this result?

        • Hi @dominiccarillo:disqus, the API doesn’t return any image at the moment. What kind of image would you like to be returned? If you simply need some results visualization – such as bounding boxes and facial features locations – you might render them directly on the browser.

          Here are some Python utilities to parse the JSON response and render the results on the input image. It’s not JavaScript, but you could implement the very same logic in JS, or use a simple server-side script to call Google Vision and perform the rendering.

          • Dominic Carillo

            But Java-eclipse makes a new picture with the faces enclosed. Here is an example that we had. We are using javascript now because it also identifies the sentiment, and that is really our goal, to identify the sentiments of the people in a picture.

          • You can call the Google Vision API with Java as well (or any other server-side language), and obtain a sentiment likelihood for each detected face.

            The problem of doing this with JS (i.e. in the browser) is that you would need to handle the OAuth flow for each end user, and rendering the results on your input image would be a bit more complicated. If you do it server-side, it would be much easier to cache the rendered images too, without hitting the Vision API all the time.

  • Truong

    I am interested in the TEXT_DETECTION feature of the Google Vision API; it works impressively. But it seems that TEXT_DETECTION only gives accurate results when the text is in English. In my case, I want to use TEXT_DETECTION in a quite narrow context, for example detecting text on ad banners in a specific language (not English). Can I train the machine on my own data collection to get more accurate results? And how would I implement this? Thank you.

    • Hi @disqus_ZkRCuqd3tp:disqus, unfortunately you can’t train the model on your own data and it seems you can’t specify the target locale either.

      Hopefully, Google will allow you to provide more context to your images, such as a desired locale or a specialised vocabulary.

      • Truong

        Thanks so much for your response – hope Google will allow that soon :)

  • mahmood zaman

    I am trying to use the Vision API to tag the objects of interest in an industrial environment, and as of now the Google Vision API is not giving me the required results in all cases. Is there a way I can leverage these APIs to build my own model? Any thoughts on it?

    • Hi Mahmood, that’s a good question.

      Unfortunately, you cannot customize the model or use your own labels. If no labels are returned, you could submit the same image at different resolutions and rotations, but that won’t always guarantee better results.

  • Erlangga

    Nice article! I plan to develop a pest identification mobile app for farming: the user uploads an image of a pest, and the system identifies what kind of pest it is. Can I use the Google Vision API for this project? And what kind of services can I use – labels, or something else?

    • Hi @disqus_Wk5vfARg7B:disqus, that is a nice scenario, but I am afraid Google Vision API won’t be able to help.

      Unfortunately, you cannot customize the set of labels the service is able to recognize, which means you wouldn’t be able to classify which type of pest is in the picture.

      You will need to build your own dataset – say, a few hundred pictures or more, based on how many classes you have – and then train an ad-hoc classifier.

  • Hi Salik,
    Yes, that is possible. Each likelihood is independent and could have a different value.

  • Hi @disqus_1uIHDWeQia:disqus,
    I’m glad you enjoyed the article, and I really like your use case!

    Unfortunately, the “crop hints” feature is not completely related to the “labels detection” one. Indeed, Google Vision provides object detection, but it does not provide object localization. That would be a great API improvement, especially for use cases similar to yours.

    In the meantime, you may combine the labels detection, face detection, and crop hints functionality to provide an MVP to your users. By extracting and displaying object names, crop hints, and face positions all together, you might still capture most of the relevant information from the image.