Google Cloud Vision and Amazon Rekognition offer a broad spectrum of solutions, some of which are comparable in terms of functional details, quality, performance, and costs. This post is a fact-based comparative analysis of Google Cloud Vision vs. Amazon Rekognition and will focus on the technical aspects that differentiate the two services.
Quality will be evaluated objectively with the support of data. Finally, the cost analysis will be modeled on real-world scenarios and based on the publicly available pricing.
The following table recaps the main high-level features and corresponding support on both platforms.
| Feature | Google Cloud Vision | Amazon Rekognition |
| --- | --- | --- |
| Object Detection (labels) | ✓ | ✓ |
| Sentiment Detection (faces) | ✓ | ✓ |
| Object Detection (location) | | |
| Image Search | | only faces |
Given the limited overlap between the available features, we will focus on Object Detection, Face Detection, and Sentiment Detection.
Here, we will discuss how both services manage input data and output results. We will focus on the types of data that can be used as input and the supported ways of providing input data to the APIs.
Both services only accept raster image files (i.e. not based on vector graphics). Videos and animated images are not supported, although Google Cloud Vision will accept animated GIFs and consider only the first frame.
Amazon Rekognition’s support is limited to JPG and PNG formats, while Google Cloud Vision currently supports most of the image formats used on the Web, including GIF, BMP, WebP, Raw, Ico, etc.
Additional SVG support would be useful in some scenarios, but for now, the rasterization process is delegated to the API consumer.
Tests have not revealed any performance or quality issues based on the image format, although lossy formats such as JPEG might show worse results at very low resolutions (i.e. below 1MP).
Both Google Cloud Vision and Amazon Rekognition provide two ways to feed the corresponding API:
The first method is less efficient and more difficult to measure in terms of network performance, since the body of each request will be considerably larger. Despite its inefficiency, the Inlined Image option enables interesting scenarios such as web-based interfaces or browser extensions, where Cloud Storage capabilities might be unavailable or even wasteful.
On the other hand, the Cloud Storage alternative allows API consumers to avoid network inefficiency and reuse uploaded files. Please note the following details related to Cloud Storage:
Neither Vision nor Rekognition accept external images in the form of arbitrary URLs. Such integration would simplify some use cases. It could be added as a third data source, although at a higher cost due to the additional networking required.
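The two supported input modes map directly onto the request bodies. Here is a minimal sketch of both variants for each API (the bucket and object names are hypothetical placeholders, and the image bytes are a stand-in for a real file):

```python
import base64

# Hypothetical image bytes; in practice, read these from a real file.
image_bytes = b"\x89PNG-placeholder"

# Google Cloud Vision (images:annotate) -- inline base64 content ...
vision_inline = {
    "requests": [{
        "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
        "features": [{"type": "LABEL_DETECTION", "maxResults": 10}],
    }]
}

# ... or a reference to an object in Google Cloud Storage.
vision_gcs = {
    "requests": [{
        "image": {"source": {"imageUri": "gs://my-bucket/photo.jpg"}},
        "features": [{"type": "LABEL_DETECTION", "maxResults": 10}],
    }]
}

# Amazon Rekognition (DetectLabels) -- raw image bytes ...
rekognition_inline = {"Image": {"Bytes": image_bytes}, "MaxLabels": 10}

# ... or a reference to an object in S3.
rekognition_s3 = {
    "Image": {"S3Object": {"Bucket": "my-bucket", "Name": "photo.jpg"}},
    "MaxLabels": 10,
}
```

The Rekognition dictionaries mirror the parameters you would pass to the `DetectLabels` action (e.g. via boto3's `detect_labels`), while the Vision dictionaries are the JSON body of a single `images:annotate` POST.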
Processing multiple images is a common use case, sometimes even concurrently. Batch support is useful for large datasets that require tagging or face indexing, and for video processing, where the computation might exploit repetitive patterns in sequential frames. In addition to the obvious computational advantages, such information would also be useful for object tracking scenarios.
For now, only Google Cloud Vision supports batch processing.
Vision’s batch processing support is limited to 8MB per request. Therefore, a relatively large dataset of 1,000 modern images might easily require more than 200 batch requests. Also, the API is always synchronous. This means that once you have invoked the API with N requests, you have to wait until the N responses are generated and sent over the network.
A batch mode with asynchronous invocations would probably allow the size limitations to be relaxed and reduce the number of parallel connections. At the same time, it would shrink the number of API calls required to process large sets of images.
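Staying under the 8MB limit requires splitting a dataset into multiple batch requests. A minimal sketch of such a chunking step (the limit value is Vision's documented per-request cap; everything else here is illustrative):

```python
import base64

MAX_BATCH_BYTES = 8 * 1024 * 1024  # Vision's per-request limit

def chunk_images(images, max_bytes=MAX_BATCH_BYTES):
    """Group base64-encoded images into batches that stay under max_bytes.

    `images` is a list of raw image byte strings. Returns a list of
    batches, each a list of base64 strings ready to inline into one
    images:annotate request. A single image larger than max_bytes still
    forms its own (oversized) batch and would need separate handling.
    """
    batches, current, current_size = [], [], 0
    for raw in images:
        encoded = base64.b64encode(raw).decode("ascii")
        size = len(encoded)
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(encoded)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Note that base64 encoding inflates each image by roughly a third, which is why the chunking is done on the encoded size rather than the raw file size.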
Videos are not natively supported by Google Cloud Vision or Amazon Rekognition.
If we think of a video as a sequence of frames, API consumers would need to choose a suitable frame rate and manually extract images before uploading them to the Cloud Storage service.
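The frame-extraction step can be sketched with a plain ffmpeg invocation; here is a small helper that builds the command for a chosen frame rate (the file names and output pattern are hypothetical):

```python
import subprocess

def ffmpeg_frame_cmd(video_path, fps=1, out_pattern="frame_%04d.jpg"):
    """Build an ffmpeg command that samples `fps` frames per second
    from `video_path` into sequentially numbered JPEG files."""
    return ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", out_pattern]

# Usage (requires ffmpeg on the PATH):
# subprocess.run(ffmpeg_frame_cmd("input.mp4", fps=2), check=True)
```

The resulting JPEGs can then be uploaded to S3 or Cloud Storage and fed to the APIs as described above.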
Within AWS, API consumers may use Amazon Elastic Transcoder to process video files and extract images and thumbnails into S3 for further processing. Such a solution would be fully integrated into the AWS console, as Elastic Transcoder is part of the AWS suite.
On the other hand, GCP offers media solutions through official partners built on Google’s global infrastructure, such as Zencoder, Telestream, and Bitmovin.
Native video support would definitely make things easier, and it would open the door to new video-related functionalities such as object tracking, video search, etc.
Both APIs accept and return JSON data that is passed as the body of HTTP POST requests.
While Google Cloud Vision aggregates every API call in a single HTTP endpoint (images:annotate), Amazon Rekognition defines one HTTP endpoint for each functionality (DetectLabels, DetectFaces, etc.).
Although AWS’s choice might seem more intuitive and user-friendly, the design chosen by Google makes it easy to run more than one analysis of a given image at the same time since you can ask for more than one annotation type within the same HTTP request.
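In practice, Vision's single-endpoint design means one request body can carry several feature types at once. A minimal sketch (the storage URI is hypothetical):

```python
# One images:annotate request asking for labels and faces at the same time.
multi_request = {
    "requests": [{
        "image": {"source": {"imageUri": "gs://my-bucket/photo.jpg"}},
        "features": [
            {"type": "LABEL_DETECTION", "maxResults": 10},
            {"type": "FACE_DETECTION"},
        ],
    }]
}
```

With Rekognition, the same analysis would require two separate calls (`DetectLabels` and `DetectFaces`), each with its own request and response.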
API response sizes are similar on both platforms. Labeling responses with fewer than 10 labels always weigh less than 1KB, while each detected face always weighs less than 10KB.
The Object Detection functionality of Google Cloud Vision and Amazon Rekognition is almost identical, both syntactically and semantically. The API always returns a list of labels that are sorted by the corresponding confidence score. Vision’s responses will also contain a reference to Google’s Knowledge Graph, which can be useful for further processing of synonyms, related concepts, and so on.
Obviously, each service is trained on a different set of labels, and it’s difficult to directly compare the results for a given image.
That’s why we ran our quality and performance analysis on a small, custom dataset of 20 images, organized into four size categories:
Each category contains five images with a random distribution of people, objects, indoor, outdoor, panoramas, cities, etc. The categorization is used to identify quality or performance correlations based on the image size/resolution.
Only the first 10 labels above 70% confidence are reported for each service.
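This filtering rule can be sketched as a small helper that works for either service, assuming the labels have already been normalized to `(name, confidence)` pairs on a 0-100 scale (Vision natively returns a 0-1 `score`, Rekognition a 0-100 `Confidence`):

```python
def top_labels(labels, min_confidence=70.0, limit=10):
    """Keep at most `limit` labels at or above `min_confidence` (0-100),
    highest confidence first. `labels` is a list of (name, confidence)."""
    kept = [label for label in labels if label[1] >= min_confidence]
    kept.sort(key=lambda label: label[1], reverse=True)
    return kept[:limit]
```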
Since the majority of labels turned out to be relevant, wrong or misleading labels have been highlighted in red. Labels in common between the two services have been highlighted in bold. Particularly relevant labels related to specific objects or activity detection have been highlighted in green.
[Image gallery: object detection results for the 20-image dataset]
Based on our sample, Google Cloud Vision seems to detect misleading labels much more rarely, while Amazon Rekognition seems to be better at detecting individual objects such as glasses, hats, humans, or a couch.
Overall, Vision detected 125 labels (6.25 per image, on average), while Rekognition detected 129 labels (6.45 per image, on average). Despite the lower number of labels, 93.6% of Vision’s labels turned out to be relevant (8 errors). Only 89% of Rekognition’s labels were relevant (14 errors).
It’s worth mentioning that Amazon Rekognition often clusters three equivalent labels together (“People”, “Person”, and “Human”) whenever a human being is detected in the image. By collapsing such labels into one, the total number of detected labels is 111 and the relevance rate goes down to 87.3%.
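The collapsing step used above can be sketched as follows; the synonym map is an illustrative assumption covering only the People/Person/Human cluster observed in our results:

```python
# Illustrative synonym clusters; extend as needed for other label families.
SYNONYMS = {"person": "people", "human": "people"}

def collapse(labels):
    """Collapse synonym labels into one entry per cluster, keeping the
    highest confidence. `labels` is a list of (name, confidence); names
    are lowercased for matching."""
    best = {}
    for name, conf in labels:
        key = SYNONYMS.get(name.lower(), name.lower())
        best[key] = max(best.get(key, 0.0), conf)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```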
The following table compares the results for each sub-category.
| Category | Google Cloud Vision | Amazon Rekognition |
| --- | --- | --- |
| … | 35 labels | … |
Despite a lower relevance rate, Amazon Rekognition always managed to detect at least one relevant label for each image. In contrast, Google Cloud Vision failed in two cases by providing either no labels above 70% confidence or misleading labels with high confidence.
Please note that the reported relevance scores should only be interpreted relative to our fairly small dataset and are not meant to be universal precision rates. Increasing the dataset size would make the relevance scores converge toward a more meaningful result, although even this partial data shows a consistent advantage for Google Cloud Vision. On the other hand, Amazon Rekognition seems to be more consistent in the number of detected labels and more focused on detecting individual objects.
One additional note related to rotational invariance: Non-exhaustive tests have shown that Google Cloud Vision tends to perform worse when the images are rotated (up to 90°). On the other hand, the set of labels detected by Amazon Rekognition seems to remain relevant, if not identical to the original results.
There were a few cases where both APIs detected nonexistent faces, or where some real faces were not detected at all, usually due to low-resolution images or partially hidden details.
Given the fairly high precision of both APIs, here we report only misleading cases detected in the original dataset, plus some additional images specifically used for facial analysis.
A few particularly interesting cases are highlighted in bold.
[Image gallery: face detection results]
Both services show detection problems whenever faces are too small (below 100px), partially out of the image, or occluded by hands or other obstacles. While the first two scenarios are intrinsically difficult because of missing information, the third case might improve over time with a more specialized pattern recognition layer.
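Since Rekognition returns face bounding boxes as ratios of the image dimensions, a quick pre-check can flag faces that are likely too small to be detected reliably. A sketch, with hypothetical box values and the ~100px threshold mentioned above:

```python
def face_pixel_size(bounding_box, image_width, image_height):
    """Convert a Rekognition-style bounding box (Width/Height as ratios
    of the image size) into pixel dimensions."""
    w = bounding_box["Width"] * image_width
    h = bounding_box["Height"] * image_height
    return w, h

box = {"Width": 0.05, "Height": 0.08}  # hypothetical detection
w, h = face_pixel_size(box, 1920, 1080)
too_small = min(w, h) < 100  # likely below reliable detection size
```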
Amazon Rekognition seems to have detection issues with black and white images and elderly people, while Google Cloud Vision seems to have more problems with obstacles and background/foreground confusion.
Amazon Rekognition also managed to detect unexpected faces: either faces that did not exist or faces belonging to animals or illustrations. Illustrations and computer-generated images are special cases, and neither API has been trained to handle them; however, they are probably out of scope for most end-user applications. Animals, on the other hand, are not officially supported by either Vision or Rekognition, but Rekognition seems to have more success with primates, whether intentional or not.
As with Object Detection, Amazon Rekognition has shown very good rotational invariance. Although it’s not perfect, Rekognition’s results don’t seem to suffer much even with completely rotated images (90°, 180°, etc.), while Vision stops performing well as you get close to a 90° rotation. Amazon’s deep learning models must have been intentionally trained to achieve rotational invariance, which is a particularly desirable property in many scenarios.
We didn’t focus on other accuracy parameters such as location, direction, special traits, and gender (Vision doesn’t provide such data). Further work and a considerable dataset expansion may provide useful insight about face location and direction accuracy, although the difference of a few pixels is usually negligible for most applications.
Although both services can detect emotions, which are returned as additional attributes by the face detection APIs, they were trained to extract different sets of emotions, in different formats.
Google Cloud Vision can detect only four basic emotions: Joy, Sorrow, Anger, and Surprise. The emotional confidence is given in the form of a categorical estimate with labels such as “Very Unlikely,” “Unlikely,” “Possible,” “Likely,” and “Very Likely.” Such estimates are returned for each detected face and for each possible emotion. If no specific emotion is detected, the “Very Unlikely” label will be used.
Amazon Rekognition can detect a broader set of emotions: Happy, Sad, Angry, Confused, Disgusted, Surprised, and Calm. It also identifies an additional “Unknown” value for very rare cases that we did not encounter during this analysis. The emotional confidence is given in the form of a numerical value between 0 and 100.
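To compare the two formats side by side, the categorical likelihoods can be mapped onto a 0-100 scale. A sketch, where the numeric values assigned to Vision's likelihood enums are our own illustrative assumptions, not published figures:

```python
# Illustrative mapping of Vision's likelihood enums onto a 0-100 scale.
LIKELIHOOD_SCORE = {
    "VERY_UNLIKELY": 5, "UNLIKELY": 25, "POSSIBLE": 50,
    "LIKELY": 75, "VERY_LIKELY": 95,
}

def normalize_vision(emotions):
    """Map {emotion: likelihood_label} to {emotion: score in 0-100}."""
    return {e: LIKELIHOOD_SCORE[label] for e, label in emotions.items()}

def normalize_rekognition(emotions):
    """Rekognition returns [{'Type': ..., 'Confidence': ...}] natively;
    flatten it to {emotion: score in 0-100}."""
    return {e["Type"].lower(): e["Confidence"] for e in emotions}
```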
To evaluate both platforms’ precision in emotion detection, only images with correctly detected faces were selected. In total, the dataset includes 25 images. The last six images contain more than one person, ranging from 2 to 7 faces in the same image.
The following images have been tagged with all of the predicted emotions. Wrong or misleading values have been highlighted in red, while particularly accurate or confident values have been highlighted in green. The cases where Google Cloud Vision did not detect any emotion have been highlighted in bold.
[Image gallery: emotion detection results, tagged with predicted emotions]
The limited emotional range provided by Google Cloud Vision doesn’t make the comparison completely fair. Psychological studies have identified six widely accepted basic emotions: happiness, sadness, fear, anger, surprise, and disgust. The emotional set chosen by Amazon is almost identical to these universal emotions, although Amazon chose to include calmness and confusion instead of fear.
This limitation is even more important when considering the wide range of emotional shades often found within the same image. Deciding whether a face is happy or surprised, angry or confused, sad or calm can be a tough job even for humans. A sentiment detection API should be able to detect such shades and ideally provide the API consumer with multiple emotions and relatively granular confidence values. Amazon Rekognition seems to behave this way.
Even when a clear emotion is hardly detectable, Rekognition will return at least two potential emotions, even with a confidence level below 5%. On the other hand, Vision is often incapable of detecting any emotion at all. This is partially due to the limited emotional range chosen by Google, but it also seems to be an intrinsic training issue. Overall, Amazon Rekognition seems to perform much better than Google Cloud Vision.
The following table summarizes the platforms’ performance for emotion detection.
| Images with … | Google Cloud Vision | Amazon Rekognition |
| --- | --- | --- |
| […] at least 1 emotion | 60% | 100% |
| […] at least 2 emotions | 8% | 92% |
| […] at least 3 emotions | 0% | 76% |
| […] at least 1 correct emotion | 32% | 76% |
| […] no wrong emotions | 36% | 64% |
Neither service requires any upfront charges: you pay based on the number of images processed per month. Since Vision’s API supports multiple annotations per API call, its pricing is based on billable units (e.g. one unit for Object Detection, one unit for Face Detection, etc.). Both services also include a free usage tier for small monthly volumes.
While Google Cloud Vision is more expensive, its pricing model is also easier to understand. Here is a mathematical and visual representation of both pricing models, including their free usage (number of monthly images on the X-axis, USD on the Y-axis).
Although both services offer free usage, it’s worth mentioning that the AWS Free Tier is only valid for the first 12 months for each account. The Free Tier includes up to 5,000 processed images per month, spanning each Rekognition functionality.
On the other hand, Vision’s free usage includes 1,000 units per month for each functionality, forever.
Given the low volumes they allow, both free tiers are meant for prototyping and experimenting with the services, and will not have any relevant impact on real-world scenarios that involve millions of images per month.
The following charts show a graphical representation of the pricing models, including Vision’s free usage and excluding the AWS Free Tier. The X-axis represents the number of processed images per month, while the Y-axis represents the corresponding cost in USD.
The first three charts show the pricing differentiation for Object Detection, although the first two charts also hold for Face Detection. As mentioned previously, Google’s price is always higher unless we consider volumes of up to 3,000 images without the AWS Free Tier.
The situation is slightly different for Face Detection at very high volumes, where the pricing difference is roughly constant. Above 10M images, Google Cloud Vision is $2,300 more expensive, independently of the number of images (i.e. parallel lines).
Also, we should note that for volumes above 20M, Google might be open to building custom solutions, while Rekognition’s pricing will get cheaper for volumes above 100M images.
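Both vendors' models are volume-tiered: you pay one rate per 1,000 images up to a cumulative threshold, then a lower rate, and so on. A sketch of the general computation; the tier boundaries and per-1,000 prices below are illustrative assumptions, not the vendors' published rates:

```python
def tiered_cost(images, tiers):
    """Compute the monthly cost for `images` processed under a tiered
    pricing schedule. `tiers` is a list of (cumulative_upper_bound,
    price_per_1000); use float('inf') for the final tier."""
    cost, previous = 0.0, 0
    for bound, price in tiers:
        billable = min(images, bound) - previous
        if billable <= 0:
            break
        cost += billable / 1000 * price
        previous = bound
    return cost

# Illustrative tier schedule (NOT actual vendor pricing):
example_tiers = [(1_000_000, 1.00), (10_000_000, 0.80), (float("inf"), 0.60)]
```

Plugging real published rates into `example_tiers` reproduces the kind of curves shown in the charts above, including the point where the two lines become parallel at high volumes.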
Finally, the same pricing can be projected into real scenarios and the corresponding budget. Each scenario is meant to be self-contained and to represent a worst-case estimate of the monthly load.
Note: All of the cost projections described below do not include storage costs.
| Scenario | Google Cloud Vision | Amazon Rekognition |
| --- | --- | --- |
| Scenario 1 | $28.50 | $20 (w/o Free Tier), $15 (w/ Free Tier) |
The AWS Free Tier has been considered only for Scenario 1 since it would not impact the overall cost in the other cases ($5 difference).
It’s worth noting that Scenarios 3-4 and 5-6 cost the same within Amazon Rekognition (as they involve the same number of API calls), while the cost is substantially different for Google Cloud Vision. This is because Object Detection is far more expensive than Face Detection at higher volumes.
Overall, the analysis shows that Google’s solution is always more expensive, except for low monthly volumes (below 3,000 images) and without considering the AWS Free Tier of 5,000 images.
Amazon Rekognition is a much younger product, and it landed on the AI market with very competitive pricing and features. Its sentiment analysis capabilities and its rotation-invariant deep learning algorithms seem to outperform Google’s solution.
Rekognition also comes with more advanced features such as Face Comparison and Face Search, but it lacks OCR and landmark/logo detection.
Google Cloud Vision is more mature and comes with more flexible API conventions, multiple image formats, and native batch support. Its Object Detection functionality generates much more relevant labels, and its Face Detection currently seems more mature as well, although it’s not quite perfect yet.
Google Cloud Vision’s biggest issue seems to be rotational invariance, although it might be transparently added to the deep learning model in the future. Similarly, sentiment detection could be improved by enriching the emotional set and providing more granular multi-emotion results.
Both services have a wide margin of improvement regarding batch/video support and more advanced features such as image search, object localization, and object tracking (video). Being able to fetch external images (e.g. by URL) might help speed up API adoption, while improving the quality of their Face Detection features will inspire greater trust from users.
Based on the results illustrated above, let’s consider the main customer use cases and evaluate the most suitable solution, without considering pricing:
We’d like to hear from you. What is your favorite image analysis functionality and what do you hope to see next?