In part one of this series, we described what search engines are, how they solve the problem of accessing content stretched across large websites, and how Amazon CloudSearch provides a solution for a cloud environment. AWS CloudSearch is certainly a powerful and appealing service from Amazon. However, there are more popular players in the search engine market, and Elasticsearch ranks right behind Solr as the most popular search and analytics engine. We’ll explore the battle of the Amazon search providers: Elasticsearch vs CloudSearch.
Both Elasticsearch and CloudSearch are provided by Amazon as AWS services. However, Elasticsearch is an independent product developed by elastic.co, which means you can also set it up yourself by downloading and extracting the tarball, or through a yum/apt-get install.
Amazon CloudSearch, on the other hand, is fully managed by AWS, which, once you choose your instance type, handles the complete provisioning. Users are able to select High-Availability (AZ level), replication, and partitioning options through the AWS Management Console or AWS CLI.
Elasticsearch is easy to upgrade. The process can be as easy as replacing the lib folder of an older version with a new version.
Updates to Amazon CloudSearch are pushed by AWS, relieving users of the responsibility. However, this might mean that new releases arrive later than they otherwise would.
When existing data needs to be searchable, it must first be imported into the search engine. In Elasticsearch, plugins called “rivers” push data into a cluster. Popular river plugins include elasticsearch-river-mongodb, elasticsearch-river-couchdb, and elasticsearch-jdbc. However, for various reasons, river plugins are being deprecated.
Logstash Forwarders are normally used to push logs from application or database servers to Elasticsearch, where the logs can be searched or plotted as graphs in Kibana. Recently, Logstash and its input plugins have also taken center stage as replacements for rivers when pushing data to Elasticsearch. Some of the recently developed input plugins are couchdb_changes, twitter, and rabbitmq.
In Amazon CloudSearch, data and documents (in either XML or JSON format) are pushed in batches. Data can also be pushed to S3, with the data path given to index the documents.
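As a rough illustration of the batch upload described above, the sketch below groups documents into JSON batches in the CloudSearch document batch format (an array of "add" and "delete" operations). The function name build_batches and the sample documents are hypothetical, and the 5 MB limit is the figure quoted later in this article; actually sending the batches to a domain's document endpoint is left out.

```python
import json

MAX_BATCH_BYTES = 5 * 1024 * 1024  # 5 MB per batch, as quoted in this article


def build_batches(operations, max_bytes=MAX_BATCH_BYTES):
    """Pack add/delete operations into JSON array strings of at most max_bytes.

    A single operation larger than max_bytes still gets its own batch;
    this sketch does not try to reject or split it.
    """
    batches, current, size = [], [], 2  # 2 bytes for the enclosing "[" and "]"
    for op in operations:
        encoded = json.dumps(op, separators=(",", ":"))
        extra = len(encoded.encode("utf-8")) + (1 if current else 0)  # +1 for a comma
        if current and size + extra > max_bytes:
            batches.append("[" + ",".join(current) + "]")
            current, size = [], 2
            extra -= 1  # the first item of a new batch needs no comma
        current.append(encoded)
        size += extra
    if current:
        batches.append("[" + ",".join(current) + "]")
    return batches


# Hypothetical sample operations: one document to index, one to remove.
sample_ops = [
    {"type": "add", "id": "doc1", "fields": {"title": "Elasticsearch vs CloudSearch"}},
    {"type": "delete", "id": "doc2"},
]
for batch in build_batches(sample_ops):
    print(len(batch), "bytes:", batch)
```

Each returned string is a complete JSON array, so it can be uploaded as one batch request.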
In Elasticsearch, data is backed up (and restored) using the Snapshot and Restore module. Usually, users are required to define a shared mount path. In the cloud, they can instead opt for Amazon S3, HDFS, or Azure storage. Curator is a tool that acts as a cron job manager that users can set to automate the backup process.
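To make the Snapshot and Restore flow more concrete, here is a minimal sketch that only builds the three REST requests involved (register a shared-filesystem repository, take a snapshot, restore it) without sending them. The host, repository name, snapshot name, and mount path are placeholders, and the helper snapshot_requests is hypothetical; an S3, HDFS, or Azure repository would use a different "type" and settings and needs the matching plugin.

```python
import json


def snapshot_requests(base_url, repo, snapshot, location):
    """Return (method, url, body) tuples for a basic snapshot/restore cycle."""
    register = (
        "PUT",
        f"{base_url}/_snapshot/{repo}",
        # "fs" is the shared file system repository type; location must be
        # a path mounted on every node of the cluster.
        json.dumps({"type": "fs", "settings": {"location": location}}),
    )
    create = (
        "PUT",
        f"{base_url}/_snapshot/{repo}/{snapshot}?wait_for_completion=true",
        None,
    )
    restore = (
        "POST",
        f"{base_url}/_snapshot/{repo}/{snapshot}/_restore",
        None,
    )
    return [register, create, restore]


# Placeholder values, for illustration only.
for method, url, body in snapshot_requests(
    "http://localhost:9200", "my_backup", "snapshot_1", "/mnt/es_backups"
):
    print(method, url, body or "")
```

A scheduler such as Curator or cron would issue the second request on a timetable; the restore request is the manual step mentioned above.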
In Amazon CloudSearch, the service itself takes care of the whole backup process, once again sparing users the bother. Unlike Elasticsearch, where users must manually run the restore activity from backed up indexes, CloudSearch does it automatically.
Elasticsearch provides a plugin called Shield to handle authentication and authorization. Shield also provides features like encryption, role-based access control, IP filtering, and auditing. However, Shield is a licensed product that must be purchased.
You can also integrate your AD server to control access locally.
Amazon CloudSearch provides IAM-based access control.
In Elasticsearch, adding or deleting nodes within a cluster must be done manually. If the cluster instances are upgraded – i.e. vertical scaling – then you’ll need to run through the setup process from scratch. Old data must be backed up and restored to the new cluster. In the case of horizontal scaling, where servers are added or removed from the cluster, cluster rebalancing and resharding are mandatory. These, too, are manual processes. Users need to be very careful during the process.
Amazon CloudSearch, on the other hand, has built-in scaling and upgrade tools. When a server in a CloudSearch service reaches its threshold, it automatically upgrades to the next larger instance type. And when the capacity goes beyond the largest available instance types, the index is partitioned into multiple instances.
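The scale-up-then-partition behavior described above can be pictured with a small sketch. The tier names and capacities below are invented purely for illustration and do not reflect real CloudSearch instance types or limits; plan_capacity is a hypothetical helper, not an AWS API.

```python
import math

# Hypothetical tiers as (name, index capacity in GB). Real limits differ.
TIERS = [("small", 2), ("medium", 8), ("large", 32)]


def plan_capacity(index_gb, tiers=TIERS):
    """Pick the smallest tier that fits; past the largest tier, add partitions."""
    for name, capacity_gb in tiers:
        if index_gb <= capacity_gb:
            return name, 1  # a single instance is enough
    largest_name, largest_capacity = tiers[-1]
    # The index no longer fits on one instance, so split it across several.
    return largest_name, math.ceil(index_gb / largest_capacity)


for size_gb in (1, 5, 100):
    print(size_gb, "GB ->", plan_capacity(size_gb))
```

With these made-up numbers, a 100 GB index would be spread over four partitions of the largest tier.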
In Elasticsearch, there are cluster monitoring tools like Marvel, which allows a user to send RESTful queries to check cluster health. Another product called Watcher provides an alerting mechanism. These tools are all provided by Elasticsearch itself. Users can, of course, also bring their own monitoring tools, like SPM or the New Relic plugin for Elasticsearch, to keep an eye on their clusters.
Amazon CloudSearch is fully integrated with Amazon CloudWatch, which can monitor metrics like SuccessfulRequests, Searchable Documents, Index Utilization, and Partition Count. Like Watcher in Elasticsearch, AWS Simple Notification Service (SNS) can be integrated with CloudSearch for alerting.
As they’re both built for running search engines in the cloud, Elasticsearch and CloudSearch are designed for high availability.
Elasticsearch is built for distributed computing where the cluster grows horizontally. The indexes are split into shards and replication factors provide shard redundancy. Whenever a node fails, the replicated shards are used to replace lost data.
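A simplified sketch of how shards and replicas fit together: a document is routed to a primary shard by hashing its routing value (by default its ID) modulo the number of primary shards, and each primary has a configurable number of replica copies. Elasticsearch uses its own hash function internally; zlib.crc32 here is only a stand-in, and both helper names are hypothetical.

```python
import zlib


def shard_for(doc_id, num_primary_shards):
    """Route a document to a primary shard: hash(routing) % number of primaries."""
    # crc32 is a stand-in hash for illustration, not what Elasticsearch uses.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def total_shard_copies(num_primary_shards, num_replicas):
    """Each primary shard plus its replicas, summed over the whole index."""
    return num_primary_shards * (1 + num_replicas)


for doc_id in ("user-1", "user-2", "user-3"):
    print(doc_id, "-> shard", shard_for(doc_id, 5))
print("copies for 5 primaries, 1 replica:", total_shard_copies(5, 1))
```

Because the shard number depends on the primary shard count, that count is normally fixed when an index is created, while the replica count can be changed later.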
Elasticsearch employs a technique called zen discovery, where all the nodes communicate with each other through an “elected” master. In case the master node fails, another node takes over as master.
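The failover idea can be sketched as a toy election. It assumes the common quorum guidance for the discovery.zen.minimum_master_nodes setting (more than half of the master-eligible nodes), which this article does not cover, and it picks the lowest node ID as the winner, which is a simplification of the real election. The function names are hypothetical.

```python
def quorum(master_eligible_nodes):
    """Smallest majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def elect_master(visible_nodes, master_eligible_nodes):
    """Return the new master's ID, or None if too few nodes are reachable."""
    if len(visible_nodes) < quorum(master_eligible_nodes):
        return None  # no majority, so no election (avoids split brain)
    return min(visible_nodes)  # simplified tie-break: lowest node ID wins


# Three master-eligible nodes; "node-a" was master and has just failed.
print(elect_master(["node-c", "node-b"], 3))
# Only one node can see itself: no quorum, so no master is elected.
print(elect_master(["node-c"], 3))
```

Requiring a majority means two halves of a partitioned cluster cannot both elect their own master.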
A similar architecture is followed in CloudSearch to handle failure and provide HA. CloudSearch also has an optional feature for multi-AZ replication within a single region to provide HA and Availability Zone failover.
In Elasticsearch, searches can target both indexes and types through the search API. The search API also supports faceting and filtering of results.
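As an illustration, the sketch below builds a search request body that combines a full-text match, a filter, and a facet-style count. It assumes a version of Elasticsearch whose bool query accepts a "filter" clause and that exposes facets as terms aggregations ("aggs"); older releases used different syntax. The index, type, and field names are made up, and build_search_body is a hypothetical helper.

```python
import json


def build_search_body(text, field, filter_field, filter_value, facet_field):
    """Build a query DSL body: match + term filter + terms aggregation."""
    return {
        "query": {
            "bool": {
                "must": {"match": {field: text}},
                "filter": {"term": {filter_field: filter_value}},
            }
        },
        # Counts of matching documents per distinct value of facet_field.
        "aggs": {"by_" + facet_field: {"terms": {"field": facet_field}}},
    }


# Hypothetical index "blog" and type "post".
url = "http://localhost:9200/blog/post/_search"
body = build_search_body("cloud search", "title", "status", "published", "author")
print("POST", url)
print(json.dumps(body, indent=2))
```

The printed body could be sent to the search endpoint with any HTTP client; nothing is sent here.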
In CloudSearch, users create a search domain which includes sub-services to upload documents. A search service provides the means to search indexed data.
In Elasticsearch, many built-in libraries are provided for analyzers, tokenizers, and filters for indexing.
Amazon CloudSearch, on the other hand, provides a much simpler configuration service for all indexing operations and relevance ranking.
Amazon CloudSearch supports many SDKs along with RESTful API calls. The most popular SDKs are in Java, Ruby, Python, .Net, PHP, and Node.js.
As Elasticsearch requires manual set up, the true cost of deployment must include infrastructure costs, licensing for all non-open source software tools and the OS, and the Elasticsearch binary. This may require a large operational expenditure to cover skilled Elasticsearch admins and a monitoring team.
Amazon CloudSearch is priced according to the search instance size. Here’s an example:
With Multi-AZ enabled, the cost of redundant search instances will also be added. If an index is partitioned, the cost of each new search instance in each AZ is also added to the cost.
Document batch upload costs are $0.10 per 1,000 Batch Upload Requests (the maximum size for each batch is 5 MB).
Re-indexing is required when a new field is added to an index. The charge for a re-indexing request is $0.98 per GB of data stored in your search domain.
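The two usage charges above lend themselves to a quick worked estimate. This sketch uses only the prices quoted in this article and assumes charges scale proportionally with usage (how partial blocks of 1,000 requests are billed is not stated here); instance-hour and data-transfer costs are left out. The workload figures are hypothetical.

```python
BATCH_PRICE_PER_1000 = 0.10  # USD per 1,000 batch upload requests (from the article)
REINDEX_PRICE_PER_GB = 0.98  # USD per GB stored, per re-indexing request (from the article)


def batch_upload_cost(num_requests):
    """Cost of batch uploads, assuming proportional billing."""
    return num_requests / 1000 * BATCH_PRICE_PER_1000


def reindex_cost(stored_gb, num_reindex_requests=1):
    """Cost of re-indexing a domain that stores stored_gb of data."""
    return stored_gb * REINDEX_PRICE_PER_GB * num_reindex_requests


# Hypothetical month: 50,000 batch uploads and one re-index of a 20 GB domain.
uploads = batch_upload_cost(50_000)
reindex = reindex_cost(20)
print(f"uploads ${uploads:.2f} + re-index ${reindex:.2f} = ${uploads + reindex:.2f}")
```

For this made-up workload the uploads come to about $5.00 and the re-index to about $19.60, or roughly $24.60 in total.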
Inbound data transfers are free between Amazon CloudSearch and other AWS Services. There are charges for outbound data transfers:
Both Elasticsearch and Amazon CloudSearch are built on proven technologies and are the choice of many demanding organizations. Because of its flexibility and active developer community, Elasticsearch is more popular. But Amazon CloudSearch scores when it comes to operational efficiency.
Because of Elasticsearch's popularity, AWS also offers it as a managed service (Amazon Elasticsearch Service) which, in many ways, provides the best of both worlds. Elastic.co also provides Elasticsearch as a cloud service called Found.
What do you think? Was this helpful for determining the finer points of each service? Comments welcome and appreciated.