How to Deploy Apache Storm on AWS with Storm-Deploy

The growing pressures of dealing with data at massive scale require purpose-built solutions. Meet Apache Storm and Apache Kafka.

In our hyper-connected world, countless sources generate real-time information 24 hours per day. Rich streams of data pour in from logs, Twitter trends, financial transactions, factory floors, click streams, and much more, and developing the ability to properly handle such volumes of high-velocity and time-sensitive data demands special attention. Apache Storm and Kafka can process very high volumes of real-time data in a distributed environment with a fault-tolerant manner.

Who Uses Storm and Kafka – And Why You Should Care

Both Storm and Kafka are top-level Apache projects currently used by various big data and cloud vendors. While Apache Storm offers highly scalable stream processing, Kafka handles messages at scale. In this post, we’ll see how both technologies work seamlessly together to form the bedrock of your real-time data analysis pipeline. You’re going to learn the basics of Apache Storm how to quickly install it in the AWS cloud environment.

A Gentle Introduction to the Apache Storm Project

By design, Apache Storm is very similar to any other distributed computing framework like Hadoop. However, while Hadoop is used for batch or archival data processing, Storm provides a framework for streaming data processing. Here are the components of a Storm cluster:
Apache Storm - nodes

(Storm cluster components)

Nimbus Node

The Nimbus Service runs on the master node (Like Job Tracker in Hadoop). The task of Nimbus is to distribute code around the cluster, assign tasks to servers, and monitor for cluster failures.

ZooKeeper Nodes

Nimbus relies on the Apache ZooKeeper service to monitor message processing tasks. The worker nodes update their task status in Apache ZooKeeper.

Supervisor Nodes

The worker nodes that do the actual processing run a daemon called Supervisor. Supervisor receives and manages worker processes to complete the tasks assigned by Nimbus.

There are a few more Apache Storm-related terms with which we should be familiar before we actually get started:

  • Topology: The workflow or the computation job graph is called topology. Much like MapReduce jobs in Hadoop, Apache Storm uses topologies. Each node in a topology has specific tasks, and the link between nodes draws the graph of how computational jobs flow from source to final output.
  • Tuple. A unit of data.
  • Streams. Unbounded sequences of tuples. Streams are raw information sources that are processed by Spouts and Bolts to generate specific output. For example, a Twitter Stream could be processed by Spouts and Bolts to generate top trending topics.
  • Spouts. A Spout is the source of a stream. In our earlier example, a spout is connected to a Twitter API to capture a stream of tweets. Alternatively, a Spout can be connected to a messaging system to read incoming messages and pass them forward to be processed by Bolts.
  • Bolts. Bolts consume input from Spouts and produce outputs or input for another Bolt. Bolts are the workers in the Storm topology, containing the smallest units of processing logic in a topology. Bolts run functions, filter, aggregate or join data, filter tuples and talk to data stores like databases.
Apache Storm - topology

(Storm topology)

Manually Deploying Apache Storm on AWS

Setting up Apache Storm in AWS (or on any virtual computing platform) should be as easy as downloading and configuring Storm and a ZooKeeper cluster. The Apache Storm documentation provides excellent guidance. In this blog post, however, we’re going to focus on storm-deploy – an easy to use tool that automates the deployment process.

Deploying Apache Storm on AWS using Storm-Deploy

Deploying with storm-deploy is really easy. Storm-deploy is a github project developed by Nathan Martz, the creator of Apache Storm. Storm-deploy automates both provisioning and deployment. In addition, it installs the Ganglia interface for monitoring disk, CPU, and network usage. I will assume that you already have an AWS account with privileges for AWS EC2 and S3.
You will need to install a Java version greater than jdk-1.6 from your workstation. Storm-deploy is built on top of jclouds and pallet. Apache jclouds is an open-source, multi-cloud toolkit for the Java platform that lets you create applications that are portable across clouds. Similarly, Pallet is used to provision and maintain servers on cloud and virtual machine infrastructure, by providing a consistently configured running image across a range of clouds. It is designed for use from the Clojure REPL (Read-Eval-Print Loop), from Clojure code, and from the command line. I simply followed the storm-deploy github project documentation. The commands are universal.
Now let’s see how all this is really done.

Step-1: Generate password-less key pairs:

ssh-keygen -t rsa

Step-2:  Install leningen-2. Leiningen is used for automating Clojure project tasks such as project creation and dependency downloads. All you will need to do here is to download the script:

wget https://raw.github.com/technomancy/leiningen/stable/bin/lein

Step-3: Place the script it on your path, say /usr/local/bin, and make it executable.

chmod +x /usr/local/bin/lein

 Step-4:  Clone storm-deploy using git.

git clone https://github.com/nathanmarz/storm-deploy.git

Step-5: Download all dependencies by running:.

lein deps

Step-6: Create a ~/.pallet/config.clj file in your home directory and configure it with the credentials and details necessary to launch and configure instances on AWS.

(defpallet
  :services
  {
   :default {
             :blobstore-provider "aws-s3"
             :provider "aws-ec2"
             :environment {:user {:username "storm"  ; this must be "storm"
                                  :private-key-path "$YOUR_PRIVATE_KEY_PATH$"
                                  :public-key-path "$YOUR_PUBLIC_KEY_PATH$"}
                           :aws-user-id "$YOUR_USER_ID$"}
             :identity "$YOUR_AWS_ACCESS_KEY$"
             :credential "$YOUR_AWS_ACCESS_KEY_SECRET$"
             :jclouds.regions "$YOUR_AWS_REGION$"
             }
    })

The parameters are as follows:

  • PRIVATE_KEY_PATH: Path to private rsa key, e.g. ~/.ssh/id_rsa.
  • PUBLIC_KEY_PATH: Path to public rsa key, e.g. ~/.ssh/id_rsa.pub. For Linux, you should have a null passphrase on the keys.
  • AWS_ACCOUNT_ID: Can be found on the AWS site, under Security Credentials and Account Identifiers. The ID uses the format XXXX-XXXX-XXXX.
  • AWS_ACCESS_KEY: Your AWS Access Key.
  • AWS_ACCESS_KEY_SECRET: Your AWS Access Secret Key. This can be downloaded from the AWS IAM dashboard.
  • AWS_REGION: The region in which the instances will be launched.

Step-7: There are two very important configurable files. Here the path is for my set up: /opt/chandandata/storm-cluster/conf/clusters.yaml. The first file, clusters.yaml, looks like this:

################################################################################
# CLUSTERS CONFIG FILE
################################################################################
nimbus.image: "us-east-1/ami-d726abbe"         #64-bit ubuntu
nimbus.hardware: "m1.large"
supervisor.count: 2
supervisor.image: "us-east-1/ami-d726abbe"         #64-bit ubuntu on eu-east-1
supervisor.hardware: "m1.large"
#supervisor.spot.price: 1.60
zookeeper.count: 1
zookeeper.image: "us-east-1/ami-d726abbe"         #64-bit ubuntu
zookeeper.hardware: "m1.large"

All of the properties are self-explanatory. We have set the cluster to lunch instances in us-east-1 with an AMI ID:  ami-d726abbe and an instance size of m1.large. (This might be an odd configuration file. It is normally better to use m3 or m4 series instances). You can configure the hardware and images and select a spot pricing maximum – if that’s what you want.
Step-8: The other file, storm.yaml, is where you can set storm-specific configurations.
Step-9:  Launch the cluster:

lein deploy-storm --start --name cluster-name [--branch {branch}] [--commit {commit tag-or-sha1}]
  • The –name parameter names your cluster so that you can attach to it or stop it later. If you omit –name, it will default to “dev”. The name must be in lower-case.
  • The –branch parameter indicates which branch of Storm to install. If you omit –branch, it will install Storm from the master branch.
  • The —commit parameter allows a release tag or commit SHA1 to be passed. If you omit –commit you will get the latest commit from the branch you are using.
  • Example: lein deploy-storm –start –name cp-cluster –branch master –commit 0.9.1
  • This will also install with the latest version:
lein deploy-storm --start --name cp-cluster
  • The deploy sets up Nimbus, the Supervisors, and ZooKeeper launches the Storm UI on port 8080 on Nimbus and launches a DRPC server on port 3772 on Nimbus.
  • In case of errors, log into the instances with ssh storm@<IP_OF_MACHINE>, and check the logs in ~/storm/logs.

This will run the cluster with appropriate settings. Before submitting an actual storm job, you should know about just a few additional tasks that can be performed.

Operations on an Apache Storm Cluster

  • To attach to an Apache Storm cluster, run the following:
lein deploy-storm --attach --name cp-cluster

Attaching to a cluster configures your storm client (which is used to start and stop topologies) to talk to that particular cluster as well as giving your workstation authorization to view the Storm UI on port 8080 on Nimbus. It writes the location of Nimbus in ~/.storm/storm.yaml so that the storm client knows which cluster to talk to, and authorizes your workstation to access the Nimbus daemon’s Thrift port (which is used for submitting topologies), and to access Ganglia on port 80 on Nimbus.

  • To get the IP addresses of the cluster nodes, run the following:
lein deploy-storm --ips --name mycluster
  • To access the ganglia browser UI, go to: http://{nimbus ip}/ganglia/index.php
  • To stop the cluster, run:
lein deploy-storm --stop --name cp-cluster

Submit an Apache Storm Job

You can download Apache Storm here. At the time of this writing, the latest Storm version was 0.9.6. We have attached our workstation to the storm cluster so we are ready to run a Storm job.

  • Go to storm-0.9.6/bin and run:
storm jar PATH_TO_JAR.JAR JOB_CLASSNAME
  • Check the storm UI <NIMBUS_HOST:8080> to check the job status.
  • To kill the job, run:
storm kill TOPOLOGY_NAME

Next Steps: Harnessing the Combined Power of Storm and Kafka

In this post, you’ve been introduced to Apache Storm and how to easily deploy a storm cluster using storm-deploy. But as I mentioned earlier, Storm on its own is not capable of large scale real-time data processing. We need to add Kafka, which acts as the distributed messaging service for storm topology. We will discuss Kafka integration and compare it with Amazon Kinesis in a later post. Stay tuned!

Avatar

Written by

Chandan Patra

Cloud Computing and Big Data professional with 10 years of experience in pre-sales, architecture, design, build and troubleshooting with best engineering practices. Specialities: Cloud Computing - AWS, DevOps(Chef), Hadoop Ecosystem, Storm & Kafka, ELK Stack, NoSQL, Java, Spring, Hibernate, Web Service


Related Posts

Vijayakumar Athithan
Vijayakumar Athithan
— March 27, 2020

What is Cognito in AWS?

Web applications usually allow a valid username and password combination for successful sign in to the application. Modern authentication flows incorporate more approaches to ensure user authentication. When using AWS, this is no exception, thanks to the abilities and features offered b...

Read more
  • AWS
  • AWS Cognito
  • Solutions Architect
Connie Benton
Connie Benton
— March 25, 2020

How To Build a Career with AWS Certifications

From Iaas and PaaS solutions to digital marketing, cloud computing reshapes the world of technology. As the influence of this technology grows, so does investment. Tens of billions of dollars are being spent on cloud computing-related services each year. This influx is continuing to inc...

Read more
  • AWS
  • Certifications
Avatar
Andrew Larkin
— March 20, 2020

The 12 AWS Certifications: Which is Right for You and Your Team?

As companies increasingly shift workloads to the public cloud, cloud computing has moved from a nice-to-have to a core competency in the enterprise. This shift requires a new set of skills to design, deploy, and manage applications in cloud computing. As the market leader and most ma...

Read more
  • AWS
  • AWS Certifications
Alisha Reyes
Alisha Reyes
— March 17, 2020

Cloud Academy’s Blog Digest: How Do AWS Certifications Increase Your Employability, How to Become a Microsoft Certified Azure Data Engineer, and more

With everything going on right now, it's likely that the only thing you've been reading lately is related to the coronavirus pandemic. It's important to stay informed during these times, but it's also good to jump into something that can take your mind off of the current situation for j...

Read more
  • AWS
  • Azure
  • blog digest
  • Certifications
  • Cloud Academy
  • programming
  • Security
Avatar
Cloud Academy Team
— March 13, 2020

Which Certifications Should I Get?

As we mentioned in an earlier post, the old AWS slogan, “Cloud is the new normal” is indeed a reality today. Really, cloud has been the new normal for a while now and getting credentials has become an increasingly effective way to quickly showcase your abilities to recruiters and compan...

Read more
  • AWS
  • Azure
  • Certifications
  • Cloud Computing
  • Google Cloud Platform
Alisha Reyes
Alisha Reyes
— March 7, 2020

New on Cloud Academy: Intro to GitOps; AWS Courses; Java, Python, Amazon Linux 2, Ubuntu, & Docker Playgrounds; and much more

New Lab Playgrounds This month, our Content Team released six new "playground labs." Our playground labs provide a safe and secure sandbox environment for you to explore your own ideas, follow along with Cloud Academy courses, or answer your own questions — all without having to instal...

Read more
  • AWS
  • Azure
  • gitops
  • Google Cloud Platform
  • lab playground
  • programming
Alisha Reyes
Alisha Reyes
— March 6, 2020

New on Cloud Academy: Intro to GitOps; AWS Courses; Java, Python, Amazon Linux 2, Ubuntu, & Docker Playgrounds; and much more

New Lab Playgrounds This month, our Content Team released six new "playground labs." Our playground labs provide a safe and secure sandbox environment for you to explore your own ideas, follow along with Cloud Academy courses, or answer your own questions — all without having to instal...

Read more
  • AWS
  • Azure
  • gitops
  • Google Cloud Platform
  • lab playground
  • programming
Patrick Navarro
Patrick Navarro
— March 4, 2020

AWS Certifications: How Do They Increase Your Employability and Progress Your Career?

AWS certifications are no walk in the park. They’re designed to validate in-depth, specialist knowledge and comprehensive experience, often requiring months of dedicated studying to earn even for those already working with the cloud platform. But the rewards that AWS professionals ca...

Read more
  • AWS
  • AWS certification
  • certification
Avatar
Chandan Patra
— February 21, 2020

Elasticsearch vs. CloudSearch: AWS Cloud Search Choices

Elasticsearch vs. CloudSearch: What's the main difference? Let's compare AWS-based cloud tools: Elasticsearch vs. CloudSearch. While both services use proven technologies, Elasticsearch is more popular, open source, and has a flexible API to use for customization; in comparison, CloudS...

Read more
  • AWS
  • Azure
  • cloudsearch
  • elasticsearch
Avatar
Andrew Larkin
— February 13, 2020

Cloud Academy Content Roadmap Updates

Welcome to our Q1 2020 roadmap. This is the content we plan to build over the next three months, between February 1 - and April 30, 2020. Let's look at some of our roadmap highlights. Atlassian Bamboo for CI/CD We had a lot of requests for practical guides on how to apply DevOps tool...

Read more
  • Artificial Intelligence
  • AWS
  • Azure
  • Docker
  • Google Cloud Platform
  • Kubernetes
  • Machine Learning
Alisha Reyes
Alisha Reyes
— February 7, 2020

New on Cloud Academy: Git Labs, CKA and CKAD Lab Challenges, AWS and Azure Learning Paths, AGILE, and Much More

We just kicked off our first Free Weekend of 2020. This means we've unlocked our Training Library for just 72 hours. Until Sunday at 11:59 pm (PST), you can get unlimited access to our industry-leading learning paths, courses, certification prep exams, and our most popular hands-on labs...

Read more
  • agile
  • AWS
  • Azure
  • Google Cloud Platform
  • Linux
  • OWASP
  • programming
  • red hat
  • scrum
Avatar
Stuart Scott
— February 6, 2020

How to Encrypt an EBS Volume

Keeping data and applications safe in the cloud is one of the most visible challenges facing cloud teams in 2020. Cloud storage services where data resides are frequently a target for hackers, not because the services are inherently weak but because they are often improperly configured....

Read more
  • AWS
  • EBS
  • Encryption